Can AMD survive Bulldozer's disappointing debut?




AMD's long-awaited Bulldozer processor finally hit the market this week, and the Web has been flooded with benchmark results. One thing is clear: this won't kill Intel's Sandy Bridge, as some were hoping. Indeed, in some tests, Bulldozer can't even keep up with its predecessor. The launch of the Phenom in 2007 was similarly underwhelming—it arrived late, broken, and slow—but AMD managed to turn things around with Phenom II to produce a viable competitor to many of Intel's processors.

AMD's future success will depend on the company's ability to make lemonade from the Bulldozer lemons. And its ability to do that will be governed by the Bulldozer architecture: is it fundamentally flawed, or are the performance issues merely teething trouble?

It could go either way. With Phenom, the problems were fortunately not fundamental. The biggest single issue was that the cache used for supporting virtual memory was buggy (a problem known as "the TLB bug"). A BIOS fix to work around the bug and correct the processor's behavior was released, but it exacted a severe performance penalty. This bug was fixed partway through Phenom's life, adding another 10 percent to the processor's performance. In late 2008, Phenom II was introduced, boasting substantial improvements in clock speed and a much larger level 3 cache. The K10 architecture used in both Phenom and Phenom II was essentially sound; AMD just had to work out some relatively minor problems before it could achieve its potential.

Contrast this with Intel's Prescott Pentium 4s. Prescott was substantially modified from its predecessor, Northwood, with a much longer pipeline, larger cache, and new instructions. However, it didn't boast consistent performance gains over Northwood, largely because it never achieved the clock speed targets it was intended to reach. The lack of clock speed meant that the processor could never offset the penalties incurred by the long pipeline. The problems Intel faced with scaling its Pentium 4 designs eventually gave the company no option but to abandon the architecture entirely.

Intel, thanks to a combination of massive manufacturing capacity, deep pockets, and multiple design teams, could weather the storm. With the introduction of the Core 2 Duo line, Prescott was abandoned, and Intel has held the performance crown ever since. AMD's position is a whole lot more precarious. The company lacks Intel's riches, so a failed architecture that it can't monetize and evolve over a period of many years could be fatal.

AMD's fortunes may depend on whether Bulldozer is another K10—or whether it is AMD's Prescott.

A brave new design

The Bulldozer architecture is arguably AMD's first radically new architecture since the introduction of the K7 Athlons way back in 1999. Both K8, which added 64-bit and integrated the memory controller, and K10, which added single-chip quad core, more cache, and a host of changes to improve instructions per cycle (IPC), can trace their lineage back to the K7. Bulldozer is something new.

For all the low-level detail of how Bulldozer works, Dave Kanter's write-up at Real World Tech is your best bet. If Kanter's article is a little too low-level, a higher-level overview can be had at Tech Report. I'm not going to talk about every part of the processor's design here, but a number of key points are worth picking out for discussion due to the way they reflect AMD's vision.


The Bulldozer design has been influenced by AMD's long-term beliefs about the way processors should be built. First, the company believes that workloads will become increasingly multithreaded; processors should be optimized for multithreaded throughput—more concurrent threads—rather than single-threaded performance.

Second, it believes that heavy floating point tasks shouldn't be done on the CPU at all. They should execute on GPUs. This belief underscores AMD's Fusion strategy: the integration of CPU cores and GPU cores into accelerated processing units (APUs) so that mathematical tasks can use the GPU cores.

For Bulldozer specifically, additional design influences came into play. In the words of Chief Architect Mike Butler, AMD's goal was to "hold the line" on IPC (presumably meaning to keep it at around the same level as in Phenom II) but to increase the clock speed, thereby achieving improved single-threaded performance, too. The processor also had to be power efficient.

Taken together, these goals explain just about every aspect of Bulldozer's design.


Bulldozer is based around processing modules, but describing these modules introduces some terminology problems. Like a processor core, each module includes a front-end that fetches and decodes instructions, level 1 and level 2 caches, a branch predictor, out-of-order instruction schedulers, integer and floating point pipelines, and back-ends to retire instructions. Each module can run two threads simultaneously, and here's where the complexity lies. Unlike Intel's Hyper-Threading, where the two threads share all the resources of the core, Bulldozer modules include two dedicated integer pipelines, one per thread, each with its own scheduler and retire unit. For integer-heavy code, the result is that Bulldozer is more like two independent cores than it is one; for floating point-heavy code, it's more like one core with Hyper-Threading.

The first Bulldozer design, codenamed Orochi, includes four modules (and therefore, can handle eight threads at a time), a shared 8 MB level 3 cache, four HyperTransport links (though only one is enabled in Zambezi, the desktop-oriented chip; all four are enabled on Valencia, the server part), a dual-channel memory controller, and other miscellaneous support infrastructure.

Eight concurrent threads provide high throughput for highly multithreaded applications. The belief that floating point-heavy workloads should use the GPU justifies the separate integer/shared floating point design: with the floating point heavy lifting performed by the GPU, it no longer matters that two threads have to share access to the floating point unit.

The implications of AMD's desire to save power and to boost clock speeds are lower level and more widespread. Compared to K10, Bulldozer has fewer per-thread execution resources, longer pipelines, and slower caches, all as a result of these influences. The modular design—in particular, the ability to share the x86 decode units—saves power. x86 is a complicated instruction set, and replicating a decoder for every single core takes a lot of transistors. Bulldozer's decoder is more capable than K10's—it can decode four instructions per cycle instead of K10's three—but those four instructions are now potentially sourced from two threads, meaning that Bulldozer's effective per-thread decode bandwidth can be lower than K10's.


A similar regression can be seen in each integer pipeline. The core elements of the integer pipeline are arithmetic logic units (ALUs), used for performing integer arithmetic, and address generation units (AGUs) that calculate memory addresses for the reads and writes that the processor must perform. A K10 core has three ALUs and three AGUs. Bulldozer discards one ALU and one AGU, having just two of each in each of its integer pipelines. AMD claims that the K10's third AGU was superfluous, only there to make laying out the chip easier (by increasing the commonality between each AGU/ALU pair), but the same is not true of the ALU; K10 could execute up to three integer instructions per thread per cycle. Bulldozer tops out at two.
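The difference matters most for loops that keep several independent integer operations in flight. As a hypothetical micro-example (not an AMD or reviewer benchmark), the loop below carries three independent integer chains per iteration; K10's three ALUs could, in principle, issue all three each cycle, while Bulldozer's two-ALU integer pipeline needs at least two cycles per iteration:

```c
#include <assert.h>

/* Three independent integer accumulations per iteration. K10's three
 * ALUs can, in principle, issue all three each cycle; Bulldozer's
 * two-ALU integer pipeline needs two cycles. Illustrative only --
 * real throughput depends on the whole loop body. */
void three_chains(const int *a, int n, long out[3])
{
    long s0 = 0, s1 = 0, s2 = 0;
    for (int i = 0; i < n; i++) {
        s0 += a[i];        /* chain 0 */
        s1 += a[i] * 2;    /* chain 1, independent of chain 0 */
        s2 ^= a[i];        /* chain 2, independent of both */
    }
    out[0] = s0; out[1] = s1; out[2] = s2;
}
```

In practice, compilers rarely sustain three integer operations per cycle per thread anyway, which is part of AMD's argument that the third ALU was underused.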

The situation for floating point is perhaps the worst of all. Each K10 core had three 128-bit floating point units. These could perform x87 scalar floating point, 128-bit SSE vector floating point, 64-bit MMX vector integer, and 128-bit SSE vector integer operations. Bulldozer has four units in its floating point pipeline. Two are for integer operations (64-bit MMX and 128-bit SSE); the other two are for floating point. In addition to the scalar x87 and vector SSE instructions, the two floating point units can be ganged together, to perform new 256-bit Advanced Vector Extensions (AVX) floating point instructions. Given that this pipeline is now shared between two threads, it's a big reduction in per-thread execution resources.

Not everything has fewer resources; the instruction buffers used for out-of-order execution are larger, meaning that Bulldozer has more instructions eligible for execution. This should allow it to fill its pipelines on a more consistent basis. Bulldozer also supports some potent new instructions. It has AVX, but also features some AMD-specific ones such as a combined FMA ("fused multiply add") instruction that performs a floating point addition and multiplication in a single instruction, which can double floating point throughput for code that can use it. But for code that already dispatched more than two instructions per cycle, and which doesn't use the new instructions, Bulldozer can definitely fall behind its predecessor.
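The FMA pattern is just a multiply feeding an add, and a dot product is the classic beneficiary. In the sketch below (illustrative code, not from AMD), each `a[i] * b[i] + acc` step is exactly what one fused multiply-add instruction computes, so a compiler targeting Bulldozer's FMA support can halve the floating point instruction count of the inner loop:

```c
#include <assert.h>

/* Dot product: each step acc = a[i]*b[i] + acc maps onto a single
 * fused multiply-add on hardware that supports FMA, instead of a
 * separate multiply and add. */
double dot(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = a[i] * b[i] + acc;   /* candidate for one FMA */
    return acc;
}
```

A fused operation also rounds only once, so results can differ slightly from the separate multiply-then-add sequence; compilers therefore only contract this pattern when floating point contraction is permitted.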

The quest for higher clock speeds also caused AMD to lengthen the pipeline. The company has not disclosed the actual length, but it's estimated at around 20 stages compared to the low-to-mid teens for K10 and Sandy Bridge. Longer pipelines are, all other things being equal, easier to run at higher clock speeds, but they also mean that the penalty when a branch is incorrectly predicted is higher. Similarly, the cache and main memory latencies are longer than they are for K10 (four cycles compared to three for level 1 cache; 21 cycles compared to 14 or 15 for level 2; 65 compared to 55 or 59 for level 3; and 195 versus 182 or 157 cycles for main memory). K10's latencies were already worse overall than Sandy Bridge's (which boasts 4, 11, 25, and 148 cycle latencies, from level 1 through to main memory), and Bulldozer makes them worse still.
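The cost of a longer pipeline is felt most in unpredictable branches, which is why branch-heavy code is a worst case for this design. A common software mitigation, sketched below as a hypothetical example, is to replace a data-dependent branch with a conditional select, which compilers can lower to a branch-free conditional move:

```c
#include <assert.h>

/* Summing only the positive elements. The `if` version takes a
 * data-dependent branch per element, which is costly when
 * mispredicted on a ~20-stage pipeline; the select version can
 * compile to a branch-free conditional move. */
long sum_pos_branchy(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        if (a[i] > 0)
            s += a[i];
    return s;
}

long sum_pos_branchless(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] > 0 ? a[i] : 0;  /* select, not a taken branch */
    return s;
}
```

Both compute the same result; the difference only shows up in execution time when the branch predictor guesses badly.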

Again, the news here isn't all bad; the caches are larger than in the older processors, and they offer more bandwidth. For some workloads, this will work in Bulldozer's favor—but it's a trade-off.

Bulldozer's benchmarks

All of these design decisions make their presence felt in different benchmark workloads, making Bulldozer's performance quite a mixed bag. The 4-module, 8-thread 3.6GHz FX-8150 ("Zambezi") generally leads the 3.3GHz 6-core, 6-thread Phenom II X6 1100T ("Thuban") K10—but it generally trails the 3.3GHz 4-core, 4-thread Core i5-2500 ("Sandy Bridge").

Certain tests pick up specific differences. For example, Anandtech's N-Queens test—though of little value as a general benchmark—is extremely branch-heavy and will stress branch predictors and highlight penalties imposed by long pipelines. Bulldozer shows a significant reduction in single-threaded performance relative to the Phenom II. So great is the drop that even when run in multithreaded mode, the eight concurrent threads on Bulldozer can't keep up with the Phenom II's six threads, or even the Intel chip's four.

Moving on to more realistic tests, we see something similar in floating point-intensive multithreaded Cinebench rendering, run by both Anandtech and Tech Report. Bulldozer's single-threaded score trails Phenom II's slightly, and is at a huge disadvantage relative to Intel's processors. The eight threads make up for the discrepancy this time, as Bulldozer pulls ahead of Thuban in the multithreaded benchmark.

These benchmarks in many ways show the two extremes of Bulldozer's performance. In single-threaded workloads, it struggles to keep pace with either its predecessor or Intel's competitor. But when all eight threads run simultaneously, Bulldozer can more than make up for this weaker single-threaded performance. Games tend to lie between the two extremes—they spawn some threads, but rarely as many as eight—with Bulldozer sometimes beating the Phenom II (for example, in Anandtech's Civilization V test or Tech Report's F1 2010 benchmark), other times falling behind (as seen in Anandtech's tests of Dawn of War II and Crysis: Warhead).

These are bad results for AMD. The FX-8150 is more expensive to buy than the Phenom II X6 1100T, yet in typical desktop workloads its performance is no better, and sometimes worse. Nor are the scores altogether surprising: the limited per-thread execution resources, longer pipelines, and slower memory subsystem made inferior performance in these workloads almost inevitable. Game over for AMD?

Dashed expectations

AMD wanted to give Bulldozer higher clock speeds, which led to trade-offs in pipeline length and cache latency. There are shades of the Pentium 4 here—it too had long, narrow pipelines that Intel hoped to clock aggressively—which might seem ominous. But while Pentium 4 faltered due to an inability to reach high enough clock speeds, other processors have been more successful with the concept. IBM's POWER6, in particular, had a narrow pipeline and ran at up to 5GHz. Make the right trade-offs and high clocks and high performance can both be attained.

But Bulldozer doesn't really have high clock speeds, though they are an improvement on K10's. The FX-8150 has a base clock of 3.6GHz, a whole-chip turbo of 3.9GHz, and a peak turbo of 4.2GHz (this last mode allows two modules to run at the elevated speed, provided that the other two are idle). The fastest 6-core Phenom II has a base clock of 3.3GHz with a peak of 3.6GHz, and the fastest 4-core runs at a constant 3.7GHz. Plainly, Bulldozer needs to clock even higher to pull ahead of its predecessor.


And indeed, it was meant to. AMD's original plans were for Bulldozer to have a clock speed about 30 percent higher than K10's, which would give it a base clock of around 4.4GHz. At this speed, the difference would be night and day; even in single-threaded workloads, Bulldozer would match or surpass Phenom II, and its eight simultaneous threads would give it an enormous lead in multithreaded workloads. The speed seems just about attainable; Anandtech achieved 4.6GHz and Tech Report hit 4.4GHz. Tech Report's benchmark scores at 4.4GHz are a whole lot more respectable, making the processor much more desirable—but at a huge cost in power usage. HardOCP, which also managed to overclock its 8150 to 4.6GHz, saw whole system power draw increase by almost 200W as a result of the overclock.

AMD's architecture does scale, but at the moment it uses far too much power when doing so. If AMD can get that power usage under control, it casts the architecture in a whole new light. Intel, of course, tried and generally failed in its quest to ramp clock speeds this way. AMD, however, may fare a little better. Bulldozer is being built using Global Foundries' still quite new 32 nm process. AMD announced last month that supplies of its Llano APU, also built on the 32 nm process, were limited due to manufacturing problems. Global Foundries is producing too many chips that are either defective or that offer inadequate clock speeds.

There are also claims that the way AMD designs its processors is to blame. A 2010 forum post purporting to be from a former AMD employee has garnered plenty of attention after Bulldozer's weak performance was revealed. In times gone by, processors were laid out by hand; the individual circuits that perform arithmetic operations were carefully crafted by engineers, with logic gates and even individual transistors being manually positioned. This is complex, time-consuming, and doesn't scale well, but it has one particular virtue: it can produce fast and efficient layouts.

Automated tools to perform this task do exist. The forum post claims that AMD has largely abandoned manual layouts and switched to using these automated tools. However, the post claims that these tools impose something like a 20 percent size and performance penalty, relative to manual layouts.

This would be a nice and neat explanation for why Bulldozer is so big and why it has failed to meet clock speed goals. It's just not clear that it's true. It's certainly likely that AMD is using automated tooling to some extent—for two billion transistors it's hard to avoid—but the same is true of Intel. Sandy Bridge, at about 900 million transistors, may be much smaller than Bulldozer, but it still holds an enormous quantity of transistors, too many for them to all be laid out by hand. For components such as Bulldozer's caches—large, regular, repeating structures that are essentially copy-and-pasted into existence—it's implausible to suggest that AMD didn't use manual layouts. The gains to be had from such work are too big. Dependence on automated layout tools may be a contributing factor to Bulldozer's performance regressions, but it cannot be the whole story.

The software factor

The benchmarks show that Bulldozer's ability to perform well has a substantial dependence on the workload. Some of this is to be expected: increasing single-threaded performance makes everything go faster, whereas adding more concurrent threads only makes multithreaded software go faster. A design that forfeits some single-threaded performance in favor of multithreaded performance is always going to suffer in workloads that don't exploit the design's parallel processing capabilities.

To this well-known factor, Bulldozer adds some additional complexities. Two threads running within the same module contend for certain resources. If a program has two threads that are floating point intensive, those threads will tend to suffer when run on the same module. They will be relatively deprived of floating point resources, thanks to the shared floating point pipeline. Such threads may be better off run on separate modules. On the other hand, if those two threads are integer-heavy and share lots of data between themselves, they may be better off on the same module; that's because they can share data in the (relatively fast) per-module level 2 cache, rather than the (relatively slow) shared level 3 cache. Two floating point threads with lots of shared data? It could go either way.

Cache thrashing is an issue. Consider two programs, each with two threads, scheduled across two modules. If one thread from each program is placed on each module, those threads will stomp over each other's cache entries, since the threads have no data in common. It's better to instead keep the threads belonging to one program on one module, and the threads belonging to the other program on the other module.

Turbo boosting makes this more complex still. The peak turbo speeds require two modules to be idle so that they can be completely powered down. So a program with four threads will run at a higher clock speed if those four threads run on just two modules. On the other hand, that same program will have lower resource contention and, in effect, more cache available per thread if one thread is placed on each module, leaving each module half idle. This forfeits the top turbo tier. Which is better?

AMD says that systems should strive to get the peak turbo speed for maximal performance; with four threads to run, it's better to put them on two modules than three or four, and reap the clock speed rewards. Threads belonging to the same process should be put together if possible. Programs could probably be devised that work better with one thread per module and no sharing or peak turbo speeds, but as a general rule, AMD's advice makes sense.
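On Linux, a program can apply AMD's advice itself by pinning its threads, without waiting for scheduler changes. The sketch below assumes the common enumeration in which logical CPUs 2m and 2m+1 are the two threads of module m; that mapping, and the helper's name, are assumptions for illustration, so verify the topology on real hardware:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Restrict the calling thread to Bulldozer module `m`, assuming
 * logical CPUs 2m and 2m+1 belong to that module. Packing a
 * process's threads onto few modules leaves others idle for peak
 * turbo and shares the per-module L2; spreading them out reduces
 * contention instead. Returns 0 on success. */
int pin_to_module(int m)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2 * m, &set);
    CPU_SET(2 * m + 1, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}
```

A four-thread program following AMD's advice would pin pairs of threads to modules 0 and 1, leaving modules 2 and 3 idle for the turbo headroom.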

Windows 7's scheduler, however, treats all Bulldozer threads as equal and equivalent. It will willingly place threads from different processes on the same module, and it will spread threads across modules, preventing the full turbo speed from being achieved. The result is that the processor can fail to reach its peak performance for workloads that do not use all eight available threads.


One of the many changes in Windows 8 is an alteration to the scheduler. Windows 8 knows about Bulldozer's module arrangement and will strive to power down modules when it can, allowing higher clock speeds. For thread-heavy workloads, the change makes no difference, but benchmark results from Tom's Hardware show some measurable and desirable gains in tasks that don't use eight threads.

On top of all this, the instruction cache within each module has a flaw that can inflict about a 3% performance penalty whenever threads from different processes are scheduled to the same module. A fix for the issue has been incorporated into the Linux kernel, but the status of other operating systems is currently unclear.

The right design at the wrong time?

Bulldozer has a design that's always going to face an uphill struggle when it comes to running programs with just a few compute-intensive threads, and it's going to face difficulties when running floating point-intensive programs. Bulldozer is very much a product of AMD's long-term worldview: a world of massively multithreaded software with floating point number crunching deferred to GPUs (discrete GPUs in the case of the current Bulldozer parts; integrated or discrete for Piledriver and beyond).

But there's a problem: today's software isn't like that. Typical desktop software doesn't have the ability to scale to arbitrarily many cores. Games, perhaps the most widespread applications with a substantial CPU bottleneck, are becoming more scalable to multiple cores but typically do so only in a limited way. For example, a game may show useful scaling with up to four cores, but no more. And even when performance increases with the addition of extra cores, the gains are rarely linear: the first core may give you 35 frames per second, but it might require another five cores to hit 70 fps. Certainly there are workloads like rendering, which scale essentially perfectly to any number of cores. For these tasks, eight cores are just about eight times better than one. But they remain the exception rather than the rule.
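Those fps figures fit Amdahl's law neatly. If one core gives 35 fps and six cores give 70 fps, the speedup at n = 6 is 2x, and solving 1/((1-p) + p/n) = 2 gives a parallel fraction p of 0.6. The sketch below works through that (the fps numbers are the article's hypothetical, not a measured game):

```c
#include <assert.h>

/* Amdahl's law: with parallel fraction p, n cores give a speedup of
 * 1 / ((1-p) + p/n). A 2x speedup at n = 6 implies p = 0.6, and even
 * infinitely many cores would then top out at 1/(1-p) = 2.5x. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```

The ceiling is the striking part: with 60 percent of the work parallel, no number of extra cores can ever deliver more than 2.5x, which is why weak single-threaded performance is so hard to paper over.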

Server applications are much more likely to support lots of threads, but let's be clear here: Bulldozer is not a "server processor." There are server processors in the world; they're called things like "POWER7" and "SPARC64 VII+," and you can't pick them up for $300 at Newegg. AMD and Intel both produce designs that are multipurpose and mass-market, that can scale down into desktop and laptop CPUs and scale up into servers. They have to support a wide range of applications, including those that have not been optimized for the vagaries of their particular design.

Offloading floating point tasks to GPUs is similarly something that just hasn't really happened in the mainstream. For pure number crunching, GPU acceleration can make a lot of sense. The Top500 supercomputer list contains many entries that use GPUs for their floating point-heavy scientific tasks. GPU acceleration does work, it's just rarely as convenient as writing regular code for the CPU.

Taking an existing floating point-intensive program and adapting it to use the GPU is rarely a simple thing. Programs that run on the CPU allow free intermingling of things that CPUs are good at—traversal of data structures, branches, object-oriented (virtual) function calls—with floating point code. With current tools, moving code to run on the GPU instead requires restructuring and rewriting software so that it doesn't have this same mix-and-match structure. Things need to be cleanly divided between portions that run on the CPU and those that run on the GPU.
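A toy illustration of that restructuring (all names and structure here are hypothetical): in the "mixed" version, pointer chasing and floating point math are interleaved, so nothing can move to the GPU wholesale; in the "split" version, only the pure data-parallel kernel is the piece a GPU offload would replace, with the gather and scatter staying on the CPU:

```c
#include <assert.h>

struct node { double v; struct node *next; };

/* Mixed style: traversal and floating point interleaved. The FP
 * math is buried inside pointer chasing, so it cannot be handed
 * to a GPU as-is. */
void scale_mixed(struct node *head, double k)
{
    for (struct node *p = head; p; p = p->next)
        p->v = p->v * k + 1.0;
}

/* Split style: a pure kernel over a flat array. This function is
 * the only piece a GPU offload would replace; gathering values
 * into the array and scattering results back stays on the CPU. */
void kernel_scale(double *v, int n, double k)
{
    for (int i = 0; i < n; i++)
        v[i] = v[i] * k + 1.0;
}
```

The rewrite is mechanical in a toy like this, but in a large program the gather/scatter boundary cuts across data structures and call graphs, which is exactly why the porting work is rarely simple.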

This situation is improving bit by bit, with projects such as the C++ Accelerated Massive Parallelism library making it much easier to integrate GPU code with regular CPU code, but it will be many years before this kind of technology sees widespread use.

For game developers, there's an added wrinkle to using the GPU for computation: games already use the GPU for graphics. Moving workloads away from the CPU just means overtaxing the GPU.

AMD's dreams may come true, but the change won't happen during Bulldozer's life. The software of today benefits from strong single-threaded performance, and it benefits from giving the CPU plenty of floating point resources. The same will be true of the software of tomorrow. Piledriver, Bulldozer's follow-up, will also be too soon for this kind of software. So will Piledriver's 2013 successor, Steamroller, and its 2014 follow-up, Excavator. Innovation and progress in the computer world is fast in some regards, but extremely conservative in others; just look at the number of people still using the decade-old Windows XP. The kind of revolution that AMD is counting on could easily be ten years away, if it happens at all.

Intel's approach—to have fewer, wider cores (and, with HyperThreading, to share entire cores between threads)—will continue to give its processors the lead in most workloads for the foreseeable future. It will continue to be a much better match to the software that actually exists, rather than the software that AMD would like to exist.

Future promise

AMD's forward-looking design decisions mean that this is a processor that's fundamentally premature, and until its single-threaded performance can be boosted (for example, by ramping clock speeds, or by bringing its cache latencies and execution resources back in line with those of K10), it will always be at a disadvantage relative to Intel's processors. But this needn't be fatal to AMD.

As Global Foundries' manufacturing issues get ironed out, yields should improve and clock headroom should increase. Even with a refined, reliable manufacturing process, it's hard to see AMD shipping processors with a base speed of 4.4GHz—the power draw is astronomical—but if the company could get up to 4GHz base/4.8GHz peak turbo, say, the fast and narrow strategy starts looking a whole lot more sensible. IBM's POWER7 clocks at around that level, so it's certainly within the realm of possibility.

The server-oriented, Opteron-branded Bulldozer parts are not yet available, and these may be more competitive. The headline part there is codenamed Interlagos: two Orochi chips integrated into a 16-thread multi-chip module. Server workloads tend to be integer-heavy, and tend to be multithreaded—precisely the kind of thing at which Bulldozer excels. If Interlagos' 16 threads are enough to offset the weak per-thread performance, Bulldozer should be able to carve out a solid niche in the server space.

Longer term, AMD has started talking up Bulldozer's first revision, Piledriver. Due next year, AMD projects that Piledriver will be about 10 percent faster than Bulldozer currently is. Piledriver will change some of the execution units to support additional floating point instructions, but is not expected to be a major overhaul of the processor's design. Where the 10 percent gain comes from is unclear (you don't gain 10 percent improvements on existing workloads just from adding support for extra instructions), but fixing some of the obvious problems—slow cache, insufficient execution resources—could be the answer.

Windows 8 should be on the market this time next year, further aiding Bulldozer's performance in applications with few threads.

If AMD can get the clock speeds up, if the server benchmark performance is strong, and if Piledriver really delivers a 10 percent improvement, Bulldozer will work out well enough. None of these things is guaranteed to happen, but the outlook is not as bleak as the early benchmarks might suggest. Bulldozer is a stumble, but at the moment it's a recoverable one. There's scope for Bulldozer to get better, and that makes it much more akin to another K10 than to a Prescott.

View: Original Article


  • Replies: 14
  • Views: 1.9k

Lots of work from the NewsHound. Very deserved Like :)

He is also Lite's best friend? :D


I suspect News Hound is actually some form of bot, hence no post count. Besides, if this were really a person, wouldn't he/she want to see the count and bring it up to the admin? I've noticed this before.


  • Administrator

News Hound = The NewsMan = News bot. ;)


Yup. He used to be a Bot then a fairy godmother turned him into a Man but I think he got way too proud of himself that's why he was turned into a dog. :P


There's a hint here :P

maybe im blind, but where exactly?

It's just a hint! And it's just a little hidden :P

PS: It can be "Viewed"...


  • Administrator

There's a hint here :P

maybe im blind, but where exactly?

It's just a hint! And it's just a little hidden :P

PS: It can be "Viewed"...

Shhhhhhhh. It's a secret. :secret: :P


Well, to be on topic... of course AMD will never be done for, but to say it will never keep up with Intel is just obvious. AMD will always be great... just never the best.


  • 3 months later...
insnclwn

I hate bad news, but Intel also has thousands of design problems; Intel chipsets suck too. The older 945/965 chipsets had design problems, including internal thermal design problems. Intel only supports 16 PCIe lanes at half duplex in its low- and mid-range products; only the "high" end gets full duplex. There are also issues with Intel's south bridge/SATA controller and its USB 2 and 3 controllers, plus driver and power module problems. I can't even list all of Intel's issues. And the TLB bug was in everyone's chips: since both are based on IBM designs, Intel and AMD had the same TLB bug, and AMD fixed it first. Stop copying and pasting the "Intel Inside" marketing campaign.



This topic is now archived and is closed to further replies.
