
Is IE9 cheating at SunSpider benchmark?


nsane.forums



The SunSpider JavaScript performance benchmark, devised by the developers of the WebKit browser engine, is widely used and quoted as a measure of browser scripting performance. Mozilla developer Rob Sayre recently noticed a surprising result while looking at Internet Explorer 9's performance in this test: on one of the many subtests it performs, Internet Explorer 9 was finishing almost instantly.

In and of itself, that's not necessarily very interesting; several of the subtests in SunSpider are near-instant in the browser. However, it piqued the developer's curiosity. He made some minor changes to the test—changes that don't alter the result of the calculation the test performs and that, naively at least, should be treated as equivalent—and saw Internet Explorer 9 slow down considerably. He filed a bug against Internet Explorer.

Sayre's bug report was conservative—he suggested that an optimization that Internet Explorer 9's Chakra JavaScript engine was performing was fragile, and was easily disabled by minor alterations to the code that it should disregard. Coverage earlier today of the same issue was less guarded: Internet Explorer 9 was accused of cheating in the test. The allegation is that Microsoft has built a specific optimization into Chakra that detects, and bypasses, the specific code in SunSpider, but which has no other purpose. In other words, the optimization will not do anything to improve the browser's performance in any other scenario.

Historical precedent

Such a move would not be completely unprecedented. The SPEC organization produces benchmarks for processors, mail servers, JVMs, and a range of other tasks. Its CPU benchmarks are commonly used for evaluating processor performance across a wide range of mostly real-world tasks; it is, or at least was, an important suite of tests. An old version of the benchmark, SPEC2000, included a test of floating point performance called 179.art.

179.art was a program to test image recognition using a neural network. The benchmark code is representative of real-world code; it has not been engineered for maximum performance, and in fact does a number of things that hurt its performance in various ways. A programmer wanting to make 179.art go faster would have a range of reasonably simple changes available that would yield a healthy performance improvement. But the SPEC tests do not allow programmers to make changes; only optimizations performed by the compiler are permitted.

So what happened was that compiler vendors modified their compilers to specifically detect that they were compiling 179.art, and applied these specific changes. Sun was probably the first to do so, but eventually such optimizations became widespread, with performance gains of 30-fold or more becoming common. The optimizations were in no sense general-purpose: they accelerated 179.art, but would not increase the performance of any other piece of code on the planet. Sometimes, they were not even safe: minor code variations that should have changed the result of the calculations would still be subjected to the 179.art-specific optimizations, resulting in broken programs.

Against this backdrop, the suspicion of Internet Explorer is, therefore, at least somewhat understandable. High-profile benchmarks carry with them a lot of bragging rights, and if Microsoft were to tweak Chakra to ensure it got good results, they certainly wouldn't be the first.

Dead code elimination

But that's probably not what's happened here.

The exact optimization in question here is one of a class called dead code elimination. A surprisingly common feature of many programs is that they contain pieces of code that are pointless—dead code. There are two main kinds of dead code. Sometimes, there is no possible pathway through the program that can result in a particular piece of code being executed. The code is said to be "unreachable." One common scenario that leads to unreachable code is when a programmer wants to temporarily skip part of a program; they will do something such as prepend the code with "if(false)" to allow it to be bypassed. Sometimes unreachable code is a bug; a programmer writes a piece of code expecting it to be executed, without noticing that the program quits a few lines above.
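Both flavors of unreachable code can be seen in a few lines of JavaScript (a made-up sketch; the function names are purely illustrative):

```javascript
function compute(x) {
  return x * 2;
  console.log("this never runs"); // unreachable: the function has already returned
}

function demo() {
  // Temporarily skipping a block with a constant-false condition;
  // the body is unreachable, and a compiler may discard it entirely.
  if (false) {
    veryExpensiveDebugChecks(); // hypothetical helper, never called
  }
  return 42;
}
```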

The other major kind of dead code is more general; it's code that can be reached and executed, but whose results are never used. This kind of dead code is a common pitfall in benchmark programs. Because benchmarks don't do any useful work, they have a common tendency to perform some slow, expensive task (the one whose performance they are attempting to measure) and then simply ignore the result. After all, a benchmark is generally concerned only with how long something takes. Since computing a result that is then ignored is pointless, compilers are quite entitled to remove the slow-but-ignored calculations. It makes the program run faster, and since the results were ignored anyway, it doesn't change the output of the program.

The subtest of SunSpider with the anomalous results is called "cordic." It tests a function that computes the sine and cosine of a number using a CORDIC algorithm. In many ways this is highly artificial: JavaScript contains built-in sine and cosine functionality, functionality that will be much faster than performing the computation in this way, so it is not something real programs would ever do. And like many benchmarks, the test does not bother to use the results it has computed. This makes the entire test susceptible to dead code elimination.

And lo, this is what Internet Explorer 9 does. It—accurately—treats the entire test as dead code, and so removes the whole lot. This makes for a very fast benchmark result indeed.

So the optimization itself is legitimate and of a kind that is common to compilers. It's one of the best optimizations there is, in fact—the optimized code runs instantly, as it has been entirely removed, and you don't get much faster than instant.

But what about the fact that small and apparently irrelevant modifications prevented the optimization from kicking in? Compiler optimization is a tricky thing, and compiler authors tend to be very conservative about which optimizations they apply. A program that produces the wrong answer is far worse than a program that's a little bit slow, with the result that if a fragment of code does not follow exactly the pattern the compiler expects, the compiler won't apply the optimization.

The compilation process is normally a multistage affair. The compiler reads the program source code, checking that it conforms to the grammatical rules of the language, and builds a kind of in-memory representation of the program, typically verifying along the way that the program "makes sense." This representation is then turned into actual executable code.

Optimizations can be performed both on the in-memory representation, and during the generation of executable code. Different kinds of optimization make sense at different stages. If the intermediate representation, or the final executable code, matches a pattern the optimizer is looking for, the optimization will be applied.

A fragile optimization

Small changes, even changes that should be innocuous, can alter the intermediate representations that the optimizer actually examines. The changes that Sayre made to the cordic test were small, and didn't fundamentally alter the structure of the test code. However, the impact that those changes may have had on the intermediate representations is hard to predict—we don't know exactly what Microsoft's compiler is doing, how it's representing the program internally, or what exact patterns it is looking for. It might well be that the pattern matching is just particularly fragile, and that small changes are throwing it off, and preventing the optimization.

Experimentation by readers of Hacker News paints a complex picture. Some modifications defeated the optimization, but others did not. Microsoft has described a few other code fragments that fit the pattern and trigger the optimization. Sayre has performed further analysis of the browser's behavior; functions that use a limited range of mathematical operations (including addition, subtraction, and incrementing) can be eliminated as dead code. Functions that use other mathematical operations, however—including multiplication and division—will not be.

It happens—whether by coincidence or by design—that the mathematical operations used by the cordic test are all on the "permitted" list, and as such can be eliminated. However, other functions that use those same operations can be eliminated too. While it's possible, and, I think, likely, that this optimization was "inspired" by cordic, it has been written in such a way that it has applicability beyond that test. This really is an optimization that can apply to other functions, just as long as they meet certain criteria.

To my mind, that makes it a legitimate optimization. If it could only accelerate cordic, it would be illegitimate, as the optimization would have no "real-world" application. But examples have been constructed that also get optimized in the same way. It's not yet clear how much real-world code gets optimized like this, but it's certainly possible that some does, and that's good enough for me.

This is essentially the same standard that SPEC uses for determining the legitimacy of compiler optimizations: an optimization that can only apply to the test is forbidden. But an optimization that can apply to other programs too (even if only a limited number of other programs) is acceptable.

An optimization not fragile enough

A danger of dead code elimination is that the compiler will eliminate code that is not actually dead; code that causes some effect that, for whatever reason, the compiler does not recognize. Eliminating that code will eliminate the effect caused by the code, thereby breaking programs. Unfortunately, it appears that Internet Explorer 9's dead code elimination has this very issue.

Sayre provides a few scenarios in his analysis demonstrating this problem. Essentially, the dead code elimination would be legitimate in a language like C# or Java, but due to some of JavaScript's unusual features, it is not legitimate in JavaScript. JavaScript allows objects, even built-in ones like arrays and numbers, to be modified, such that even apparently simple tasks like adding numbers or accessing data in an array can do unexpected, programmer-specified things.

This makes dead code elimination very dangerous. Code that is dead for "normal" objects (ones that have not been modified in strange ways) can suddenly be "alive" for these unusual ones. Microsoft's optimization does not properly handle this case, and so will treat code as dead even when it should not.

On the one hand, this is precisely the kind of reason why an optimization would be applied narrowly in the first place—it's just too easy to break programs by using an optimization more liberally than is safe. On the other hand, it shows that even the narrow scenarios that Microsoft is applying dead code elimination to are too broad.

No need to cheat

There are also important differences between the cordic test and 179.art. The 179.art optimizations caused a substantial increase in the overall SPECfp score. The 179.art optimization came in 2002, at a time when x86 processors were starting to really put the hurt on RISC processors and starting to take big chunks out of their market share. By 2003, Intel and AMD were posting SPECfp scores of around 1200. Sun's UltraSPARCs were languishing at the mid-900s. The 179.art optimization took UltraSPARC's score up past 1100—enough to be respectable, compared to the cheaper x86 alternatives. The 179.art optimization was, therefore, well worth doing.

That really isn't the case here; of course the optimization helps the overall score, but the scores of the top browsers are so similar (all in the 200-300 millisecond range) that they're just trading places without any meaningful difference between them. Even without the optimization, Internet Explorer 9 achieves a score within that range; it doesn't need, and isn't getting, a 179.art-style boost to its scores.

The way the scoring works means that even if the Internet Explorer team did feel it necessary to cheat, a single optimization wouldn't be the way to do it. It would take a whole raft of special-cased optimizations that applied across the range of tests, and we would expect to see consistent improvement.

But if we look at the actual results (PP6, PP7), they look, well, realistic. PP7 is faster overall than PP6—to be expected, given PP7's billing as a performance-focused release—but not universally. There are a couple of performance regressions, which certainly give the results a veneer of plausibility. Compiler optimization is not an exact science, and a change that may be beneficial most of the time, and hence worth making, can result in worse performance some of the time.

And then, of course, there's a whole wide world beyond SunSpider. In use, Internet Explorer 9 feels fast. It's rendering real websites quickly. It also runs the various demos on Microsoft's test drive site quickly. This strongly suggests that the browser is genuinely quick, regardless of how well it performs in some benchmark or other.

Transparency matters

From a purely practical standpoint, if Chakra were open source, this entire issue would not have arisen. Anyone concerned by the performance behavior would be able to examine the compiler for themselves and see just why this particular optimization was being applied in some cases but not in other, very similar cases. For a minority of developers and users who have commented on this issue, I suspect that nothing short of this kind of open approach will be satisfying as evidence that the compiler is acting in a legitimate way.

For the rest of us, the question is whether or not we believe the Internet Explorer team when it says that this particular optimization works for more than SunSpider. If it does, then it is a legitimate optimization; if it doesn't, it is a deceitful one. Though many transformations of the code do indeed break the optimization—and hence imply that there is something dubious going on—there are transformations that don't. The optimization does indeed apply to code apart from SunSpider. That is very strong evidence that the browser is truly attempting to perform dead code elimination.

In conjunction with the fact that this doesn't make sense as a cheat, and certainly isn't enough to explain the browser's excellent performance across a wide range of websites, I think the claim that it is legitimate is stronger still.

It would, of course, be nice if the optimization were more robust and more predictable—and, of course, correct. If nothing else, the current behavior leads to results that are, for many Web developers, surprising. But surprising doesn't mean dishonest.

View: Original Article



  • Administrator

Caught red-handed. :angry:

Even if it looks like it's not their fault, I always wondered how it performed so well on SunSpider and so badly on Peacekeeper. Now I know why.



  • Administrator

Not really. On Peacekeeper, most of the benchmarks are properly visible while they run, and you can easily tell the difference. If a browser has done something to game it, you can also load your own pages with your own content and see which performs best.

I've done my own tests to say that. :)



@DKT27:

I'm not talking about tests. I'm talking about code.

If you don't understand, then I rest my case :bag:



You know, I look at this a little differently. They definitely never cheated on Acid3, especially with IE8, LOL. But if you're creating a browser that will be tested on something, and you don't necessarily cheat but create something that does well on the test, could you really call it cheating? And what exactly was changed in the code, and what was done precisely? I mean, it wouldn't be hard to do, and you would have to run the other browsers against this new test to really say it was 'cheating.' The only time I would think that is when the browser actually bypasses the test altogether and uses its own resources to show that it was executed.

While it is still being written, still in alpha, beta, and preview phases, I could really say that it is very possible the code is not even finished nor ready to be battered against the tests yet. Final items, bugs, and other things may not be worked out yet. To me, making a statement like this is like watching Nancy Grace on CNN sensationalize something that everyone already gets 'stirred up' about at the drop of a hat, before the facts are in, almost for viewer ratings and numbers, something that really sticks your eyeballs to the screen. It is premature, and when you consider all of the outstanding points left open to question by what has been done, I would call it an incompetent means to an end.

This would not be something that IE9 would be able to hide in the performance department as a final version in the public's hands; it would be obvious. And even then, why would they release the final version, with all of the HTML5 improvements and a new browser, without addressing the issues, which would in the end affect almost every aspect of the browser? It doesn't make any sense.

Maybe it's to keep the barracudas and piranhas at bay until it's finished, while they attempt to chew it up as they do at every turn, on every aspect. That would make sense: something that is out there as proof of concept until final.



  • Administrator

Well, ZDNet did criticize the story for being sensationalized and also asked for pardon, because I believe they were the first to break this so-called news.

But that apart, this assures me that IE isn't worth a place on my list as far as speed is concerned. I thought I should give it a chance after all those results, but not anymore. They may be great at HTML5, but is HTML5 even a standard yet?



Archived

This topic is now archived and is closed to further replies.
