Hyper-Threading is Back!
Believe it or not, one remnant from the dark days of NetBurst has found its way onto Nehalem. Granted, Hyper-Threading (Intel’s brand name for Simultaneous Multi-Threading) was one of the few positives of those processors. In the case of Nehalem, SMT is pretty much required to fully make use of the vast amount of bandwidth the CPU is designed to work with.
Although the NetBurst CPUs benefited from Hyper-Threading, their ‘deeper’ approach to an execution engine (longer pipeline, higher clock speeds) is less suited to SMT than the ‘wider’ design (shorter pipeline, more execution units) of Core and Nehalem. It’s even possible that, had it not been for the FSB bottleneck, we would have seen Hyper-Threading on Core chips.
Overall, with SMT, Nehalem’s performance will scale higher, while requiring less power.
Inclusive L3 Cache
Nehalem features a three-level cache hierarchy – each core has its own unshared 64KB of L1 cache (split into 32KB for instructions and 32KB for data) and 256KB of L2 cache. In addition, there is 8MB of L3 cache shared by all four cores.
You might be thinking that 256KB seems like a minuscule amount of L2 cache; after all, Yorkfield Core 2 Quads had 12MB of the stuff. However, Nehalem has improved the way the individual cores use cache. The L2 essentially becomes a buffer to the L3, and the L3 is inclusive: any data kept in a core’s L2 cache can also be found in the shared L3. This way, if another core requires the same data, it can be found in the L3 cache, and if it’s not in the L3 cache, it won’t be in any of the L2 caches either. This prevents cores from wasting cycles ‘snooping’ another core’s cache for data that isn’t there.
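To make the snoop-filtering benefit concrete, here is a minimal toy model in Python of an inclusive L3. This is purely illustrative (class and method names are my own invention; real cache coherence happens in hardware), but it shows why an L3 miss means no L2 snoops are needed:

```python
# Toy model of an inclusive L3: every line held in any core's L2 is
# guaranteed to also be present in the shared L3.
class ToyCache:
    def __init__(self):
        self.l2 = {core: set() for core in range(4)}  # per-core L2 tags
        self.l3 = set()                               # shared L3 tags

    def fill(self, core, addr):
        """A core bringing a line in populates both its L2 and the L3."""
        self.l2[core].add(addr)
        self.l3.add(addr)  # inclusion property maintained on every fill

    def lookup(self, core, addr):
        if addr in self.l2[core]:
            return "L2 hit"
        if addr not in self.l3:
            # Inclusion guarantees the line is in no other core's L2
            # either, so the other three cores never need to be snooped.
            return "miss: go to memory, no snoops needed"
        return "L3 hit (line may be owned by another core)"

cache = ToyCache()
cache.fill(0, 0x1000)
print(cache.lookup(1, 0x1000))  # L3 hit (line may be owned by another core)
print(cache.lookup(1, 0x2000))  # miss: go to memory, no snoops needed
```

The key line is the `addr not in self.l3` check: a single L3 lookup answers for all four cores at once.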
SSE4.2
As is often the case, Intel is introducing some new Streaming SIMD Extension instructions. It’s not a full new set this time, but seven new instructions added on top of SSE4, meant to accelerate XML parsing, pattern matching, and other string- and text-based tasks.
In one of its slides, Intel shows how the new instructions make XML parsing much more efficient. Rather than comparing a single character at a time, SSE4.2 allows chunks of a string to be processed at once. As XML becomes more important (and robust), this should be highly beneficial in the years to come.
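SSE4.2’s string instructions operate on 128-bit registers, so they can examine up to 16 bytes per instruction. As a rough software analogy only (Python obviously has no access to these instructions, and the real hardware mechanism is different), here is the difference between a byte-at-a-time scan and a 16-byte chunked scan for an XML delimiter:

```python
def find_per_char(buf: bytes, target: int) -> int:
    """Byte-at-a-time scan, like a naive XML tokenizer."""
    for i, b in enumerate(buf):
        if b == target:
            return i
    return -1

def find_chunked(buf: bytes, target: int, width: int = 16) -> int:
    """Scan 16 bytes at a time, loosely analogous to what an SSE4.2
    string-compare instruction does in a single operation."""
    for base in range(0, len(buf), width):
        chunk = buf[base:base + width]
        pos = chunk.find(target)  # 'compare the whole chunk at once'
        if pos != -1:
            return base + pos
    return -1

xml = b"<note><to>Tove</to>"
assert find_per_char(xml, ord(">")) == find_chunked(xml, ord(">")) == 5
```

The chunked version makes one “comparison” per 16 bytes instead of one per byte, which is the shape of the win Intel is claiming for XML parsing.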
Improved Loop Stream Detector
The Loop Stream Detector, or LSD, was introduced with the Core microarchitecture, but has been improved in Nehalem.
On Core, the LSD queue sat between the fetch and decode stages of the instruction pipeline. This allowed it to cache and replay small repetitive loops (up to 18 entries) without having to re-fetch the instructions from the L1 cache over and over. The fetch hardware could then be disabled, saving power.
Nehalem builds on that concept by moving the LSD to a point in the pipeline after the decode stage. Furthermore, rather than caching raw x86 instructions, Nehalem caches up to 28 decoded micro-ops. This way, the branch prediction, fetch, and decode units can all be shut down, saving even more power.
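A simplified sketch of the idea in Python (names and the one-iteration detection are my own simplifications; the real LSD detects loops in hardware over several iterations):

```python
# Toy sketch of a Nehalem-style Loop Stream Detector: once a short loop's
# decoded micro-ops fit in the 28-entry buffer, later iterations replay
# from the buffer and the fetch/decode stages can be powered down.
LSD_CAPACITY = 28

def run_loop(instructions, iterations):
    decode_count = 0
    lsd_buffer = None
    for _ in range(iterations):
        if lsd_buffer is None:
            # No cached copy: fetch and decode every instruction.
            uops = [f"uop({insn})" for insn in instructions]
            decode_count += len(instructions)
            if len(uops) <= LSD_CAPACITY:
                lsd_buffer = uops  # loop fits: cache the decoded micro-ops
        # else: replay lsd_buffer with no fetch or decode work at all
    return decode_count

# A 10-instruction loop run 1000 times is decoded only once:
print(run_loop(["add", "cmp", "jne"] * 3 + ["dec"], 1000))  # 10
# A 40-instruction loop doesn't fit, so it is decoded every iteration:
print(run_loop(["nop"] * 40, 5))  # 200
```

The point of counting decodes is that each one the LSD avoids is fetch, branch-prediction, and decode hardware that can stay dark.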
Turbo Mode
It seems like Nehalem is all about bringing back retro goodies from the good old days. First, we have Hyper-Threading making its triumphant return, better than ever. In addition, Nehalem supports Turbo Mode, a name you might remember from the 486 days, when pressing a button on your PC would double the CPU’s speed from, say, 33 MHz to 66 MHz.
On Nehalem, Turbo Mode works in the background, and contrary to rumors, is available on all Core i7 processors and not just Extreme Edition versions. When it is enabled, it essentially overclocks the processor by increasing the multiplier by a certain amount depending on how many cores are being loaded. For instance:
When you buy a 3.2 GHz i7 965, it actually hits 3.33 GHz when two or more cores are loaded. Under a single core load, it can reach 3.46 GHz.
The Core i7 940 runs at a stock speed of 2.93 GHz. When Turbo kicks in under a multi-core load, it reaches 3.06 GHz. Under a single-core load, it hits 3.2 GHz.
And finally, the Core i7 920, with a clock speed of 2.66 GHz, reaches a multi-core speed of 2.8 GHz, and a single-core speed of 2.93 GHz.
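All of those figures fall out of the same arithmetic: Core i7 runs its multiplier against a nominal 133.33 MHz base clock, and Turbo adds one multiplier step (“bin”) under multi-core load and two under single-core load. A quick sketch reproduces the numbers above (the truncation to two decimals mirrors how the marketed frequencies are written, and is my assumption about the rounding):

```python
import math

BCLK_MHZ = 400 / 3  # Core i7 base clock, nominally 133.33 MHz

def marketed_ghz(multiplier):
    """Clock for a given multiplier, truncated to 10 MHz the way the
    marketed frequencies appear to be quoted."""
    mhz = multiplier * BCLK_MHZ
    return math.floor(mhz / 10) / 100

# (model, stock multiplier); Turbo adds 1 bin multi-core, 2 bins single-core
for model, mult in [("i7 965", 24), ("i7 940", 22), ("i7 920", 20)]:
    print(f"{model}: stock {marketed_ghz(mult)} GHz, "
          f"multi-core turbo {marketed_ghz(mult + 1)} GHz, "
          f"single-core turbo {marketed_ghz(mult + 2)} GHz")
```

Run it and you get exactly the 3.2/3.33/3.46, 2.93/3.06/3.2, and 2.66/2.8/2.93 GHz figures listed above.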
Turbo Mode is nice to have, but may lead to confusion with all the different clock speeds under different scenarios. In the end, relative performance between the models stays the same, though it is possible that lower-end Core i7s may not support this feature. Also, Turbo is self-managed, meaning it will only kick in when the CPU is within its temperature limits and running at a stable voltage. When overclocking, Turbo should probably be disabled.
You might even think of it as ‘reverse throttling’. Under most circumstances the 965 will reach 3.46 GHz easily, but because it cannot run at that speed under all circumstances, it can’t be sold at that speed. Still, it’s nice that Intel is offering this, allowing non-overclockers to get the most from their processors.
It should be noted that on Extreme Edition versions, which have a completely unlocked multiplier, Turbo Mode can be set individually. In other words, you could simply crank up the multiplier on the 965 Extreme Edition from 24x to 26x, and have Turbo Mode set to 27x and beyond. You can even set individual multipliers depending on how many cores are loaded. So if the CPU is stable with one core at high speeds but fails under multi-core loads, you could set Turbo Mode to only kick in under that scenario.
QuickPath Interconnect
I briefly mentioned QPI on the previous page, and I’ll go into a bit more detail here. To give you an idea of how much more bandwidth Nehalem has than Penryn, consider that a Penryn quad-core with a 1600 MHz FSB had about 12.8 GB/s of bandwidth, going one way or the other (reading or writing – it couldn’t do both at once).
The QPI used in today’s Core i7 Extreme Edition processors consists of two unidirectional 20-bit links, one for each direction. It runs at 6.4 GT/s (4.8 GT/s for the lower models), and since 16 of those 20 bits carry data, that works out to a usable bandwidth of 12.8 GB/s per direction (9.6 GB/s for the lower models). Because QPI can perform reads and writes at the same time, the peak theoretical bandwidth is upwards of 25.6 GB/s (or 19.2 GB/s for the lower models) – assuming a perfect balance of reads and writes to saturate the bus.
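The arithmetic is simple enough to spell out: 16 data bits is 2 bytes per transfer per direction, and the two links run simultaneously. A few lines of Python make the numbers above explicit:

```python
def qpi_bandwidth_gbs(gt_per_s):
    """Usable QPI bandwidth: each direction moves 2 bytes of payload per
    transfer (16 data bits of the 20-bit link), and the two unidirectional
    links can run simultaneously.  Returns (one-way, bidirectional) GB/s."""
    per_direction = gt_per_s * 2       # GB/s in one direction
    return per_direction, per_direction * 2

print(qpi_bandwidth_gbs(6.4))  # (12.8, 25.6) -- Extreme Edition
print(qpi_bandwidth_gbs(4.8))  # (9.6, 19.2)  -- lower models
```
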
So on the desktop side that we focus on, the Core i7 has up to twice as much bandwidth to the I/O controller as Core 2 ever had. With its immense scalability, though, QPI really hits its stride on the server side of things, where cores can have an incredible amount of bandwidth and shorter paths over which to communicate with each other. We don’t really focus on server products here, but I will direct you to a good Flash presentation by Intel that explains QPI in great detail with regard to server applications.