Intel’s ‘NEHALEM-EP’ processors go green
By Jean-François Lemerre, Bull
New architecture and new functionality: the latest Intel® Xeon® processor combines performance and energy efficiency
Graduate of Ecole Polytechnique de Paris, Jean-François Lemerre is head of the Performance department in Bull’s Server Design and Development Business Unit. He is Bull’s representative in official performance bodies including SPEC (the Standard Performance Evaluation Corporation) and TPC (the Transaction Processing Performance Council).
Nehalem-EP (NHM) is the codename for this newest addition to the Intel® Xeon® 5500 range of processors, featuring Intel’s new micro-architecture, the successor to the Enhanced Intel Core architecture. EP is the version of Nehalem designed for dual-socket servers. Its architecture is optimized to offer a very high performance-to-power-consumption ratio. Other versions built on the Nehalem core include the single-socket workstation version, sold under the name Core i7, and the expandable, scalable EX version designed for servers featuring four or more sockets.
Nehalem-EP, like Core i7, features four cores integrated on the same chip. And like its precursor Penryn (more precisely Harpertown, in the dual-socket range), it is manufactured using 45nm1 technology. It is x86-64 compatible.
Key characteristics of Nehalem-EP:
- Integrated memory controller offering three channels, making it possible to triple the memory throughput in comparison with previous architectures
- Three levels of inclusive cache, with Level 3 (8MB) being shared between the four cores
- A hyperthreading mechanism: each core can process two threads simultaneously, creating the equivalent of 16 virtual processors for a dual-socket server
- A QPI (Quick Path Interconnect) bus system
- Support for SSE4.2 instructions and virtualization
- Clock and energy management for each individual core.
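The characteristics above combine into a quick count of the logical processors a dual-socket server exposes to the operating system. This is an illustrative sketch restating the figures from the list, not vendor code:

```python
# Logical (virtual) processors visible to the OS on a
# dual-socket Nehalem-EP server, per the characteristics above.
sockets = 2            # dual-socket server
cores_per_socket = 4   # four cores per chip
threads_per_core = 2   # hyperthreading: two threads per core

logical_processors = sockets * cores_per_socket * threads_per_core
print(logical_processors)  # 16 virtual processors
```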
Mechanisms for increasing application performance: integrated memory controller, optimized NHM core and hyperthreading, enhanced cache
A new feature in Intel architecture: the integrated memory controller. It replaces the North-bridge component of earlier architectures, which provided all the cores with access to a centralized memory and had a measurable bottleneck effect on the operation of many applications.
Because each socket connects directly to memory, the bandwidth for memory access increases with the number of processors. In effect, each socket provides three DDR3 memory channels operating at up to 1333MHz. This yields a theoretical throughput of 32GB/s per socket, or 64GB/s for a dual-socket server, compared with 21GB/s over the FSB in the previous generation: the theoretical memory throughput is multiplied by three. In practice, a throughput approaching 40GB/s for the two sockets has been measured using the STREAM benchmark.
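The 32GB/s figure can be checked from the channel parameters; the 8 bytes per transfer below is the usual width of a 64-bit DDR3 channel and is an assumption, since the text only gives the channel count and speed:

```python
# Theoretical DDR3 bandwidth per socket, from the figures above.
channels_per_socket = 3
transfers_per_second = 1333e6  # DDR3-1333: 1333 mega-transfers per second
bytes_per_transfer = 8         # 64-bit (8-byte) channel, an assumption

per_socket_gbs = (channels_per_socket * transfers_per_second
                  * bytes_per_transfer) / 1e9
print(round(per_socket_gbs))      # ~32 GB/s per socket
print(round(2 * per_socket_gbs))  # ~64 GB/s for a dual-socket server
```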
With this new architecture, the time taken to access memory is no longer uniform, but depends on the respective positions of the core and the memory chips being accessed: this type of architecture is referred to as Non-Uniform Memory Access (NUMA). The data access time is measurably faster when a processor uses the memory attached to its own socket than when it uses the memory of a distant socket. The so-called NUMA factor is the ratio between these two access times.
The impact in terms of performance is minimized thanks to two features of the new servers.
- The first is a software feature: the operating system takes the NUMA topology into account and prioritizes the use of so-called ‘local’ memory when allocating memory to an application.
- The second is a hardware feature: the introduction of QPI (Quick Path Interconnect). Each Nehalem-EP socket is equipped with two QPI links. This new communication link is both fast and offers a large bandwidth, enabling data exchange between the processors, the input/output DMAs and the different memories. The QPI is a more powerful equivalent of the AMD HyperTransport link. With an operating frequency of 6.4 GT/s2, it enables data exchange at 25.6GB/s (12.8GB/s in each direction). This rate can sustain the cumulative throughput of PCI Express Gen2 cards, or a substantial memory throughput between the two sockets. The low latency of the QPI minimizes data access times: 108ns for distant memory, compared with 64ns for local memory, which means the NUMA factor is below 2.
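These figures are mutually consistent: at 6.4 GT/s with 2 bytes per transfer in each direction (the standard QPI link width, assumed here since the text does not state it), the quoted rates follow, and the latency ratio confirms a NUMA factor below 2. A minimal sketch:

```python
# QPI link throughput and NUMA factor, from the figures above.
transfers_per_s = 6.4e9   # 6.4 GT/s
bytes_per_transfer = 2    # 16 data bits per direction (assumed link width)

one_way_gbs = transfers_per_s * bytes_per_transfer / 1e9
print(one_way_gbs)      # 12.8 GB/s per direction
print(2 * one_way_gbs)  # 25.6 GB/s in total

numa_factor = 108 / 64  # remote vs. local memory latency, in ns
print(numa_factor)      # 1.6875, i.e. below 2
```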
The NHM core delivers a significant improvement in performance despite a clock frequency somewhat lower than that of the previous generation. In effect, everything possible has been done to increase the number of instructions executed per cycle.
Among the notable improvements, the Out-of-Order execution engine has been enhanced with a larger instruction re-order buffer, and the branch prediction mechanism has been redesigned to improve prediction accuracy and reduce the penalty for mispredictions. The decoding of x86 instructions into fused elementary micro-instructions is now even more effective. Finally, the instruction set has been enhanced with SSE4.2, the synchronization (LOCK*) instructions are more efficient, and improved support for virtualization has been implemented.
Hyperthreading has also been improved, with the duplication of a number of registers. This enables two threads to interleave without loss of time: as soon as the first stalls for lack of a resource (memory, for example), execution is handed over to the other thread, which uses its own set of registers. Although this technology (which has proved effective for databases) lowered performance in High-Performance Computing (HPC) on earlier processors, it has been established that this is no longer the case with the new core.
A large Level 3 cache (8MB) shared between the four cores also delivers significant improvements. It replaces the two-by-two shared caches of the Penryn architecture and achieves very low ‘miss’ rates, so data at this level can be accessed more rapidly by all four cores. Nehalem uses an architecture with three levels of cache, where Level 3 is the shared cache. Each core accesses the last, shared level of cache via its Level 1 caches (unchanged at 32KB for instructions and 32KB for data) and a Level 2 cache of 256KB. All these caches are inclusive, which simplifies coherence management. Access latencies are 4 cycles, or 1.4ns, for L1; 10 cycles, or 3.3ns, for L2; and 40 cycles, or 13ns, for the L3 cache.
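The cycle counts and nanosecond figures are linked by the clock frequency. Assuming a clock close to 3GHz (the exact frequency is not stated here, so ~2.93GHz below is an assumption), the conversion is straightforward:

```python
# Converting cache latencies from cycles to nanoseconds.
# The 2.93 GHz clock is an assumption; the text only says "around 3GHz".
freq_hz = 2.93e9
cycle_ns = 1e9 / freq_hz  # duration of one clock cycle, in ns

for cache, cycles in [("L1", 4), ("L2", 10), ("L3", 40)]:
    print(f"{cache}: {cycles} cycles = {cycles * cycle_ns:.1f} ns")
```

The results land within a fraction of a nanosecond of the quoted figures; small differences would disappear with a slightly different assumed frequency.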
Unprecedented functionality at the heart of the processor to optimize energy efficiency
Another extremely interesting aspect of the Nehalem-EP socket is the ability to manage the processor’s power consumption. This is made possible by dynamically changing the processor’s frequency and supply voltage, which in turn controls its power consumption: dynamic power is roughly proportional to the frequency and to the square of the voltage. It is also worth noting that this mechanism can put some parts of the circuit into ‘idle’ mode by reducing their frequency to 0, and that it also enables the electricity consumption of each individual socket to be measured.
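As a rough illustration of the relationship just described (dynamic power roughly proportional to frequency times voltage squared), the scaling ratios below are hypothetical examples, not Intel measurements:

```python
def relative_dynamic_power(f_ratio: float, v_ratio: float) -> float:
    """Dynamic power relative to nominal, using P ~ f * V^2."""
    return f_ratio * v_ratio ** 2

# Hypothetical example: 20% lower frequency and 10% lower voltage.
print(round(relative_dynamic_power(0.8, 0.9), 3))  # 0.648: ~35% less power
# A frequency ratio of 0 models an 'idle' part of the circuit.
print(relative_dynamic_power(0.0, 0.9))  # 0.0
```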
These kinds of control mechanisms can be implemented either to limit consumption, or to increase performance levels. For example, when some cores are unused, it will be possible to switch them into ‘idle’ mode, either to follow an intelligent energy management policy and avoid going over a pre-defined level of energy consumption, or to take advantage of the energy made available to increase the processor’s speed of execution; so-called ‘Turbo’ mode.
Reference diagram for a Nehalem-EP dual-socket server
From processor to applications, the advantages of Nehalem-EP have been directly evaluated by Bull’s experts
In conclusion, let us compare the use of two successive generations of architectures, Harpertown and Nehalem-EP, in two Bull servers. For each server, we will consider two sockets, four cores per socket, utilized at maximum frequency. Our examination of the two configurations led to the following conclusions:
- A theoretical peak performance in Gigaflops3 that is practically identical (4 floating-point operations per cycle and per core), since the frequency is roughly the same, at around 3GHz
- Memory latency is improved by approximately 50ns, when accessing ‘local’ memory of the socket under consideration; in other words, it is almost halved
- Memory throughput is multiplied by three
- SPEC CPU2006 ‘rate’ results (in other words, with all cores active) very clearly improved: by a factor of 1.8 for the integer benchmarks (250 against 140 SPECint_rate2006) and 2.5 for the floating-point benchmarks (195 against 79 SPECfp_rate2006). Even though these benchmarks cannot be said to be perfectly representative, they give us an interesting idea of the relative performance of these processors, all the more so as the compiler is the same, or almost the same. So here we can measure the effect of the optimizations of the core and of hyperthreading.
- Transaction-processing performance for business applications is also exceptional. The TPC-C and TPC-E benchmarks yield values of the order of 632 Ktpm4 and 800 tps respectively, as achieved by some IT manufacturers (against 275 Ktpm and 317 tps for the earlier generation).
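The headline figures in this comparison can be cross-checked with simple arithmetic; the 3GHz value is the approximate frequency given above:

```python
# Theoretical peak for the dual-socket configuration described above.
flops_per_cycle = 4   # floating-point operations per cycle and per core
freq_ghz = 3.0        # approximate frequency, per the text
cores_per_socket = 4
sockets = 2

peak_gflops = flops_per_cycle * freq_ghz * cores_per_socket * sockets
print(peak_gflops)  # 96.0 Gflops

# SPEC CPU2006 'rate' improvement factors quoted above.
print(round(250 / 140, 1))  # 1.8 for the integer benchmarks
print(round(195 / 79, 1))   # 2.5 for the floating-point benchmarks
```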
1 nm: nanometers
2 GT/s: Giga Transfers per second
3 Gigaflops: one billion floating-point operations per second
4 Ktpm: Kilo transactions per minute