Intel’s ‘NEHALEM-EP’ processors go green
By Jean-François Lemerre, Bull



New architecture and new functionality: the latest Intel® Xeon® processor combines performance and energy efficiency 

A graduate of the École Polytechnique in Paris, Jean-François Lemerre is head of the Performance department in Bull’s Server Design and Development Business Unit. He represents Bull in official performance bodies including SPEC (the Standard Performance Evaluation Corporation) and TPC (the Transaction Processing Performance Council).

Nehalem-EP (NHM) is the codename for the Intel Xeon® 5500 series, the newest addition to the Xeon range, featuring Intel’s new micro-architecture, the successor to its Enhanced Intel Core architecture. EP is the version of Nehalem designed for dual-socket servers, and its architecture is optimized to offer a very high ratio of performance to power consumption. Other processors using the Nehalem core include the single-socket workstation version, sold under the name Core i7, and the expandable, scalable EX version, designed for servers with four or more sockets.

Nehalem-EP, like Core i7, integrates four cores on a single chip. Like its predecessor Penryn (more precisely, Harpertown in the dual-socket range), it is manufactured using 45 nm¹ technology. It is x86-64 compatible.

Key characteristics of Nehalem-EP:


Mechanisms for increasing application performance: integrated memory controller, optimized NHM core and hyperthreading, enhanced cache
The integrated memory controller is a new feature in Intel architecture. The North Bridge component of earlier architectures, which gave all cores access to a centralized memory, has disappeared. That shared path was a measurable bottleneck for many applications.
Because each socket now connects directly to its own memory, memory bandwidth increases with the number of processors. Each socket provides three DDR3 memory channels that operate at up to 1333MHz. This gives a theoretical throughput of 32GB/s per socket, or 64GB/s for a dual-socket server, compared with 21GB/s over the front-side bus (FSB) of the previous generation: the theoretical memory throughput is multiplied by three. In practice, a throughput approaching 40GB/s for the two sockets has been measured using the STREAM benchmark.
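The theoretical figures above can be reproduced with a short calculation. The sketch below is illustrative only: it assumes a 64-bit (8-byte) data path per DDR3 channel, which the article does not state, and treats 1333MHz as 1333 million transfers per second. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(channels: int, transfers_per_s: float,
                        bus_width_bytes: int = 8) -> float:
    """Theoretical peak memory throughput in GB/s (1 GB = 10**9 bytes).

    bus_width_bytes=8 is an assumption (64-bit DDR3 channel),
    not a figure taken from the article.
    """
    return channels * transfers_per_s * bus_width_bytes / 1e9


# Figures quoted in the article: three channels at up to 1333 MHz per socket.
per_socket = peak_bandwidth_gb_s(channels=3, transfers_per_s=1333e6)
dual_socket = 2 * per_socket

FSB_PREVIOUS_GB_S = 21.0   # previous generation, as quoted in the article
STREAM_MEASURED_GB_S = 40.0  # approximate measured value quoted in the article

print(f"per socket : {per_socket:.1f} GB/s")
print(f"dual socket: {dual_socket:.1f} GB/s")
print(f"gain vs FSB: {dual_socket / FSB_PREVIOUS_GB_S:.1f}x")
print(f"STREAM / theoretical: {STREAM_MEASURED_GB_S / dual_socket:.0%}")
```

Under these assumptions the measured STREAM figure corresponds to a little under two thirds of the theoretical peak.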

With this new architecture, memory access time is no longer uniform: it depends on the relative positions of the core and the memory chips being accessed. This type of architecture is referred to as Non-Uniform Memory Architecture (NUMA). Data access is measurably faster when a processor uses the memory attached to its own socket than when it uses the memory of a remote socket. The ratio between these two access times is known as the NUMA factor.
The impact in terms of performance is minimized thanks to two features of the new servers.
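To make the NUMA factor concrete, here is a minimal sketch. The latency values are hypothetical placeholders, not measurements from the article, and the factor is taken as remote time divided by local time (the article does not say which way round the ratio is expressed). The function names are also illustrative.

```python
def numa_factor(local_ns: float, remote_ns: float) -> float:
    """Ratio of remote to local memory access time (assumed convention)."""
    return remote_ns / local_ns


def average_latency_ns(local_ns: float, remote_ns: float,
                       local_fraction: float) -> float:
    """Mean access time when local_fraction of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only to illustrate the arithmetic.
LOCAL_NS = 60.0
REMOTE_NS = 90.0

print(f"NUMA factor: {numa_factor(LOCAL_NS, REMOTE_NS):.2f}")
for fraction in (1.0, 0.8, 0.5):
    avg = average_latency_ns(LOCAL_NS, REMOTE_NS, fraction)
    print(f"{fraction:.0%} local accesses -> {avg:.1f} ns on average")
```

The closer the share of local accesses is to 100%, the less the NUMA factor matters, which is why memory placement relative to the executing core is important.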

The NHM core delivers a significant improvement in performance despite a clock frequency somewhat lower than that of the previous generation: everything possible has been done to increase the number of instructions executed per cycle.
Among the improvements, the out-of-order execution engine has been enhanced with a larger instruction re-order buffer, and the branch prediction mechanism has been redesigned to improve prediction accuracy and reduce the penalty for mispredictions. The decoding of x86 instructions into fused micro-operations is now even more effective. Finally, the SSE instruction set has been extended with SSE4.2, the synchronization (LOCK*) instructions are more efficient, and hardware support for virtualization has been improved.
Hyperthreading has also been improved, with a number of registers doubled. This allows two threads to be interleaved without any loss of time: as soon as the first is stalled by a lack of resources (waiting on memory, for example), execution is ‘handed over’ to the other thread, using its associated registers. Although earlier versions of this technology, which proved effective for databases, lowered performance in High-Performance Computing (HPC), it has been established that this is not the case with the new core.

A larger cache (8MB) shared between the four cores also delivers significant improvements. It replaces the caches shared two-by-two in the Penryn architecture, and enables very low ‘miss’ rates. Data at this level can therefore be accessed more rapidly by the four cores. Nehalem uses three levels of cache, where Level 3 is the shared cache. Each core reaches this shared last-level cache via its Level 1 caches, which remain unchanged at 32KB for instructions and 32KB for data, and a Level 2 cache of 256KB. All these are inclusive caches, which simplifies coherence management. The access latencies to these caches are 4 cycles (1.4ns) for L1, 10 cycles (3.3ns) for L2, and 40 cycles (13ns) for L3.
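The conversion from cycles to nanoseconds depends on the core clock, which the article does not specify. As a rough sketch, the snippet below assumes a 3.0 GHz clock (a round value picked for illustration), which gives results close to the figures quoted above; other clock speeds shift the numbers accordingly.

```python
# Cache latencies in core cycles, as quoted in the article.
CACHE_LATENCY_CYCLES = {"L1": 4, "L2": 10, "L3": 40}

# Assumed clock frequency; the article does not state which one was used.
CLOCK_GHZ = 3.0


def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds: one cycle lasts 1/clock_ghz ns."""
    return cycles / clock_ghz


for level, cycles in CACHE_LATENCY_CYCLES.items():
    latency = cycles_to_ns(cycles, CLOCK_GHZ)
    print(f"{level}: {cycles:2d} cycles = {latency:.1f} ns at {CLOCK_GHZ} GHz")
```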

Unprecedented functionality at the heart of the processor to optimize energy efficiency 
Another extremely interesting aspect of the Nehalem-EP socket is the ability to manage the processor’s power consumption. The processor frequency and supply voltage can be changed dynamically, which allows power consumption to be controlled, since part of that consumption is proportional to the frequency and to the square of the voltage. This mechanism can also put parts of the circuit into ‘idle’ mode by reducing their frequency to 0, and it allows the electricity consumption of each individual socket to be measured.
These control mechanisms can be used either to limit consumption or to increase performance. For example, when some cores are unused, they can be switched into ‘idle’ mode, either to follow an intelligent energy management policy and stay below a pre-defined level of energy consumption, or to use the energy made available to increase the processor’s speed of execution: the so-called ‘Turbo’ mode.
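The relationship between consumption, frequency and voltage described above can be sketched numerically. This is a simplified model, not Intel's power management algorithm: it covers only the part of consumption proportional to frequency and to the square of the voltage, ignores static leakage, and uses hypothetical operating points. The function name is illustrative.

```python
def relative_dynamic_power(freq_ratio: float, voltage_ratio: float) -> float:
    """Dynamic power relative to a baseline, using P proportional to f * V**2.

    Both arguments are ratios against the baseline operating point
    (1.0 means unchanged). Static leakage is deliberately ignored.
    """
    return freq_ratio * voltage_ratio ** 2


# Hypothetical operating points, chosen only to show the scaling.
scenarios = {
    "baseline": (1.0, 1.0),
    "frequency and voltage both 10% lower": (0.9, 0.9),
    "frequency 10% higher, voltage 5% higher": (1.1, 1.05),
    "idle (frequency set to 0)": (0.0, 1.0),
}

for name, (f_ratio, v_ratio) in scenarios.items():
    power = relative_dynamic_power(f_ratio, v_ratio)
    print(f"{name}: {power:.2f}x baseline dynamic power")
```

Because voltage enters as a square, lowering frequency and voltage together saves proportionally more power than lowering frequency alone; conversely, the budget freed by idle cores is what lets the remaining cores run faster within the same overall limit.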

Reference diagram for a Nehalem-EP dual-socket server


From processor to applications, the advantages of Nehalem-EP have been directly evaluated by Bull’s experts 

In conclusion, let us compare the use of two successive generations of architectures, Harpertown and Nehalem-EP, in two Bull servers. For each server, we will consider two sockets, four cores per socket, utilized at maximum frequency. Our examination of the two configurations led to the following conclusions:

1 nm: nanometers

2 GT/s: Giga Transfers per second 

3 Gigaflops: one billion floating-point operations per second 

4 Ktpm: Kilo transactions per minute