Marc Appell took part in several R&D programs involving the development of successive generations of CMOS processors. Before his current involvement in the FAME (Flexible Architecture for Multiple Environments) projects for Intel-based servers, Marc managed Bull’s hardware development team in Phoenix, Arizona during the Olympus/DPS 9000 program.
The power paradigm: Moore’s law overruled?
For decades now, processor computing power has been constantly growing. Gordon Moore, one of Intel’s founders, first modeled this sustained growth in 1965, and revised his projection in 1975, in what has come to be known as ‘Moore’s law’. The new generation of Intel® processors in the Itanium® 2 family, code-named ‘Montecito’ – after a small town of 10,000 inhabitants in Santa Barbara County, California – is no exception to the rule.
This ongoing increase in processor power goes hand in hand with equally continuous progress in semiconductor etching techniques. Current versions of Itanium® 2 processors feature 130 nm transistor gate technology (1 nanometer = 10⁻⁹ meter). Montecito processors feature 90 nm transistor gate technology: about 30% smaller.
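The shrink figure quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are ours, and the area figure assumes ideal scaling, in which the surface of a transistor shrinks with the square of its linear dimension. Real processes rarely scale that cleanly.

```python
def linear_shrink(old_nm: float, new_nm: float) -> float:
    """Fraction by which the feature size is reduced (0.0 to 1.0)."""
    return 1.0 - new_nm / old_nm


def ideal_area_ratio(old_nm: float, new_nm: float) -> float:
    """Area of a scaled feature relative to the original.

    Assumption: ideal scaling, area proportional to the square of the
    linear dimension.
    """
    return (new_nm / old_nm) ** 2


if __name__ == "__main__":
    print(f"linear shrink: {linear_shrink(130, 90):.1%}")          # 30.8%
    print(f"ideal area ratio: {ideal_area_ratio(130, 90):.1%}")    # 47.9%
```

Under that idealized assumption, the same circuit would occupy roughly half the silicon surface, which is what leaves room for the extra cores and cache discussed in the rest of this article.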
Montecito’s promises and innovations
Multi-core and multi-threading in Intel® Itanium® 2 processors
For quite some time now, this technological trend has delivered additional performance through higher operating frequencies combined with improved internal processor algorithms (branch prediction, multi-level caches, out-of-order instruction execution).
Continuing this trend, the sheer number of logic gates and memory cells that a single chip houses today opens the door to new innovations that Montecito puts into play:
• ‘Multi-Core’: several full processors on a single chip. Montecito has two cores (dual core) – this is just a first step, since in the near future processors will have four cores, or even more.
• ‘Multi-threading’ (multiple instruction threads): the possibility for one core to keep several independent instruction threads in flight. During any given cycle, a core works for only one instruction thread, but since it has two full sets of context registers, it can rapidly switch to a second instruction thread. Each Montecito core can thereby handle two instruction threads. This mechanism optimizes the use of all the core’s units: as soon as an instruction thread has to be put on hold pending an event – for example to access a piece of data that is not immediately available – the core switches over to the execution of the other instruction thread.
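The benefit of switching threads on a stall can be illustrated with a toy simulation. This is a deliberately simplified model and not a description of Montecito’s actual scheduling logic: the `Thread` class, the `run` function and the ten-cycle stall are all assumptions made for the example. Each operation is represented by the number of cycles its thread must then wait (0 means no wait), and the core issues at most one operation per cycle.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    ops: list          # stall cycles caused by each operation (0 = no stall)
    pc: int = 0        # index of the next operation to issue
    ready_at: int = 0  # first cycle at which this thread may issue again

    def done(self) -> bool:
        return self.pc >= len(self.ops)


def run(threads) -> int:
    """Cycles needed to finish all threads on one core (one issue per cycle)."""
    cycle = 0
    current = 0  # the core stays on the current thread until it stalls
    while not all(t.done() for t in threads):
        # Prefer the current thread, then fall back to the others.
        order = [current] + [i for i in range(len(threads)) if i != current]
        runnable = [i for i in order
                    if not threads[i].done() and threads[i].ready_at <= cycle]
        if runnable:
            current = runnable[0]
            t = threads[current]
            stall = t.ops[t.pc]
            t.pc += 1
            t.ready_at = cycle + 1 + stall
        cycle += 1  # the cycle is lost if no thread was runnable
    return max([cycle] + [t.ready_at for t in threads])


if __name__ == "__main__":
    work = [0, 10, 0]  # the middle operation waits 10 cycles for data
    back_to_back = run([Thread(list(work))]) + run([Thread(list(work))])
    interleaved = run([Thread(list(work)), Thread(list(work))])
    print(back_to_back, interleaved)  # 26 15
```

In this toy case, the second thread does useful work while the first one waits for its data, so the two threads together finish in 15 cycles instead of 26. The actual gain on real software depends on how often, and for how long, its threads stall.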
Montecito features both mechanisms, calling for a change in our usual vocabulary. Traditionally no distinction was made between three concepts: one chip housed one processor able to execute one instruction thread.
With Montecito these three concepts are distinct. One chip (in its package) is connected to the electronic board via a connector (socket). It houses two cores, each of which behaves, from a software point of view, as two independent logical processors (two threads).
Thus, a NovaScale 5245 server with 24 sockets potentially becomes a multiprocessor server with 96 logical processors (24 sockets x 2 cores x 2 logical processors).
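The socket, core and thread arithmetic above is simple enough to write down as a one-line function. The function name and default values are ours, chosen to match the Montecito figures given in this article.

```python
def logical_processors(sockets: int,
                       cores_per_socket: int = 2,
                       threads_per_core: int = 2) -> int:
    """Number of logical processors visible to the software."""
    return sockets * cores_per_socket * threads_per_core


if __name__ == "__main__":
    print(logical_processors(24))  # 96, the 24-socket example from the text
```

The same function gives 128 logical processors for a 32-socket configuration.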
This explosion in the number of processors translates into usable computing power only when those processors are efficiently put to use by software applications. Most server software – particularly database management, Web server and HPC applications – is already designed with this revolution in mind. Further progress can be expected in the future.
Last year’s ‘Fall Processor Forum’ in San Jose, California focused on “The Road to MultiCore”. The opening speech to this annual hardware design workshop was delivered by Herb Sutter from Microsoft. Under the title “Software and the Concurrency Revolution”, he described how this technological revolution challenges the ‘threads and locks’ paradigm (parallel threads protected by software locks), on which most software applications currently base multiple parallel instruction thread management.
24 MB of cache and Pellston technology
Technological headway also allows Montecito to provide much larger on-chip caches: up to an impressive 12 MB of Level 3 cache for each core. By way of comparison, the equivalent maximum in current Itanium® 2 processors is 9 MB, so Montecito offers over 30% more per core.
These caches hide memory access time: data found in the cache is delivered without waiting for main memory. The resulting improvement in performance depends on the execution profile of the software application.
The two 12 MB caches occupy a large proportion of the Montecito chip silicon surface and are therefore sensitive to silicon defects and alpha radiation. To protect data integrity, Intel has developed a dedicated technology called Pellston technology.
Pellston technology allows a faulty cache line to be disabled (one line of Montecito cache holds 128 bytes). During initialization, the BIOS tests all cache lines and disables those presenting a fault. In addition, if an error is detected dynamically, the Pellston algorithm tests the faulty line and distinguishes a transient error from a permanent (‘solid’) one.
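The two phases described above (a test of every line at initialization, then a re-test when an error appears during operation) can be sketched as follows. This is a conceptual illustration only, not Intel’s algorithm: the `Cache` class, the `line_ok` self-test hook and the retry count are hypothetical, and the sketch assumes that a line found to have a solid fault at run time is disabled in the same way as at initialization.

```python
LINE_SIZE = 128  # bytes per cache line, as stated in the text


class Cache:
    def __init__(self, n_lines: int, line_ok):
        # line_ok(index) -> bool stands in for a hardware self-test
        # (hypothetical hook, for illustration only).
        self.n_lines = n_lines
        self.line_ok = line_ok
        self.disabled = set()

    def boot_test(self) -> None:
        """Initialization: test every line and disable the faulty ones."""
        for index in range(self.n_lines):
            if not self.line_ok(index):
                self.disabled.add(index)

    def on_runtime_error(self, index: int, retries: int = 3) -> str:
        """Re-test a line after an error was detected during operation.

        An error that does not reproduce is classed as transient; one that
        reproduces is classed as solid and the line is disabled (assumption).
        """
        if all(self.line_ok(index) for _ in range(retries)):
            return "transient"
        self.disabled.add(index)
        return "solid"

    def usable_bytes(self) -> int:
        return (self.n_lines - len(self.disabled)) * LINE_SIZE


if __name__ == "__main__":
    bad_lines = {3}
    cache = Cache(8, lambda i: i not in bad_lines)
    cache.boot_test()
    print(cache.disabled, cache.usable_bytes())  # {3} 896
    print(cache.on_runtime_error(5))             # transient
```

Losing one 128-byte line out of a 12 MB cache has a negligible effect on capacity, which is why disabling individual lines is an attractive way to tolerate isolated defects.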
Accelerated memory access
Another major enhancement is the acceleration of transfers between Montecito cores and memory. These transfers go through a system bus, the Front Side Bus (FSB). The current Itanium® 2 processors connect to a 400 MHz FSB (one exchange every 2.5 ns). Montecito currently allows the use of a 533 MHz FSB and will soon allow the use of a 667 MHz FSB.
Physical laws of electricity restrict these speeds to so-called three-load bus configurations, in which the FSB links a maximum of three elements. The memory-controlling chip must also be adapted to these new frequencies.
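The relation between FSB frequency and the time between two exchanges is a simple inverse. The short sketch below, whose function name is ours, reproduces the 2.5 ns figure for 400 MHz and applies the same conversion to the two faster bus speeds mentioned above.

```python
def exchange_period_ns(freq_mhz: float) -> float:
    """Time between two bus exchanges, in nanoseconds."""
    return 1000.0 / freq_mhz


if __name__ == "__main__":
    for freq in (400, 533, 667):
        print(f"{freq} MHz -> {exchange_period_ns(freq):.2f} ns")
    # 400 MHz -> 2.50 ns
    # 533 MHz -> 1.88 ns
    # 667 MHz -> 1.50 ns
```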
Controlled thermal dissipation
All these developments, while enabling increased processor computing power, also tend to increase electrical power consumption, and as a result, thermal dissipation. The thermal envelope for Itanium® 2 processors is 130W. Server cooling mechanisms have been designed to meet this specification.
Intel’s engineers have used various techniques to limit the thermal dissipation of each Montecito chip to 100W (chip with 2 cores, 4 threads and 24 MB of cache). This is a significant technological improvement compared to earlier Itanium® 2 processors, which were closer to 130W.
Bull NovaScale® servers are Montecito-ready
The new NovaScale server range is the perfect partner for Montecito processors.
One of the features of the recently announced NovaScale 3005 two- and four-socket servers is the potential use of a high-speed FSB, particularly beneficial to High-Performance Computing applications, which use all the memory bandwidth they can get.
The new range of NovaScale 5005 medium and large-scale servers, with modular configurations from 8 to 32 sockets, has been designed around a new version of the FAME Scalability Switch (FSS), developed especially by Bull for optimum use of Montecito and its successor Montvale.
Each NovaScale drawer includes two FSS chips. The FSS chip, 18.3 mm x 18.3 mm in size, uses 180 nm technology and communicates via four Scalability Ports (SP: 0.8 GHz bidirectional serial links) with the other components in the drawer. Each FSS chip can also communicate, via two eXtended Scalability Ports (XSP: 2.5 GHz serial links), with two other NovaScale 5005 drawers, allowing the ring interconnection of up to four drawers forming a 32-socket server.
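The ring arrangement can be pictured with a small model. The sketch below is our own illustration and says nothing about the actual FSS routing: the function names are hypothetical, and the figure of 8 sockets per drawer is inferred from the 32-socket, four-drawer maximum quoted above.

```python
SOCKETS_PER_DRAWER = 8  # assumption: 32 sockets spread over 4 drawers


def ring_neighbours(drawer: int, n_drawers: int) -> list:
    """Drawers directly linked to `drawer` in a ring of n_drawers."""
    if n_drawers < 2:
        return []
    return sorted({(drawer - 1) % n_drawers, (drawer + 1) % n_drawers})


def ring_hops(a: int, b: int, n_drawers: int) -> int:
    """Smallest number of drawer-to-drawer links between drawers a and b."""
    distance = abs(a - b) % n_drawers
    return min(distance, n_drawers - distance)


if __name__ == "__main__":
    n = 4
    print(n * SOCKETS_PER_DRAWER)  # 32
    print(ring_neighbours(0, n))   # [1, 3]
    print(ring_hops(0, 2, n))      # 2
```

In a four-drawer ring, each drawer has exactly two neighbours, which matches the two XSP links per FSS chip, and no drawer is more than two links away from any other.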
The FSS chip contains a ‘directory’ to limit the flow of requests between the various components of the server, a limitation vital to the effectiveness of any large-scale server. The ‘directory’ of the new-generation FSS chip is able to handle the caches of all connected cores with up to 13 MB cache per core.
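To see why a directory limits the flow of requests, consider the generic principle behind directory-based cache coherence. The sketch below is a textbook-style illustration and not a description of the FSS internals: the `Directory` class, its methods and the node numbering are assumptions made for the example.

```python
class Directory:
    """Tracks which nodes may hold a copy of each cache line."""

    def __init__(self, n_nodes: int):
        self.n_nodes = n_nodes
        self.holders = {}  # line address -> set of node ids caching it

    def record_fill(self, node: int, addr: int) -> None:
        """Note that `node` has loaded the line at `addr` into its cache."""
        self.holders.setdefault(addr, set()).add(node)

    def snoop_targets(self, requester: int, addr: int) -> list:
        """Nodes that must be queried when a directory is available."""
        return sorted(self.holders.get(addr, set()) - {requester})

    def broadcast_targets(self, requester: int) -> list:
        """Nodes that would be queried without a directory."""
        return [n for n in range(self.n_nodes) if n != requester]


if __name__ == "__main__":
    directory = Directory(n_nodes=4)
    directory.record_fill(node=2, addr=0x80)
    print(directory.snoop_targets(0, 0x80))   # [2]
    print(directory.snoop_targets(0, 0x100))  # []
    print(directory.broadcast_targets(0))     # [1, 2, 3]
```

Without a directory, every request has to be sent to every other node. With one, a request goes only to the nodes known to hold the line, and to none at all when the line is not cached anywhere, which is what keeps traffic manageable as the number of sockets grows.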
Close co-operation with Intel
Throughout the Montecito project, Bull’s engineers worked in close co-operation with their Intel counterparts.
At the end of 2004, the first Montecito samples, in ‘stepping A’ version, were delivered to our laboratories in Les Clayes-sous-Bois and tested on NovaScale servers. Need we point out that not all the features described in this article were fully operational at that stage? However, we were able to carry out preliminary trials and to test new BIOS and NovaScale server administration tool features, in particular those dedicated to Multi-Core and Multi-Thread processes.
We continued our tests in 2005 … through to the delivery by Bull, at the end of the year, of the TERA-10 supercomputer to the CEA (French Atomic Energy Authority), with several thousand Montecito chips installed and in operation, the very proof that NovaScale hardware and software are truly Montecito-ready.