Bull NovaScale Intensive servers are the third generation of servers designed by Bull to be based on Intel Itanium processors. They are modular, mainframe-class, high-end 64-bit SMP servers featuring powerful partitioning functions. They can run under multiple environments: Linux®, Windows® and GCOS. They are also at the heart of the DataScale™ offering recently launched by Bull and its partner Oracle.
Designed to support major databases, Business Intelligence applications and High-Performance Computing (HPC), these servers are based on the EPIC (Explicitly Parallel Instruction Computing) architecture, and can be fitted with up to 32 dual-core Intel Itanium 64-bit processors. They all feature the same systems administration and management software: NovaScale Master.
The RAS imperatives (for Reliability, Availability and Maintainability) have been taken into account right from the architecture design stage, all the way through to the system roll-out for the customer.[1]
The reliability of NovaScale servers rests on three things: design quality, choice of components, and the manufacturing process itself.
Mainframe-class design and architecture
The design and validation of NovaScale Intensive servers have been undertaken from start to finish by Bull’s R&D teams in France. The design draws essentially on two of Bull’s real strengths:
- Solid experience in mainframe design
- Close partnerships with Intel, Microsoft and other partners’ R&D teams.
Reliability prediction models were systematically set up for each configuration, based on mathematical models and on MTBF[2] data provided by component suppliers.
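As an illustration of how supplier MTBF data can feed a reliability prediction, here is a minimal sketch. It assumes independent components with constant failure rates arranged in series, which is a common textbook model and not necessarily the one Bull used. The component names, the MTBF figures and the 4-hour repair time are invented for the example.

```python
# Illustrative only: component names and MTBF figures are invented,
# they are not Bull or supplier data.

def series_mtbf(mtbf_hours):
    """MTBF of a system that fails as soon as any one component fails.

    Assumes independent components with constant failure rates,
    so the individual failure rates (1 / MTBF) simply add up.
    """
    total_failure_rate = sum(1.0 / m for m in mtbf_hours)
    return 1.0 / total_failure_rate


def availability(mtbf, mttr):
    """Steady-state availability from MTBF and mean time to repair (MTTR)."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    components = {
        "processor board": 400_000.0,
        "memory board": 300_000.0,
        "power supply": 200_000.0,
        "fan": 150_000.0,
    }
    system_mtbf = series_mtbf(components.values())
    print(f"System MTBF: {system_mtbf:,.0f} hours")
    print(f"Availability with a 4 h repair time: {availability(system_mtbf, 4.0):.6f}")
```

The series model is deliberately pessimistic: any single failure counts as a system failure, which is exactly what redundancy is meant to avoid.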
From the initial design stage, the operating temperature defined for the NovaScale FSS[3] infrastructure was set below the maximum permitted by the technical specification. This means that Bull can guarantee a longer service life for the components, and therefore for NovaScale servers. A sophisticated dynamic cooling mechanism and high-output fans were designed to achieve this.
The system architecture of these NovaScale servers, combined with the use of component redundancy, enables servers to be configured with no Single Point Of Failure (SPOF).[4]
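The effect of redundancy on availability, and the idea of checking a configuration for single points of failure, can be sketched as follows. This is a simplified model that assumes independent failures. The function names, component names and counts are hypothetical and do not describe an actual NovaScale configuration.

```python
def redundant_availability(unit_availability, copies):
    """Availability of a group that works while at least one copy works.

    Assumes the copies fail independently of one another.
    """
    return 1.0 - (1.0 - unit_availability) ** copies


def single_points_of_failure(installed, required):
    """Return the component types that have no spare unit.

    installed: mapping of component type -> number of units fitted
    required:  mapping of component type -> number of units needed to run
               (defaults to 1 when a type is not listed)
    """
    return sorted(
        name
        for name, count in installed.items()
        if count <= required.get(name, 1)
    )


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    installed = {"power supply": 2, "fan": 3, "management board": 1}
    required = {"power supply": 1, "fan": 2, "management board": 1}
    print(single_points_of_failure(installed, required))
    print(redundant_availability(0.99, 2))
```

In this invented configuration the only component type reported is the one fitted without a spare, which is the situation a no-SPOF configuration is designed to rule out.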
Respect for standards and application of quality procedures
As well as conforming to current regulatory standards, NovaScale servers benefit from the ISO 9001 certification awarded to Bull for the application and management of its product life-cycles. By the same token, Bull’s manufacturing and logistics division – always keen to make its contribution to environmental protection and sustainable development – has obtained a number of QSE certifications including ISO 9001 (Quality), OHSAS 18001 (Occupational Health and Safety) and ISO 14001 (Environmental Management).
Added value in manufacturing
NovaScale Intensive servers are manufactured entirely at Angers, at the heart of Bull’s industrial site in France. Here too, manufacturing processes are key to the reliability of these servers. Rigorous testing is carried out at every stage of the manufacturing process to verify the operating quality of each component.
For each customer order, servers are assembled and then mounted into high or low-level racks. The wiring for the whole configuration is then completed, including the server itself, as well as storage, back-up and network components. The operating system is also pre-loaded and all components included in the order (PCI cards, memory cards, disks, processors, partitioning...) are put through functional tests.
The manufacturing process also includes multiple testing sequences over a period of several days, enabling parameters such as temperature and voltage to be varied, in order to check that servers function correctly when subjected to the most extreme operating conditions possible. The last stage is a resilience test carried out on each NovaScale server, in its final configuration, immediately before delivery.
These rigorous testing sequences ensure consistent levels of quality and robustness for all servers delivered, so they are ready to go into service immediately in the customer’s data center.
Once the server is up and running for the customer, the Production Manager or Server Administrator has access to all the functions needed to ensure server administration and availability.
Ensuring effective server monitoring
The task of monitoring NovaScale servers is handled by NovaScale Master, the secure administration platform designed by Bull to operate alongside existing system supervision solutions such as Evidian Open Master.
NovaScale Master is built around a specific environment – PAM (Platform Administration Management) – whose design is directly inspired by mainframe system management. It supervises NovaScale Intensive servers in operation, enabling early identification and/or warning of impending failure. This interface brings together all of the NovaScale server’s RAS capabilities.
Detecting a breakdown
In order to detect any failures in the architecture of NovaScale Intensive servers, there is an ongoing dialogue between the PMB (Platform Maintenance Board) card in each module and the administration station (PAP). This dialogue functions in two modes: ‘out-of-band’ mode (the O/S is not involved) and ‘in-band’ mode (the O/S is part of the process).
The PMB card, designed and built by Bull, is a removable card that forms a sub-assembly of the NovaScale Intensive server module. It is the cornerstone of the server’s administration and maintenance: it administers the power-up and initialization routines for the other components, maintains their operating status, and reconstructs the sequences of events leading up to errors, faults, failures and changes in status.
Locating the fault precisely and setting off an instantaneous warning
Thanks to a comprehensive display showing the presence and operational status of all server components via the PAM’s graphical interface, a failure can be located precisely and very quickly. The authorized Systems Administrator has access to all this information via NovaScale Master, using a simple Internet browser.
If an incident takes place during a power-up, shut-down, forced shut-down or domain re-initialization, or during normal operation, a message is displayed in the status panel for the domain concerned, and a trace is saved in the start-up and shut-down logs for that domain. The PAM software can signal the incident to connected and non-connected users via:
- The PAM Web interface (status page and/or user log files)
- Electronic mail (email) or text message (SMS) for authorized users
- Automatic dial-up to Bull’s support center (depending on the user’s maintenance contract) to analyze the incident and implement corrective maintenance or preventive measures.
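The notification flow described above (keep a trace, then alert users over several channels) can be sketched in a few lines. This is a hypothetical illustration and not the PAM software or its API: the `Incident` and `Notifier` classes, the channel names and the handlers are all assumptions made for the example, and the handlers only print instead of sending real email or SMS messages.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Incident:
    domain: str
    phase: str      # e.g. "power-up", "shut-down", "normal operation"
    message: str


@dataclass
class Notifier:
    """Hypothetical dispatcher: logs each incident, then alerts every channel."""
    channels: Dict[str, Callable[[Incident], None]] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)

    def register(self, name: str, handler: Callable[[Incident], None]) -> None:
        self.channels[name] = handler

    def report(self, incident: Incident) -> List[str]:
        # Always keep a trace first, so the incident is recorded
        # even if every notification channel fails.
        self.history.append(f"[{incident.domain}] {incident.phase}: {incident.message}")
        notified = []
        for name, handler in self.channels.items():
            try:
                handler(incident)
                notified.append(name)
            except Exception as exc:  # one failing channel must not block the others
                self.history.append(f"channel {name} failed: {exc}")
        return notified


if __name__ == "__main__":
    notifier = Notifier()
    notifier.register("web", lambda inc: print("status page:", inc.message))
    notifier.register("email", lambda inc: print("mail to admins:", inc.message))
    notifier.report(Incident("domain-1", "power-up", "fan speed below threshold"))
```

Isolating each channel in its own `try` block reflects the point of having several channels: a warning should still get through when one delivery path is unavailable.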
Correcting a failure automatically and dynamically
Many failures are handled dynamically, without having any effect on server availability, thanks to:
- ECC-type mechanisms (error detection and correction) applied to memory, but also to all the data transfer paths throughout the module, including the FSS chips. These transfers use sophisticated protocols that include powerful recovery mechanisms
- Advanced error handling (Intel® Advanced Machine Check Architecture (MCA) Technology) and cache integrity protection (Intel® Cache Safe Technology) at processor level
- Complete power-up testing
- Automated power-on and ventilation control
- Disk systems protected by RAID mechanisms.
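To show the principle behind ECC-type mechanisms, here is a minimal sketch of a Hamming(7,4) code, which stores 4 data bits in 7 bits and corrects any single flipped bit. It is a textbook example chosen for brevity: the source does not say which codes the NovaScale memory or FSS data paths use, and real server memory typically relies on wider codes.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position of the corrected bit, or 0 if no error was seen.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4   # the syndrome names the bad bit
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                      # simulate a single-bit fault in transit
    print(hamming74_decode(word))     # data is recovered, faulty position reported
```

The decoder both repairs the data and reports which bit was wrong, which is what lets a system correct an error transparently while still logging it for later analysis.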
Remote access to all availability mechanisms, and to the PAM software in particular, is also possible. The most appropriate Bull experts can be mobilized, wherever they may be, to resolve the most complex issues.
Minimizing the consequences of failure and server unavailability
A number of hardware components can be replaced with no impact on the server’s operation, including power supply units, ‘hot-swap’ fans, PCI cards and internal ‘hot-plug’ disks.
If, in the wake of an incident, the operational status of all the server’s components cannot be immediately ensured, NovaScale servers offer the possibility of recovery in a ‘downgraded configuration’. For example, cards or processors suspected of being defective can be temporarily excluded from the configuration until such time as they can be replaced. The PAM enables such exclusions to be made with no need for any physical intervention on the platform hardware. The affected server or partition is immediately relaunched.
In addition, the NovaScale Intensive server physical partitioning functions mean that partitions are physically independent and isolated from one another. One partition can be disconnected without this having any effect on the other active partitions. So, in the event of an incident or maintenance operation on a partition, most of the hardware components for this partition can be physically replaced without having any effect on the others. Physical partitioning for NovaScale Intensive servers is used particularly in the largest data centers to ensure maximum availability for production activities by isolating them in a dedicated partition, while at the same time carrying on other operations, such as tests or development work,[5] in parallel, in another partition.
Keeping an open mind, analyzing the various ‘log’ and ‘reporting’ events, and understanding and deciding on possible ways forward
Once a problem has been identified as a change in status, either through a regular query or when a warning is received, the system administrator will look for additional information in order to understand what has happened: the probable causes, the sequence of events, the context of the incident, and so on. Two functions are indispensable here: the inventory information (machine type, disk capacity, OS type, number of processes, etc.) and the event reports (status sequences, digital graphs, etc.).
Both these functions are provided by NovaScale Master. The inventory information helps with understanding the context of the problem. The reporting information quantifies the problem in time (When did it happen? How many times? Did it happen gradually or suddenly?). Reporting can also be used in a preventive way to monitor system loading and performance, and so to anticipate future problems. Once a problem has been analyzed and understood, it only remains to act on the system: at best to resolve the problem, or at worst to set up a workaround.
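The three reporting questions (when, how many times, gradually or suddenly) can be answered from a plain list of event timestamps. The sketch below is a hypothetical illustration and not the NovaScale Master reporting function. The `summarize_events` name, the hourly buckets and the rule that calls a problem "sudden" when more than half of the events fall in a single hour are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime


def summarize_events(timestamps, sudden_share=0.5):
    """Summarize when and how often an event occurred.

    The problem is flagged as sudden when more than `sudden_share` of all
    occurrences fall within one clock hour (an arbitrary illustrative rule).
    Returns None when there are no events.
    """
    if not timestamps:
        return None
    ordered = sorted(timestamps)
    # Group events by the clock hour in which they occurred.
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in ordered
    )
    peak_hour, peak_count = per_hour.most_common(1)[0]
    return {
        "first": ordered[0],
        "last": ordered[-1],
        "count": len(ordered),
        "peak_hour": peak_hour,
        "sudden": peak_count / len(ordered) > sudden_share,
    }


if __name__ == "__main__":
    # Invented timestamps, for illustration only.
    events = [
        datetime(2007, 3, 1, 8, 5),
        datetime(2007, 3, 1, 10, 1),
        datetime(2007, 3, 1, 10, 20),
        datetime(2007, 3, 1, 10, 45),
    ]
    print(summarize_events(events))
```

The same summary, computed regularly on load or temperature events rather than on faults, is one way reporting can be used preventively.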
Reliability, availability, and maintainability of NovaScale servers are also reinforced by other infrastructure solutions:
- High-availability software solutions such as ARF (Application Roll-over Facility) for Linux, or MSCS (Microsoft Cluster Service) for Windows
- Load-balancing solutions such as DDFA (Dynamic Domains for Applications) for Linux or NLBC (Network Load Balancing Clusters) for Windows
- Secure storage solutions with disaster recovery, such as Bull StoreWay FDA
- Automated Recovery Plans, with precise definitions of intervention timescales and of repair and availability commitments
- Customized support via the Bull HA Center, which provides 24x7 remote surveillance of systems and takes proactive maintenance action via secure IP remote access.
The robustness and ease of administration of NovaScale servers are proven, even in the most demanding implementations. As an example, for its TERA-10 supercomputer (a cluster of more than 600 NovaScale Intensive servers) the French Atomic Energy Authority (the CEA) had very high demands in terms of service continuity and ease of administration. In such a cluster, the real rate of availability depends largely on the capacity to administer the whole structure. Monitoring of TERA-10 is based on NovaScale Master, which can handle all the NovaScale servers, the network and the storage from a single central point.
1 Server characteristics described in this document are available on the NovaScale Intensive 5000, 7000 and 9000 Series of servers.
2 MTBF: Mean Time Between Failures – estimated running time without failure.
3 FSS: FAME Scalability Switch: sophisticated chip designed by Bull (60 million transistors) that ensures each processor has input/output access, as well as a coherent view of the global memory, which can be as large as 512 GB. The temperature is maintained at 73°C, as compared with a temperature of 100°C which would have been technologically acceptable.
4 A component is considered to be a SPOF if its failure prevents the server from functioning for as long as it is left unrepaired. This optimization is effective within the same module (components: QBB, processor, memory, I/O box, FSS, PMB, PCI card, internal disk, power supply and fan) or in several modules linked together.
5 Partitioning also enables hosting on a single server of several operating systems, or of several occurrences of an operating system, for example Windows, Linux or GCOS.