A Westmere Addition to a High-Performance Nehalem
iDataPlex Cluster and DDN S2A9900
Storage
for Texas A&M University
by Michael E. Thomadakis,
Ph.D., 2010−2011 (C)
Supercomputing Facility
miket(at)tamu(dot)edu
Visit the original article A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University for an in-depth discussion of all technologies relevant to Nehalem, Nehalem-EP and the EOS cluster prior to its expansion.
All material remains copyright © 2010−2011 of Michael E. Thomadakis and of Texas A&M University. The contents of this article may be used free of charge for educational purposes, provided that this Copyright Notice will remain visible.
Table of Contents
- EOS iDP Cluster Configuration
- The Westmere Processor
- Westmere-EP Platform
- GPU Hardware on EOS Cluster
- High-Performance Interconnect
- References
- Abbreviations
- Copyright, Acknowledgements and Disclaimers
EOS is an IBM "iDataPlex" (iDP) commodity Linux high-performance cluster with nodes based on Intel's 64-bit CISC micro-processors. At its initial installation in early 2010, EOS consisted of 324 dx360-M2 iDP nodes. An expansion in late Spring 2011 added 48 dx360-M3 iDP nodes, four of which are equipped with nVidia GPUs based on the Tesla M2050 and M2070 HPC platform.
The present article discusses only those technologies applicable to the new hardware added to EOS. Readers are encouraged to visit the initial article A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University for an in-depth discussion of all technologies relevant to EOS prior to its expansion.
Abbreviation Key
We will be using different quantities to measure capacities and speeds. To avoid confusion we will be using the following notation.
Table 1 Abbreviations of Quantities
| Powers of 2 | Powers of 10 |
|---|---|
| KiB := 2^10 ("Kilo-binary-Byte") | KB := 10^3 ("Kilo-Byte") |
| MiB := 2^20 ("Mega-binary-Byte") | MB := 10^6 ("Mega-Byte") |
| GiB := 2^30 ("Giga-binary-Byte") | GB := 10^9 ("Giga-Byte") |
| TiB := 2^40 ("Tera-binary-Byte") | TB := 10^12 ("Tera-Byte") |
| PiB := 2^50 ("Peta-binary-Byte") | PB := 10^15 ("Peta-Byte") |
Usually, rates, such as data transfer or floating point operations per second, are
expressed in powers of 10, while storage sizes in powers of 2.
See this reference for a
discussion on standard international units.
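The practical difference between the two columns of Table 1 can be seen with a short calculation. The following is a minimal sketch; the function name `gib_to_gb` is only illustrative and does not come from any EOS tool.

```python
# Convert a size given in binary units (GiB) to decimal units (GB),
# following the definitions in Table 1.
GIB = 2 ** 30   # bytes in one GiB
GB = 10 ** 9    # bytes in one GB

def gib_to_gb(size_gib):
    """Return the number of decimal gigabytes in size_gib binary gigabytes."""
    return size_gib * GIB / GB

if __name__ == "__main__":
    # A compute node with 24 GiB of DRAM holds about 25.77 GB.
    print(f"24 GiB = {gib_to_gb(24):.2f} GB")
```

The gap between the two conventions is about 7% at the "Giga" level and grows with each larger prefix.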
EOS iDP Cluster Configuration
The EOS cluster currently consists of 372 iDataPlex nodes, all connected together by a Voltaire QDR 4x IB fabric. The cluster attaches to a DDN S2A9900 high-performance mass storage system. Fig. 1 illustrates the main EOS cluster components. We will use it below to explain, at a high level, how users can access the cluster.
Login EOS Nodes
EOS uses 5 rack-mount IBM DX3650-M2 units as login nodes. A sixth DX3650-M2 unit is used for the MAUI/Torque batch scheduler and for the management of the IB fabric, and it is not accessible to users. All DX3650-M2 hosts are based on the Nehalem-EP ccNUMA platform, each with 8 Nehalem cores running at 2.8 GHz and with 48 GiB of DDR3 SDRAM. One of the nodes is equipped with 128 GiB and can be used for special interactive applications manipulating very large in-memory data structures.
Computation EOS Nodes
EOS currently deploys 362 iDataPlex IBM nodes for high-performance computing. 314 of them are IBM DX360-M2 and 48 are the more recent DX360-M3 ones. DX360-M2 nodes are two-socket, quad-core per socket ccNUMA SMPs based on the Nehalem-EP platform. DX360-M3 nodes are based on the more recent Westmere-EP platform and they have two sockets, each with a six-core Westmere processor. We discuss the differences between the Nehalem and Westmere micro-architectures in the next Section.
All compute nodes are cc-NUMA Shared-memory Multi-Processors with 24 GiB of DDR3 SDRAM. DX360-M2 nodes have a total of 8 cores, whereas DX360-M3 nodes have 12 cores. The processors in both node types operate at 2.8 GHz.
Four of the new DX360-M3 nodes are equipped with nVidia Tesla M2050 and M2070 GPU infrastructure. "Tesla" is a GPU platform targeting HPC environments and the M20X0 versions are intended for clusters.
Interconnection Fabric
All EOS nodes connect together by a high-speed 4X QDR Full-Bisection Bandwidth (FBB) InfiniBand fabric. The switch gear is a Voltaire Grid Director 4700 which can be expanded to up to 648 FBB ports.
I/O Service Nodes
Four rack-mount IBM DX3650-M2 units attach to the DDN9900 HPC storage and serve GPFS files out to the entire cluster.
The Westmere Processor
"Westmere" is the nickname of the micro-architecture implementing the "Intel64" ISA on a 32nm fabrication process. The Westmere processor and the Westmere-EP platform are in most other respects identical to the corresponding Nehalem ones. "Westmere-EP" chips contain up to six cores, and two chips connect together via Intel's Quick Path Interconnect (QPI) to form a 2-socket cc-NUMA Shared-Memory Multi-Processor system.
The Intel Westmere processors, designated as "Intel Xeon 5600", are the "tick" in Intel's
"tick-tock" model of processor design. The "tick" is a new silicon process technology with
a smaller feature size, whereas the "tock" is a substantially new micro-architecture.
Nehalem was the previous "tock" (major micro-architecture enhancement) implemented on
a 45nm fabrication process. The next "tock" following Westmere is the "Sandy-Bridge"
micro-architecture.
The Westmere Processor Chip
A Westmere chip, like its Nehalem predecessor, is a Chip Multi Processor that is divided into two broad domains, namely, the core and the un-core. Fig. 2 illustrates a Westmere CMP chip and its major parts.
Fig. 2 (a) A Westmere Processor and Memory module. The processor chip contains six cores, a shared L3 cache and DRAM controllers, and Quickpath Interconnect ports.
Fig. 2 (b) A Westmere Processor Chip micro-photograph.
Referring to Fig. 2, a Westmere chip consists of six identical Intel64 cores, a Cache Interface Unit (CIU) cross-bar switch (connecting the 6 cores to the 6 L3 cache segments), the level-3 cache controller and data block memory, 1 integrated memory controller with 3 DDR3 memory channels, 2 Quick Path Interconnect ports, and auxiliary circuitry for cache-coherence, power control, system management and performance monitoring logic.
Components in the "core domain" operate with the same clock frequency as that
of the actual core, which in EOS's case is 2.8GHz. The "un-core" domain operates under a
different clock frequency. This modular organization allowed Westmere to be directly based on
Nehalem's chip design. Intel engineers designed Nehalem to be "naturally" expandable by
making provisions for higher core counts.
Differences Between Westmere and Nehalem Processors
Westmere-based processors, called by Intel "Xeon 5600 series" processors, support all the features introduced by the Nehalem micro-architecture and add the following ones.
- Westmere uses a 32nm fabrication technology. Nehalem is based on a 45nm process.
- Westmere processors pack up to 6 cores per socket as opposed to 5500 series processors which had a maximum of 4 cores.
- Westmere comes with a 12 MiB L3 cache ("Enhanced Intel® Smart Cache").
- Seven new instructions are introduced to speed up processing of encryption and decryption based on the Advanced Encryption Standard (FIPS-197).
- The memory sub-system in Westmere-EP supports the latest LV-DDR3 (1.35V) DIMMs for reduced platform power consumption.
- The Integrated Memory Controller in the Westmere supports up to two rows of memory DIMMS to operate at the highest DDR3 frequency at 1.333GT/s.
- Westmere increased the peak CPU and I/O bandwidth to DRAM memory by increasing the per socket un-core buffers to 88 from 64 in Nehalem. This "deeper" buffering supports more outstanding memory access operations per core than possible in Nehalem-EP.
- New 1 GiB virtual memory page size supported.
- Intel Trusted Execution Technology (TXT) is introduced to harden the platform against hypervisor, BIOS and rootkit attacks.
- Process context identifiers were added to TLB entries to reduce pressure on TLB updates on context switching.
- Two more Memory Type Range Registers (MTRRs) were added to help the BIOS set up memory regions more effectively.
- The APIC timer is always running to avoid drift issues during power state transitions.
- Enhanced virtualization technologies across processor and I/O.
- The Westmere chip area, at 239 mm², is 10% smaller than that of Nehalem.
Westmere is designed to be "drop-in" compatible with Nehalem, using the same LGA 1366 (Land Grid Array with 1366 pins) socket. Its thermal and power operating profiles are similar to those of the Nehalem it replaces.
Ideal Floating-Point Throughput on Westmere
For the Xeon 5660, in the steady state and under ideal conditions, each core can retire 4 double-precision or 8 single-precision floating-point operations per cycle. These rates are supported by the FP SIMD h/w within the execution pipeline of the core. Therefore, the nominal, ideal double-precision throughput of a Westmere core, a six-core socket and a 2-socket node running at 2.8 GHz is, respectively,
11.2 GigaFLOPs / sec / core = 4 FLOPs / cycle X 2.8 GHz
67.2 GigaFLOPs / sec / socket = 11.2 GigaFLOPs / sec / core X 6 cores
134.4 GigaFLOPs / sec / node = 67.2 GigaFLOPs / sec / socket X 2 sockets.
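The arithmetic above can be reproduced with a few lines of code. This is a minimal sketch using only the figures quoted in this article (4 double-precision or 8 single-precision operations per cycle, 2.8 GHz, 6 cores per socket, 2 sockets); the function name `peak_gflops` is illustrative.

```python
def peak_gflops(flops_per_cycle, clock_ghz, cores_per_socket, sockets):
    """Return ideal (per-core, per-socket, per-node) throughput in GFLOP/s."""
    per_core = flops_per_cycle * clock_ghz
    per_socket = per_core * cores_per_socket
    per_node = per_socket * sockets
    return per_core, per_socket, per_node

if __name__ == "__main__":
    # dx360-M3 (Westmere-EP): 6 cores per socket, 2 sockets, 2.8 GHz.
    for label, fpc in (("double", 4), ("single", 8)):
        core, socket, node = peak_gflops(fpc, 2.8, 6, 2)
        print(f"{label} precision: {core:.1f} / {socket:.1f} / {node:.1f} GFLOP/s")
```

Calling the same function with 4 cores per socket gives the corresponding ideal figure for a dx360-M2 (Nehalem-EP) node. These are upper bounds under ideal conditions, not measured results.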
New Instructions in Westmere
The Advanced Encryption Standard New Instructions (AES-NI) extension in Westmere provides seven new instructions to accelerate symmetric block encryption / decryption of 128-bit data blocks using the Advanced Encryption Standard, specified by NIST FIPS 197. Specifically, two instructions (AESENC, AESENCLAST) target AES encryption and two instructions (AESDEC, AESDECLAST) target AES decryption using the Equivalent Inverse Cipher. One instruction (AESIMC) targets the Inverse MixColumns transformation primitive and one instruction (AESKEYGENASSIST) targets generation of round keys from the cipher key for the AES encryption/decryption rounds. The seventh instruction, PCLMULQDQ, performs carry-less multiplication of two 64-bit operands, which supports block cipher modes such as AES-GCM.
These additional instructions may speed up applications which transfer data over
encrypted channels, such as OpenSSH/OpenSSL, or those which encrypt/decrypt data on-the-fly
in memory and on disk blocks.
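One way to check whether a given node exposes these features is to look at the CPU flags the Linux kernel reports in `/proc/cpuinfo`. The sketch below is an illustration only: the flag names (`aes`, `pclmulqdq`, `pdpe1gb`, `pcid`) are the names used by the Linux kernel, not terms from this article, and the helper names are hypothetical.

```python
# Linux /proc/cpuinfo flag names for some of the Westmere additions
# discussed above (assumed mapping, based on kernel flag naming).
WESTMERE_FLAGS = {
    "aes": "AES encryption/decryption instructions",
    "pclmulqdq": "carry-less multiplication instruction",
    "pdpe1gb": "1 GiB virtual memory pages",
    "pcid": "process context identifiers in the TLB",
}

def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()

def westmere_features_present(cpuinfo_text):
    """Map each flag of interest to True/False for the given cpuinfo text."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in WESTMERE_FLAGS}
```

On a Linux node one would pass the contents of `/proc/cpuinfo` to `westmere_features_present`; on a Nehalem-based dx360-M2 node the `aes` entry would be expected to come back `False`.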
Westmere-EP Platform and dx360-M3 Node Organization
Integrated Memory Controller and Un-Core Enhancements
The integrated memory controller (IMC) on Westmere, like its Nehalem predecessor, supports three 8-byte channels of DDR3 memory operating at up to 1.333 GigaTransfer/sec (GT/s). Fig. 3 shows the internal memory organization on the chip with the three different cache levels, local NUMA memory and the on-chip IMC.
Fig. 3 Westmere Chip plus DRAM plus QPI: On-Chip Memory Hierarchy, Data Traffic Paths through and out of the Chip
Each DDR3 channel of memory can operate independently and the IMC services requests out-of-order to minimize latency. The total theoretical bandwidth between DRAM and the IMC is 31.992 GB/s (3 channels X 8 bytes X 1.333 GT/s). On average, when all six cores access memory concurrently, each core's share of this bandwidth is 5.332 GB/s.
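The bandwidth figures follow directly from the channel parameters given above. A minimal sketch, with illustrative function names:

```python
def dram_bandwidth_gb_s(channels=3, bytes_per_transfer=8, gigatransfers_per_s=1.333):
    """Theoretical per-socket DRAM bandwidth in GB/s (powers of 10)."""
    return channels * bytes_per_transfer * gigatransfers_per_s

def per_core_share_gb_s(socket_bandwidth_gb_s, cores=6):
    """Average share of the socket bandwidth when all cores are active."""
    return socket_bandwidth_gb_s / cores

if __name__ == "__main__":
    socket_bw = dram_bandwidth_gb_s()
    print(f"per socket: {socket_bw:.3f} GB/s")
    print(f"per core  : {per_core_share_gb_s(socket_bw):.3f} GB/s")
    print(f"per node  : {2 * socket_bw:.3f} GB/s (2 sockets, local accesses only)")
```

These are theoretical peak values; sustained memory throughput measured by applications is lower.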
Note that Nehalem Un-Core had 64 buffers per socket available for outstanding DRAM
operations. Westmere increased this to 88 to allow more memory access operations
in-flight. A separate publication will investigate this matter in depth and it will
demonstrate the actual per core, per socket and platform memory access throughput
vs. theoretical numbers. For comparison purposes we will also provide the same results for
the Nehalem-EP platform.
Local vs. Remote Memory Access in the Westmere-EP Platform
The Westmere-EP is a cc-NUMA platform with a 2-way memory NUMA organization. Fig. 4 shows the local and remote memory organization on the Westmere-EP platform. There is one DRAM module associated with each one of the 6-core Westmere processor chips and it is managed by the on-chip IMC.
The cc-NUMA organization implies that memory accesses to the local DRAM module have different latency and bandwidth profiles than accesses to the remote memory module. Applications sensitive to memory latency or bandwidth need to take special care with the placement of their threads and data blocks. Linux gives applications the ability to choose the set of processors each thread is allowed to run on, and to select, at the VM page level, the NUMA node from which memory is allocated.
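As an illustration of the thread-placement half of this, the sketch below restricts the calling process to a chosen set of logical CPUs using the Linux scheduler affinity interface exposed by Python's `os` module. The function name `pin_to_cpus` is hypothetical, and which logical CPU numbers belong to which socket on a dx360-M3 node is an assumption the user must verify on the node itself. Memory placement is not covered here; it is typically controlled with tools such as `numactl` or the libnuma library.

```python
import os

def pin_to_cpus(cpu_ids):
    """Restrict the caller to the given logical CPUs (Linux only).

    Returns the affinity set in effect after the call.
    """
    allowed = os.sched_getaffinity(0)        # CPUs the caller may use now
    target = allowed & set(cpu_ids)          # ignore CPUs we cannot use
    if not target:
        raise ValueError("none of the requested CPUs is available")
    os.sched_setaffinity(0, target)          # pid 0 refers to the caller
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    # Assumption: logical CPUs 0-5 are the six cores of the first socket.
    print("now running on CPUs:", sorted(pin_to_cpus(range(6))))
```

Pinning threads to the cores of one socket keeps them close to the DRAM module managed by that socket's IMC, provided the data they touch was also allocated there.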
dx360-M3 SMP Nodes Architecture
The IBM dx360-M3 iDataPlex nodes implement the Westmere-EP platform. They are very similar to the dx360-M2 of the initial EOS installation. For more details, please refer to the companion publication "A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University".
GPU Hardware on EOS Cluster
The recent expansion of EOS introduced GPU hardware to the cluster. Currently versions 4.x and 3.x of the Nvidia CUDA Toolkit are supported.
dx360-M3 SMP Nodes With nVidia Tesla T20 GPU Infrastructure
Four of the dx360-M3 nodes support the nVidia T20 (nVidia "Tesla M2050" and "M2070") GPU infrastructure. The Tesla GPU h/w is based on the latest Fermi GF100 GPU processor chip and it can be programmed to off-load data-parallel, compute-intensive workloads from the Westmere CPUs. The four Westmere nodes and their associated Fermi GPU hardware are:
Table 2 GPU-Enabled EOS Nodes
| Node Name | GPU Hardware | Quantity |
|---|---|---|
| node349 | Tesla M2050 | 2 |
| node350 | Tesla M2070 | 1 |
| node361 | Tesla M2050 | 2 |
| node362 | Tesla M2070 | 1 |
The Nvidia CUDA environment supports the entire CUDA Development Toolkit, SDKs and sample code, as well as a number of common numerical libraries with code that is highly optimized for the Fermi GPU architecture. These include Nvidia CUFFT (Fast Fourier Transform), CUBLAS (BLAS routines), CUSPARSE (sparse matrix routines), CURAND (random number generator routines), NPP (performance primitives for imaging), etc.
Fig. 5 illustrates a dx360-M3 node with two Tesla M2050 GPU devices attached. The two M2050 devices attach to a common PCIe bus, along with the QDR IB adapter, via a PCIe switch.
nVidia Tesla M2070 GPU Performance on EOS
In this Section we present some basic performance results of applications running exclusively on a single GPU on EOS. We built the SHOC (Scalable HeterOgeneous Computing) benchmark suite using the CUDA Toolkit v4.0, which is available on EOS. We exercised this GPU benchmark suite on one of the Tesla M2070 nodes.
This benchmark suite includes code that evaluates the performance of the GPU and the attached host system at three levels. The first level ("Level 0") measures basic low-level performance of the system, such as transfer of data between host and device memories, device memory access, and the maximum number of single and double precision flop/s on the GPU. The next level ("Level 1") uses more complex code such as FFT, BLAS, prefix-sum, sorting or sparse matrix-vector multiplication. Finally, the last level ("Level 2") uses actual application code, such as S3D, which is a compute-intensive kernel from the S3D turbulent combustion simulation application.
These performance results should provide a good indication of the computation power of a single GPU. Note that there are libraries with code optimized to run on Nvidia GPUs which can be invoked by application code running on the host side (the CPUs). The routines in these libraries can provide performance levels similar to those presented below.
Table 3 Performance of Tesla M2070 GPU on EOS Nodes
| Name | Description | Results |
|---|---|---|
| BusSpeedDownload | bandwidth of transferring data across the PCIe bus from host memory to device memory | 6.0266 GB/s |
| BusSpeedReadback | bandwidth of reading data back from a device across the PCIe bus | 6.5736 GB/s |
| MaxFlops | maximum achievable floating point performance using a combination of auto-generated and hand coded kernels | 1002.5300 Gflop/s single precision; 503.3490 Gflop/s double precision |
| DeviceMemory | bandwidth of memory accesses to various types of device memory (global, local, and image) | 86.9093 GB/s gmem_readbw; 11.8742 GB/s gmem_readbw_strided; 101.0080 GB/s gmem_writebw; 3.7163 GB/s gmem_writebw_strided; 359.2450 GB/s lmem_readbw; 439.3810 GB/s lmem_writebw; 70.5548 GB/s tex_readbw |
| FFT | forward and reverse 1D FFT | 293.8790 Gflop/s fft_sp; 30.3562 Gflop/s fft_sp_pcie; 294.6240 Gflop/s ifft_sp; 30.3642 Gflop/s ifft_sp_pcie; 138.7820 Gflop/s fft_dp; 15.0926 Gflop/s fft_dp_pcie; 139.0630 Gflop/s ifft_dp; 15.0959 Gflop/s ifft_dp_pcie |
| SGEMM | matrix-matrix multiply | 597.5220 Gflop/s sgemm_n; 593.7600 Gflop/s sgemm_t; 523.2060 Gflop/s sgemm_n_pcie; 520.3190 Gflop/s sgemm_t_pcie; 297.5860 Gflop/s dgemm_n; 296.5840 Gflop/s dgemm_t; 240.0230 Gflop/s dgemm_n_pcie; 239.3710 Gflop/s dgemm_t_pcie |
| MD | computation of the Lennard-Jones potential from molecular dynamics | 27.8055 GB/s md_sp_bw; 12.4559 GB/s md_sp_bw_pcie; 33.7287 GB/s md_dp_bw; 17.8052 GB/s md_dp_bw_pcie |
| Reduction | reduction operation on an array of single or double precision floating point values | 89.6222 GB/s reduction; 5.6067 GB/s reduction_pcie; 90.4315 GB/s reduction_dp; 5.6141 GB/s reduction_dp_pcie |
| Scan | parallel prefix sum on an array of single or double precision floating point values | 27.8229 GB/s scan; 0.0065 GB/s scan_pcie; 19.5829 GB/s scan_dp; 0.0065 GB/s scan_dp_pcie |
| Sort | sorts an array of key-value pairs using a radix sort algorithm | 1.6585 GB/s sort; 1.0837 GB/s sort_pcie |
| Spmv | sparse matrix-vector multiplication | 0.6816 Gflop/s spmv_csr_scalar_sp; 0.4615 Gflop/s spmv_csr_scalar_sp_pcie; 0.6428 Gflop/s spmv_csr_scalar_dp; 0.3861 Gflop/s spmv_csr_scalar_dp_pcie; 0.6551 Gflop/s spmv_csr_scalar_pad_sp; 0.4506 Gflop/s spmv_csr_scalar_pad_sp_pcie; 0.6363 Gflop/s spmv_csr_scalar_pad_dp; 0.3846 Gflop/s spmv_csr_scalar_pad_dp_pcie; 9.8940 Gflop/s spmv_csr_vector_sp; 1.2491 Gflop/s spmv_csr_vector_sp_pcie; 8.6651 Gflop/s spmv_csr_vector_dp; 0.8700 Gflop/s spmv_csr_vector_dp_pcie; 10.4728 Gflop/s spmv_csr_vector_pad_sp; 1.2681 Gflop/s spmv_csr_vector_pad_sp_pcie; 9.1886 Gflop/s spmv_csr_vector_pad_dp; 0.8790 Gflop/s spmv_csr_vector_pad_dp_pcie; 7.5770 Gflop/s spmv_ellpackr_sp; 6.1492 Gflop/s spmv_ellpackr_dp |
| Stencil 2-D | a 9-point stencil operation applied to a 2-D data set. In the MPI version, data is distributed across MPI processes organized in a 2D Cartesian topology, with periodic halo exchanges | 3.6928 s stencil; 5.3140 s stencil_dp |
| Triad | a version of the STREAM Triad benchmark, implemented in CUDA (it also includes PCIe transfer time) | 7.7279 GB/s triad_bw |
| S3D | computationally-intensive kernel from the S3D turbulent combustion simulation application | 43.2002 Gflop/s s3d; 38.2952 Gflop/s s3d_pcie; 24.3993 Gflop/s s3d_dp; 21.3252 Gflop/s s3d_dp_pcie |
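Many entries in Table 3 come in pairs: a device-only figure and a `_pcie` figure that also counts the time to move data across the PCIe bus. The following sketch, which uses only values copied from the table and illustrative names, computes the fraction of device-only performance that remains once the transfers are included.

```python
# (device-only, including PCIe transfers) pairs copied from Table 3.
RESULTS = {
    "fft_sp": (293.8790, 30.3562),      # Gflop/s
    "fft_dp": (138.7820, 15.0926),      # Gflop/s
    "sgemm_n": (597.5220, 523.2060),    # Gflop/s
    "dgemm_n": (297.5860, 240.0230),    # Gflop/s
    "reduction": (89.6222, 5.6067),     # GB/s
    "s3d": (43.2002, 38.2952),          # Gflop/s
}

def retained_fraction(device_only, with_pcie):
    """Fraction of device-only performance left after PCIe transfers."""
    return with_pcie / device_only

def summarize(results):
    """Return {benchmark: retained fraction}, rounded to three decimals."""
    return {name: round(retained_fraction(*pair), 3)
            for name, pair in results.items()}

if __name__ == "__main__":
    for name, fraction in summarize(RESULTS).items():
        print(f"{name:10s} keeps {fraction:6.1%} of device-only performance")
```

With these numbers, matrix-matrix multiply retains well over 80% of its device-only rate, while FFT and reduction retain roughly 10% or less. Kernels that perform little computation per byte transferred benefit most from keeping data resident in device memory across several GPU operations.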
High-Performance EOS Cluster Interconnect
All 372 nodes of the EOS cluster attach to a Voltaire IB fabric infrastructure, as Fig. 6 illustrates.
Fig. 6 The Voltaire Grid Director GD4700 + 3 GD4036 4x QDR IB switches providing Full-Bisection bandwidth to all 372 hosts of the EOS Cluster at Texas A&M University after the expansion.
The GD4700 is a 4x Quadruple-Data Rate (QDR) "non-blocking" InfiniBand switch. At TAMU the GD4700 is currently a little more than half-way populated, with 372 4x QDR ports connected. This switch has a modular architecture and it has been configured with the internal fabric infrastructure to be expandable to up to 648 4x QDR ports.
The additional 48 dx360-M3 nodes attach to three Voltaire GD4036 switches, each with 36 4x QDR ports.
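To put "4x QDR" into numbers, the sketch below computes the nominal data rate of one port and the populated fraction of the GD4700. The lane signalling rate (10 Gbit/s for QDR) and the 8b/10b line encoding are properties of the InfiniBand standard and are assumptions brought in from outside this article; the port counts are the ones quoted above.

```python
def ib_data_rate_gbit_s(lanes=4, signal_gbit_s=10.0, encoding_efficiency=8 / 10):
    """Nominal one-direction data rate of an InfiniBand port in Gbit/s.

    Defaults describe a 4x QDR link: 4 lanes at 10 Gbit/s signalling,
    with 8b/10b encoding (8 data bits carried per 10 line bits).
    """
    return lanes * signal_gbit_s * encoding_efficiency

def populated_fraction(ports_in_use, ports_max):
    """Fraction of the switch's maximum port count that is connected."""
    return ports_in_use / ports_max

if __name__ == "__main__":
    rate = ib_data_rate_gbit_s()
    print(f"4x QDR data rate: {rate:.0f} Gbit/s = {rate / 8:.0f} GB/s per direction")
    print(f"GD4700 populated: {populated_fraction(372, 648):.0%}")
```

The nominal per-port figure is an upper bound; the bandwidth an MPI application observes also depends on the PCIe attachment of the host adapter and on protocol overheads.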
References
Literature and Presentations for the Intel Nehalem and Westmere Micro-Processors
- Stephen L. Smith, "32nm Westmere Family of Processors," Intel Developer Forum, IDF May 2009.
- Intel White Paper, "Introduction to Intel's 32nm Process Technology," Intel Corporation Document, December 2009.
- Intel Corporation, Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 1: Basic Architecture, Intel Corporation Document, May 2011.
- Dave Hill and Muntaquim Chowdhury, "Westmere Xeon-56xx 'Tick' CPU," Hot Chips Conference, HotChips 22, Aug 2010.
- N. A. Kurd, S. Bhamidipati, C. Mozak, J. L. Miller, P. Mosalikanti, T. M. Wilson, A. M. El-Husseini, M. Neidengard, R. E. Aly, M. Nemani, M. Chowdhury and R. Kumar, "A Family of 32 nm IA Processors," IEEE Journal of Solid-State Circuits, 46(1), pp. 119-130, January 2011.
White Papers on iDataPlex IBM, Cluster x1350 and Relevant Technologies
- Michael E. Thomadakis, "A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University," Texas A&M University, 2010.
Acknowledgments and Copyright Notice
This document is the result of careful investigation based on numerous technical sources, including papers published in the research literature, conference presentations, Intel and IBM technical reports, manuals and personal communications with developers and researchers. See the References section above. The contents are the responsibility of Michael E. Thomadakis and, along with the original artwork, remain copyright (C) 2010−2011 of the author and of Texas A&M University. Any of the contents of this page can be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.
Disclaimer: Every effort has been made to ensure the correctness and accuracy of the contents. Note that this is a continuously evolving document and it should be considered in DRAFT STATE. Visit often for corrections and additions. Contact me at miket AT tamu.edu or at miket AT sc.tamu.edu for corrections, suggestions or additions.
The original artwork was done using the graphical tools of the OpenOffice.org