A Westmere Addition to a High-Performance Nehalem
iDataPlex Cluster and DDN S2A9900 Storage
for Texas A&M University

by Michael E. Thomadakis, Ph.D., 2010−2011 (C)
Supercomputing Facility
miket(at)tamu(dot)edu


Visit the original article A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University for an in-depth discussion of all technologies relevant to Nehalem, Nehalem-EP and the EOS cluster prior to its expansion.

All material remains copyright © 2010−2011 of Michael E. Thomadakis and of Texas A&M University. The contents of this article may be used free of charge for educational purposes, provided that this Copyright Notice will remain visible.


EOS is an IBM "iDataPlex" (iDP), commodity, Linux high-performance cluster with nodes based on Intel's 64-bit CISC micro-processors. At its initial installation in early 2010, EOS consisted of 324 dx360-M2 iDP nodes. An expansion in late Spring 2011 added 48 dx360-M3 iDP nodes, four of which are equipped with nVidia GPUs based on the Tesla M2050 and M2070 HPC platforms.

The present article discusses only those technologies applicable to the new hardware added to EOS. Readers are encouraged to visit the initial article A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University for an in-depth discussion of all technologies relevant to EOS prior to its expansion.

Abbreviation Key

We will be using different units to measure capacities and speeds. To avoid confusion, we adopt the following notation.

Table 1 Abbreviations of Quantities

Powers of 2                                Powers of 10
KiB := 2^10 bytes ("Kilo-binary-Byte")     KB := 10^3 bytes ("Kilo-Byte")
MiB := 2^20 bytes ("Mega-binary-Byte")     MB := 10^6 bytes ("Mega-Byte")
GiB := 2^30 bytes ("Giga-binary-Byte")     GB := 10^9 bytes ("Giga-Byte")
TiB := 2^40 bytes ("Tera-binary-Byte")     TB := 10^12 bytes ("Tera-Byte")
PiB := 2^50 bytes ("Peta-binary-Byte")     PB := 10^15 bytes ("Peta-Byte")

Usually, rates, such as data transfer or floating point operations per second, are expressed in powers of 10, while storage sizes in powers of 2. See this reference for a discussion on standard international units.
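The two conventions lead to noticeably different figures for the same quantity. The short C sketch below is purely illustrative and prints two such conversions; the chosen quantities are arbitrary examples.

    /* units.c -- illustrative only: compare binary (2^n) and decimal (10^n)
     * interpretations of common storage prefixes. */
    #include <stdio.h>

    int main(void)
    {
        double kib = 1024.0, kb = 1000.0;
        /* A 2^30-byte (1 GiB) memory module expressed in decimal GB. */
        double one_gib_in_gb = (kib * kib * kib) / (kb * kb * kb);
        /* A 10^12-byte (1 TB) disk expressed in binary TiB. */
        double one_tb_in_tib = 1.0e12 / (kib * kib * kib * kib);

        printf("1 GiB = %.4f GB\n", one_gib_in_gb);    /* ~1.0737 */
        printf("1 TB  = %.4f TiB\n", one_tb_in_tib);   /* ~0.9095 */
        return 0;
    }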

EOS iDP Cluster Configuration

The EOS cluster currently consists of 372 iDataPlex nodes, all connected by a Voltaire 4x QDR InfiniBand (IB) fabric. The cluster attaches to high-performance DDN S2A9900 mass storage. Fig. 1 below illustrates the main EOS cluster components; we will use it to explain, at a high level, how users access the cluster.


Fig. 1 A graphical overview of the EOS iDP components at Texas A&M University.

The Westmere Processor

"Westmere" is the nickname of the micro-architecture implementing the "Intel64" ISA on a 32nm fabrication process. The Westmere processor and the Westmere-EP platform are in most other respects identical to the corresponding Nehalem ones. "Westmere-EP" chips contain up to six cores and two chips connect together via Intel's Quick Path Interconnect (QPI) to form 2 socket cc-NUMA, Shared Memory multi-Processors system.

The Intel Westmere processors, designated the "Intel Xeon 5600 series," are the "tick" in Intel's "tick-tock" model of processor design. The "tick" moves an existing micro-architecture to a new silicon process technology with a smaller feature size, whereas the "tock" is a substantially new micro-architecture. Nehalem was the previous "tock" (a major micro-architecture enhancement), implemented on a 45nm fabrication process. The "tock" following Westmere is the "Sandy Bridge" micro-architecture.

The Westmere Processor Chip

A Westmere chip, like its Nehalem predecessor, is a Chip Multi-Processor (CMP) divided into two broad domains, namely the "core" and the "un-core". Fig. 2 illustrates a Westmere CMP chip and its major parts.


Fig. 2 (a) A Westmere processor and memory module. The processor chip contains six cores, a shared L3 cache, DRAM controllers, and QuickPath Interconnect ports.


Fig. 2 (b) A Westmere Processor Chip micro-photograph.

Referring to Fig. 2, a Westmere chip consists of six identical Intel64 cores, a Cache Interface Unit (CIU) cross-bar switch connecting the six cores to the six L3 cache segments, the level-3 cache controller and data block memory, one integrated memory controller with three DDR3 memory channels, two QuickPath Interconnect ports, and auxiliary circuitry for cache coherence, power control, system management and performance monitoring.

Components in the "core domain" operate at the clock frequency of the core itself, which in EOS's case is 2.8 GHz. The "un-core" domain operates at a different clock frequency. This modular organization allowed Westmere to be based directly on Nehalem's chip design; Intel engineers designed Nehalem to be "naturally" expandable by making provisions for higher core counts.

Differences Between Westmere and Nehalem Processors

Westmere-based processors, called "Xeon 5600 series" processors by Intel, support all the features introduced by the Nehalem micro-architecture and add the following ones: the 32nm fabrication process, six cores per chip instead of four, a larger shared L3 cache (12 MiB versus 8 MiB), the AES-NI instruction-set extension discussed below, and additional un-core buffers for in-flight memory operations.

Westmere is designed to be "drop-in" compatible with Nehalem, using the same LGA 1366 (Land Grid Array with 1366 pins) socket, and it has thermal and power operating profiles similar to those of the Nehalem parts it replaces.

Ideal Floating-Point Throughput on Westmere

We can say that, for the Xeon X5660, in the steady state and under ideal conditions, each core can retire 4 double-precision or 8 single-precision floating-point operations each cycle. These rates are supported by the FP SIMD h/w within the execution pipeline of the core. Therefore, the nominal, ideal throughputs of a Westmere core, a six-core socket and a 2-socket node running at 2.8 GHz are, respectively,

11.2 GigaFLOPs/sec per core = 2.8 GHz × 4 FLOPs per cycle
67.2 GigaFLOPs/sec per socket = 11.2 GigaFLOPs/sec per core × 6 cores
134.4 GigaFLOPs/sec per node = 67.2 GigaFLOPs/sec per socket × 2 sockets,
in terms of double-precision FP operations.
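The 4 double-precision FLOPs per cycle come from the 128-bit SSE units: each core can issue one packed multiply and one packed add per cycle, each operating on two doubles. The sketch below is a minimal illustration of that instruction mix using SSE2 intrinsics; it is not a tuned kernel, and the array names and length are arbitrary choices.

    /* fp_simd.c -- minimal sketch of the SSE2 packed-double instruction mix
     * behind the 4 DP FLOPs/cycle figure: one 2-wide multiply plus one
     * 2-wide add per iteration.  Illustrative only.
     * Build (e.g.): gcc -O2 -msse2 fp_simd.c */
    #include <emmintrin.h>
    #include <stdio.h>

    #define N 1024   /* arbitrary length, multiple of 2 */

    int main(void)
    {
        static double a[N], b[N], c[N], y[N];
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 3.0; }

        for (int i = 0; i < N; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);           /* two doubles     */
            __m128d vb = _mm_loadu_pd(&b[i]);
            __m128d vc = _mm_loadu_pd(&c[i]);
            __m128d t  = _mm_mul_pd(va, vb);            /* 2 DP multiplies */
            _mm_storeu_pd(&y[i], _mm_add_pd(t, vc));    /* 2 DP adds       */
        }
        printf("y[0] = %f\n", y[0]);   /* 1.0*2.0 + 3.0 = 5.0 */
        return 0;
    }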

New Instructions in Westmere

The Advanced Encryption Standard New Instructions (AES-NI) extension in Westmere provides six new instructions to accelerate symmetric block encryption / decryption of 128-bit data blocks using the Advanced Encryption Standard, specified by NIST FIPS 197. Specifically, two instructions (AESENC, AESENCLAST) target AES encryption rounds, two instructions (AESDEC, AESDECLAST) target AES decryption rounds using the Equivalent Inverse Cipher, one instruction (AESIMC) performs the Inverse MixColumns transformation primitive, and one instruction (AESKEYGENASSIST) assists the generation of round keys from the cipher key for the AES encryption/decryption rounds. (Westmere also adds the PCLMULQDQ instruction for carry-less multiplication, which complements AES-NI in modes such as AES-GCM.)

These additional instructions may speed up applications which transfer data over encrypted channels, such as OpenSSH/OpenSSL, or those which encrypt/decrypt data on the fly in memory or on disk blocks.
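As a concrete illustration, the sketch below uses the corresponding compiler intrinsics (wmmintrin.h) to run the ten rounds of an AES-128 block encryption. It is a minimal sketch, not production code: the eleven round keys are assumed to be already expanded, so a dummy key schedule stands in for the real AESKEYGENASSIST-based key expansion.

    /* aesni_sketch.c -- minimal sketch of AES-128 block encryption with AES-NI
     * intrinsics.  The round keys are normally derived from the cipher key via
     * AESKEYGENASSIST; a dummy schedule is used here purely to show the
     * instruction sequence.  Build (e.g.): gcc -O2 -maes aesni_sketch.c */
    #include <wmmintrin.h>
    #include <stdio.h>

    static __m128i aes128_encrypt_block(__m128i block, const __m128i rk[11])
    {
        block = _mm_xor_si128(block, rk[0]);           /* initial AddRoundKey */
        for (int i = 1; i < 10; i++)
            block = _mm_aesenc_si128(block, rk[i]);    /* rounds 1..9         */
        return _mm_aesenclast_si128(block, rk[10]);    /* final round         */
    }

    int main(void)
    {
        __m128i rk[11], pt = _mm_set1_epi32(0x01234567);
        for (int i = 0; i < 11; i++)                   /* dummy key schedule  */
            rk[i] = _mm_set1_epi32(i);

        __m128i ct = aes128_encrypt_block(pt, rk);
        unsigned char out[16];
        _mm_storeu_si128((__m128i *)out, ct);
        printf("first ciphertext byte: 0x%02x\n", out[0]);
        return 0;
    }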

Westmere-EP Platform and dx360-M3 Node Organization

Integrated Memory Controller and Un-Core Enhancements

The integrated memory controller (IMC) on Westmere, like its Nehalem predecessor, supports three 8-byte channels of DDR3 memory operating at up to 1.333 GigaTransfer/sec (GT/s). Fig. 3 shows the internal memory organization on the chip with the three different cache levels, local NUMA memory and the on-chip IMC.


Fig. 3 Westmere Chip plus DRAM plus QPI: On-Chip Memory Hierarchy, Data Traffic Paths through and out of the Chip

Each DDR3 memory channel can operate independently, and the IMC services requests out of order to minimize latency. The total theoretical bandwidth between DRAM and the IMC is 3 channels × 8 bytes × 1.333 GT/s = 31.992 GB/s. On average, each core can therefore count on at least 31.992 / 6 ≈ 5.332 GB/s of the available memory bandwidth.

Note that the Nehalem un-core had 64 buffers per socket available for outstanding DRAM operations; Westmere increases this to 88, allowing more memory-access operations in flight. A separate publication will investigate this matter in depth and will demonstrate the actual per-core, per-socket and platform memory-access throughput versus the theoretical numbers. For comparison purposes we will also provide the same results for the Nehalem-EP platform.
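Sustained, as opposed to theoretical, bandwidth is usually measured with a streaming kernel. The sketch below is a minimal STREAM-triad-style measurement written in C with OpenMP; it is not the official STREAM benchmark, and the array size and repetition count are arbitrary choices.

    /* triad_bw.c -- minimal STREAM-triad-style bandwidth sketch with OpenMP.
     * Not the official STREAM benchmark; sizes and iterations are illustrative.
     * Build (e.g.): gcc -O2 -fopenmp triad_bw.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N    (32 * 1024 * 1024)   /* 32 Mi doubles per array, larger than the caches */
    #define REPS 10

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double scalar = 3.0;

        #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        for (int r = 0; r < REPS; r++) {
            #pragma omp parallel for
            for (long i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];       /* triad: 2 reads + 1 write */
        }
        double dt = omp_get_wtime() - t0;

        /* Three 8-byte arrays move per iteration (write-allocate traffic ignored). */
        double gbytes = 3.0 * 8.0 * (double)N * REPS / 1.0e9;
        printf("triad bandwidth: %.2f GB/s (a[0] = %f)\n", gbytes / dt, a[0]);
        free(a); free(b); free(c);
        return 0;
    }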

Local vs. Remote Memory Access in the Westmere-EP Platform

The Westmere-EP is a cc-NUMA platform with a 2-way memory NUMA organization. Fig. 4 shows the local and remote memory organization on the Westmere-EP platform. There is one DRAM module associated with each one of the 6-core Westmere processor chips and it is managed by the on-chip IMC.

The cc-NUMA organization implies that accesses to the local DRAM module have different latency and bandwidth profiles than accesses to the remote memory module. Applications sensitive to memory latency or bandwidth need to take special care with the placement of their threads and data blocks. Linux lets an application choose the set of processors on which each thread is allowed to run and select, at the virtual-memory page level, the NUMA node from which memory is allocated.


Fig. 4 Local and Remote Memory Organization in a Westmere-EP 2 Socket Node.
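The sketch below shows one way to express such placement explicitly from C, using the libnuma API (numa.h, linked with -lnuma); the node number and buffer size are arbitrary examples and error handling is kept to a minimum.

    /* numa_place.c -- minimal sketch of explicit thread and memory placement
     * with libnuma.  Node number and buffer size are arbitrary examples.
     * Build (e.g.): gcc -O2 numa_place.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA API not available on this system\n");
            return 1;
        }

        int node = 0;                        /* run on, and allocate from, node 0 */
        size_t bytes = 64UL * 1024 * 1024;   /* 64 MiB buffer                     */

        numa_run_on_node(node);              /* bind this thread to node 0's CPUs */
        double *buf = numa_alloc_onnode(bytes, node);   /* pages from node 0 DRAM */
        if (buf == NULL) { perror("numa_alloc_onnode"); return 1; }

        for (size_t i = 0; i < bytes / sizeof(double); i++)
            buf[i] = 0.0;                    /* touch the pages on node 0         */

        printf("placed a %zu-byte buffer on NUMA node %d\n", bytes, node);
        numa_free(buf, bytes);
        return 0;
    }

The same placement can often be obtained without code changes through the numactl utility, for example: numactl --cpunodebind=0 --membind=0 ./app.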

dx360-M3 SMP Nodes Architecture

The IBM dx360-M3 iDataPlex nodes implement the Westmere-EP platform. They are very similar to the dx360-M2 of the initial EOS installation. For more details, please refer to the companion publication "A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University".

GPU Hardware on EOS Cluster

The recent expansion of EOS introduced GPU hardware to the cluster. Currently, versions 4.x and 3.x of the Nvidia CUDA Toolkit are supported.

dx360-M3 SMP Nodes With nVidia Tesla T20 GPU Infrastructure

Four of the dx360-M3 nodes support the nVidia T20 (nVidia "Tesla M2050" and "M2070") GPU infrastructure. The Tesla GPU h/w is based on the latest Fermi (GF100) GPU processor chip and can be programmed to off-load data-parallel, computation-intensive workloads from the Westmere CPUs. The four Westmere nodes and their associated Fermi GPU hardware are listed in Table 2.

Table 2  GPU-Enabled EOS Nodes

Node Name  GPU Hardware  Quantity
node349    Tesla M2050   2
node350    Tesla M2070   1
node361    Tesla M2050   2
node362    Tesla M2070   1

The Nvidia CUDA environment supports the entire CUDA Development Toolkit, SDKs and sample code, as well as a number of common numerical libraries with code that is highly optimized for the Fermi GPU architecture. These include Nvidia CUFFT (Fast Fourier Transform), CUBLAS (BLAS routines), CUSPARSE (sparse matrix routines), CURAND (random number generator routines), NPP (performance primitives for imaging), etc.
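As an illustration of calling one of these libraries from host code, the sketch below off-loads a double-precision matrix multiply to the GPU through the CUBLAS (v2) C API. It is a minimal sketch, not an EOS-specific recipe: the matrix size is arbitrary and error checking is reduced to the bare minimum.

    /* cublas_dgemm.c -- minimal sketch of a DGEMM off-loaded to the GPU via the
     * CUBLAS v2 C API.  Matrix size is arbitrary; error handling is minimal.
     * Build (e.g.): nvcc -O2 cublas_dgemm.c -lcublas */
    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024   /* square matrices, column-major as CUBLAS expects */

    int main(void)
    {
        size_t bytes = (size_t)N * N * sizeof(double);
        double *A = (double *)malloc(bytes);
        double *B = (double *)malloc(bytes);
        double *C = (double *)malloc(bytes);
        for (size_t i = 0; i < (size_t)N * N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        double *dA, *dB, *dC;
        cudaMalloc((void **)&dA, bytes);
        cudaMalloc((void **)&dB, bytes);
        cudaMalloc((void **)&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        double alpha = 1.0, beta = 0.0;
        /* C = alpha * A * B + beta * C  (no transposition) */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        printf("C[0] = %f (expected %f)\n", C[0], 2.0 * N);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }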

Fig. 5 illustrates a dx360-M3 node with two Tesla M2050 GPU devices attached. The two M2050 devices attach to a common PCIe bus, along with the QDR IB adapter, via a PCIe switch.


Fig. 5 A dx360-M3 node attached to two Tesla T20 GPUs (M2050).

nVidia Tesla M2070 GPU Performance on EOS

In this section we present some basic performance results of applications running exclusively on a single GPU on EOS. We built the SHOC (Scalable Heterogeneous Computing) benchmark suite using the CUDA Toolkit v4.0, which is available on EOS, and exercised it on one of the Tesla M2070 nodes.

This benchmark suite includes code that evaluates the performance of the GPU and the attached host system at three levels. The first level ("Level 0") measures basic low-level performance of the system, such as transfer of data between host and device memories, device memory access, and the maximum number of single- and double-precision flop/s on the GPU. At the next level ("Level 1") it uses more complex code such as FFT, BLAS, prefix-sum, sorting or sparse matrix-vector multiplication. Finally, at the last level ("Level 2") it uses actual application code, such as S3D, a compute-intensive kernel from the S3D turbulent combustion simulation application.

These performance results should provide a good indication of the computational power of a single GPU. Note that there are libraries with code optimized to run on Nvidia GPUs which can be invoked by application code running on the host side (the CPUs). The routines in these libraries can provide performance levels similar to those presented below.
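For instance, the "BusSpeedDownload" figure in Table 3 below corresponds to the kind of host-to-device transfer measurement sketched here. This is not the SHOC code itself; the buffer size and repetition count are arbitrary choices.

    /* h2d_bw.c -- minimal sketch of a host-to-device PCIe bandwidth measurement
     * with the CUDA runtime API (the kind of figure SHOC's "BusSpeedDownload"
     * test reports).  Not the SHOC code; sizes and iteration counts are arbitrary.
     * Build (e.g.): nvcc -O2 h2d_bw.c */
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define BYTES (256UL * 1024 * 1024)   /* 256 MiB per transfer */
    #define REPS  10

    int main(void)
    {
        void *host, *dev;
        cudaMallocHost(&host, BYTES);     /* pinned host memory for full PCIe rate */
        cudaMalloc(&dev, BYTES);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int r = 0; r < REPS; r++)
            cudaMemcpy(dev, host, BYTES, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("host-to-device bandwidth: %.3f GB/s\n",
               (double)BYTES * REPS / (ms / 1000.0) / 1.0e9);

        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaFree(dev); cudaFreeHost(host);
        return 0;
    }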

Table 3  Performance of a Tesla M2070 GPU on an EOS Node

Name Description Results
BusSpeedDownload bandwidth of transferring data across the PCIe bus from host memory to device memory 6.0266 GB/sec
BusSpeedReadback bandwidth of reading data back from a device across the PCIe bus 6.5736 GB/sec
MaxFlops maximum achievable floating point performance using a combination of auto-generated and hand coded kernels 1002.5300 Gflop/s single precision
503.3490 Gflop/s double precision
DeviceMemory bandwidth of memory accesses to various types of device memory (global, local, and image) 86.9093 GB/s gmem_readbw
11.8742 GB/s gmem_readbw_strided
101.0080 GB/s gmem_writebw
3.7163 GB/s gmem_writebw_strided
359.2450 GB/s lmem_readbw
439.3810 GB/s lmem_writebw
70.5548 GB/sec tex_readbw
FFT forward and reverse 1D FFT 293.8790 Gflop/s fft_sp
30.3562 Gflop/s fft_sp_pcie
294.6240 Gflop/s ifft_sp
30.3642 Gflop/s ifft_sp_pcie
138.7820 Gflop/s fft_dp
15.0926 Gflop/s fft_dp_pcie
139.0630 Gflop/s ifft_dp
15.0959 Gflop/s ifft_dp_pcie
SGEMM matrix-matrix multiply 597.5220 Gflop/s sgemm_n
593.7600 Gflop/s sgemm_t
523.2060 Gflop/s sgemm_n_pcie
520.3190 Gflop/s sgemm_t_pcie
297.5860 Gflop/s dgemm_n
296.5840 Gflop/s dgemm_t
240.0230 Gflop/s dgemm_n_pcie
239.3710 Gflop/s dgemm_t_pcie
MD computation of the Lennard-Jones potential from molecular dynamics 27.8055 GB/s md_sp_bw
12.4559 GB/s md_sp_bw_pcie
33.7287 GB/s md_dp_bw
17.8052 GB/s md_dp_bw_pcie
Reduction reduction operation on an array of single or double precision floating point values 89.6222 GB/s reduction
5.6067 GB/s reduction_pcie
90.4315 GB/s reduction_dp
5.6141 GB/s reduction_dp_pcie
Scan parallel prefix sum on an array of single or double precision floating point values 27.8229 GB/s scan
0.0065 GB/s scan_pcie
19.5829 GB/s scan_dp
0.0065 GB/s scan_dp_pcie
Sort sorts an array of key-value pairs using a radix sort algorithm 1.6585 GB/s sort
1.0837 GB/s sort_pcie
Spmv sparse matrix-vector multiplication 0.6816 Gflop/s spmv_csr_scalar_sp
0.4615 Gflop/s spmv_csr_scalar_sp_pcie
0.6428 Gflop/s spmv_csr_scalar_dp
0.3861 Gflop/s spmv_csr_scalar_dp_pcie
0.6551 Gflop/s spmv_csr_scalar_pad_sp
0.4506 Gflop/s spmv_csr_scalar_pad_sp_pcie
0.6363 Gflop/s spmv_csr_scalar_pad_dp
0.3846 Gflop/s spmv_csr_scalar_pad_dp_pcie
9.8940 Gflop/s spmv_csr_vector_sp
1.2491 Gflop/s spmv_csr_vector_sp_pcie
8.6651 Gflop/s spmv_csr_vector_dp
0.8700 Gflop/s spmv_csr_vector_dp_pcie
10.4728 Gflop/s spmv_csr_vector_pad_sp
1.2681 Gflop/s spmv_csr_vector_pad_sp_pcie
9.1886 Gflop/s spmv_csr_vector_pad_dp
0.8790 Gflop/s spmv_csr_vector_pad_dp_pcie
7.5770 Gflop/s spmv_ellpackr_sp
6.1492 Gflop/s spmv_ellpackr_dp
Stencil2D a 9-point stencil operation applied to a 2-D data set. In the MPI version, data is distributed across MPI processes organized in a 2D Cartesian topology, with periodic halo exchanges 3.6928 s stencil
5.3140 s stencil_dp
Triad a version of the STREAM Triad benchmark, implemented in CUDA (it also includes PCIe transfer time) 7.7279 GB/s triad_bw
S3D computationally-intensive kernel from the S3D turbulent combustion simulation application 43.2002 Gflop/s s3d
38.2952 Gflop/s s3d_pcie
24.3993 Gflop/s s3d_dp
21.3252 Gflop/s s3d_dp_pcie

High-Performance EOS Cluster Interconnect

All 372 nodes of the EOS cluster attach to a Voltaire InfiniBand fabric infrastructure in the fashion Fig. 6 illustrates.


Fig. 6 The Voltaire Grid Director GD4700 plus three GD4036 4x QDR IB switches, providing full-bisection bandwidth to all 372 hosts of the EOS cluster at Texas A&M University after the expansion.

The GD4700 is a 4x Quadruple-Data-Rate (QDR) "non-blocking" InfiniBand switch. At TAMU the GD4700 is currently a little more than half-way populated, with 372 4x QDR ports connected. This switch has a modular architecture and has been configured with the internal fabric infrastructure to be expandable to up to 648 4x QDR ports.

The additional 48 dx360-M3 nodes attach to three Voltaire GD4036 switches, each providing 36 4x QDR ports.
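From a compute node, the negotiated link width and speed of the host channel adapter can be verified programmatically. The sketch below queries port 1 of the first InfiniBand device through the libibverbs API; the device index, port number, and build line are assumptions, and the width/speed codes are interpreted as defined in the verbs header.

    /* ib_port.c -- minimal sketch: query the active width/speed of IB port 1 on
     * the first HCA via libibverbs.  Device index and port number are example
     * choices.  Build (e.g.): gcc -O2 ib_port.c -libverbs */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (devs == NULL || num == 0) { fprintf(stderr, "no IB devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_port_attr attr;
        if (ibv_query_port(ctx, 1, &attr) != 0) { perror("ibv_query_port"); return 1; }

        /* active_width codes: 1=1x, 2=4x, 4=8x, 8=12x; active_speed codes:
         * 1=SDR (2.5 Gb/s), 2=DDR (5 Gb/s), 4=QDR (10 Gb/s) per lane, so a
         * 4x QDR link reports width 2 and speed 4. */
        printf("device %s, port 1: state=%d width_code=%d speed_code=%d\n",
               ibv_get_device_name(devs[0]), (int)attr.state,
               (int)attr.active_width, (int)attr.active_speed);

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }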

References

Literature and Presentations for the Intel Nehalem and Westmere Micro-Processors

  1. Stephen L. Smith, "32nm Westmere Family of Processors," Intel Developer Forum, IDF May 2009.
  2. Intel White Paper, "Introduction to Intel's 32nm Process Technology," Intel Corporation Document, December 2009.
  3. Intel Corporation, Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, Intel Corporation Document, May 2011.
  4. Dave Hill and Muntaquim Chowdhury, "Westmere Xeon-56xx 'Tick' CPU," Hot Chips Conference, HotChips 22, Aug 2010.
  5. N. A. Kurd, S. Bhamidipati, C. Mozak, J. L. Miller, P. Mosalikanti, T. M. Wilson, A. M. El-Husseini, M. Neidengard, R. E. Aly, M. Nemani, M. Chowdhury and R. Kumar, "A Family of 32 nm IA Processors," IEEE Journal of Solid-State Circuits, 48(1), pp. 119--130, January 2011

White Papers on iDataPlex IBM, Cluster x1350 and Relevant Technologies

  1. Michael E. Thomadakis, "A High-Performance Nehalem iDataPlex Cluster and DDN S2A9900 Storage for Texas A&M University," Texas A&M University, 2010.

Acknowledgments and Copyright Notice

This document is the result of careful investigation based on numerous technical sources, including papers published in the research literature, conference presentations, Intel and IBM technical reports, manuals, and personal communications with developers and researchers; see the References section above. The contents are the responsibility of Michael E. Thomadakis and, along with the original artwork, remain copyright © 2010−2011 of Michael E. Thomadakis and of Texas A&M University. Any of the contents of this page may be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.

Disclaimer: Every effort has been made to ensure the correctness and accuracy of the contents. Note that this is a continuously evolving document and it should be considered in DRAFT STATE. Visit often for corrections and additions. Contact me at miket AT tamu.edu or at miket AT sc.tamu.edu for corrections, suggestions or additions.

The original artwork was created using the graphical tools of the OpenOffice.org suite.