IBM p5-575 Cluster 1600
and
DDN S2A9550 Storage Array
by Michael E. Thomadakis, Ph.D., Senior SC staff (**)
Table of Contents
- Introduction
- The Power5+ Processor
- Simultaneous Multi-Threading
- p5-575 Node Architecture
- Cluster HPS Interconnect
- HPS Protocol Stack
- Cluster Storage
- Performance Results
- Cluster Configuration Summary
- References
- Abbreviations
- Copyright
hydra is a high-performance IBM "Cluster 1600", based on IBM's 64-bit Power5 microprocessor. The cluster currently consists of a total of 52 model p5-575 nodes, each having 16 Power5+ processors running at 1.9GHz and 32 GiBytes1 of DDR2 DRAM. Each p5-575 node is a high-performance, Shared-memory Multi-Processor (SMP), running the 64-bit version of AIX 5L (5.3) as a single system image. Out of the 52 nodes, 48 are connected together by a two-plane High-Performance Switch (HPS) fabric. The same 48 nodes connect to a DataDirect Networks DDN S2A9550 high-performance disk storage array. The normal mode of problem-solving on Hydra is running distributed or shared-memory computations under the control of the LoadLeveler batch scheduler. See hydra's user guide for information, including compiling and running interactive and batch jobs.
Introduction
This write up presents a rather detailed discussion of the IBM Cluster 1600 and underlying technologies. These make this system one of the top high-performance platforms in the 2007 time-frame.
Our target audience falls into two broad categories. The first category includes those seeking intimate and detailed understanding of inner workings of the Power5+/Cluster 1600 platform, for the purpose of developing scalar and running, parallel or distributed computations that fully utilize its numerous capabilities, while avoiding unnecessary bottlenecks. The second one includes, those who need accurate account of the underlying architecture for the purpose of studying parallel and distributed systems in order to improve them. The Power5+/Cluster 1600 architecture is one stage in the evolution of high-performance systems based on the Power Instruction Set Architecture. The discussion here may also serve as background material for a better understanding of subsequent clusters which currently are based on Power6 and the upcoming Power7.
Note that the information presented here is not readily available elsewhere. Technical material on the p5-575 and the HPS internals are very hard to come by. The author of this article had to delve into a long list of sources from the research literature along with the standard onerous IBM technical documentation. All material remains copyright © 2007--2009 of Michael E. Thomadakis and of Texas A& University.
This report will focus on the Power5+ micro-processor, the p5-575 multi-processor node, the High-Performance Switch, the associated software and protocol stacks, the Load Leveler batch scheduler and the General Parallel File System. References are given at the end for further investigation.
The Power5+ Processor
Power5 processors implement the 64-bit PowerPC instruction set architecture (PPC64 ISA). In the literature "Power5" and "Power5+", refer, respectively, to the 120 and 90 nano-meter lithography implementations of the chip. TAMU has the Power5+ version and we would be using Power5 and Power5+ interchangeably.
Overview of the Power5 Processor Chip
A Power5+ chip is a "Chip-Multi Processor" (CMP) as it contains two identical processor cores and shared Level 2 Cache memory within a single die Fig. 1 illustrates a Power5+ CMP chip. Theoretically each core can achieve 7.6 GFlops/sec and, thus, a 16 core system, 121.6 GFlops/sec.
A Power5+ processor-memory module consists of a dual-core processor chip (with both cores active, operating at 1.9 GHz), eight memory DIMM slots and a private, high-performance, custom 36MiB Level 3 (L3) cache memory. Each processor chip contains a shared 1.9MiB Level 2 (L2) cache, the memory controller and L3 cache directory. The L2 and L3 are divided into three slices which can be accessed independently by any of the cores. The L2 slices can be accessed by the "Core-Interface Unit" cross-bar.
All processor chip on-board resources operate at the 1.9 GHz frequency. However, the memory controller operates asynchronously, with the memory interface running at 1066 MHz to connect with the 533 MHz DDR2 memory. Notice that there is an "asymmetry" in the data bus widths connecting the parts together. Each Power5 core has two 32 byte input buses and only one 8 byte output bus. One input bus goes to the Level 1 instruction cache and the other to the L1 data cache. The CIU connects the two cores to each one of the three L2 memory cache slices. Each slice can be accessed independently by the cores. Each L2 cache memory slice has an 8 byte input and a 32 byte output bus connecting to the interface with the CIU. Each L2 slice has a 32 byte input and a 32 byte output bus attaching "downstream" to the "Enhanced Distributed Switch" (EDS). The EDS is a switch which routes data into and out of the Power5 processor to other neighboring Power5 chips. It is responsible to exchange data, commands and coherence traffic with neighboring EDS. The Layer 3 cache memory is off-chip and it is shared by both cores. There is one 16 byte input and a 16 byte output data bus connecting the L3 cache with the EDS. These buses and the L3 cache operate at 1/2 of the processor clock frequency, at 950MHz. The interface to the L3 cache is "elastic" which means that there is adequate buffering at both ends to allow for the signal timing variations that are possible at these high clock speeds. The L3 cache is shared between the two cores and it is also segmented into tree slices. Two of the coherent point-to-point connections with neighboring processor-memory modules are called the "Vertical" and the "SMP" fabric buses. Each is 8 bytes wide per direction and all operate at 1/2 the processor clock frequency, at 950MHz.
The high performance of the Power5+ architecture relies on the fact that the Fabric-Bus logic, the DRAM controller, the L3 controller and directory are housed within the same die as the two processor cores. This saves a significant amount of off-chip communications and makes possible a tightly coupled low-latency, high bandwidth CMP system.
Fig. 1 A Power5+ Processor and Memory module. The processor chip contains two cores, L3 cache and DRAM controllers, and Enhanced Distributed Switch for coherent SMP traffic.
A Power5 processor chip consists of the following resources which are shared by both cores. Referring to Fig. 1, there are
- two Power5+ cores
- CIU: Cache Interface Unit (crossbar switch connecting the 2 cores to the 3 L2 segments)
- L2C: a three-segment level-2 cache (shared by both cores)
- L3cd: level-3 cache directory and controller
- Mc: main memory controller
- EDS: Enhanced Distributed Switch (fabric bus controller)
Outside the Power5 chip, but at close physical proximity, we find the three-segment L3 cache memory. Physical proximity is accomplished by placing the L3 and the Power5 chips within the same carrier module. As we will see later this packaging comes into at least three variations, DCM, QCM or MCM.
Instruction Pipeline in the Core
Power5 processor cores are super-scalar, super-pipelined microprocessors, with speculative, out-of-order execution data-paths, coupled with a multilevel storage hierarchy. Power5 has extensive support for data pre-fetching, branch prediction and speculative instruction execution. Fig. A illustrates in the part (a) the instruction pipeline and in part (b) the different functional units within a Power5 core.
Fig. A a.The instruction pipeline of a Power5 processor.
b.The functional units in the Power5 processor data-path and their multiplicity
for sharing between the two hardware SMT threads.
- IF: Instruction Fetch; IC: Instruction Cache (on-chip)
- BP: Branch Prediction
- Di: Decode Stage i
- GD: Group Dispatch
- MP: Mapping
- ISS: Instruction Issue
- RF: Register File Read
- EX: Execute; EA: Compute Effective Address;
- F6: 6-cycle Floating-Point Execution Pipeline
- DC: Data Caches (on-chip
- fmt: Data Format
- WB: Write-Back Stage;
- Xfer: Transfer;
In the sequel, we are discussing the instruction flow through the Power5 pipeline and we will be referring to Fig. A parts (a) and (b).
In IF stage, the processor retrieves up to 8 instructions from the L1 (instruction) cache, in the same clock-cycle. The L1 insurrection cache is a 2-way set-associative 64KiB on-chip memory. Instructions are scanned for branches and if they are found the direction is predicted (at BP stage) using 3 branch history tables. If all fetched instructions are branches, all can be predicted simultaneously. Power5 can predict the direction and the target of the branch. Information on branches are maintained in a special BIQ unit at fetch time. After fetch, instructions are put into buffers each capable of storing 24 instructions.
Subsequently, up to 5 instructions are retrieved (at D0 stage) to form a group (stage D1), all from the same thread. Instruction groups are then decoded in parallel. When all data-path resources are available the entire group is dispatched (at GD stage), with each instruction in program order.
After dispatch, each instruction goes through "register renaming" (MP stage) where "logical" (or "architected") instruction registers are mapped to physical registers within the actual register file in the data-path. This mechanism resolves the so called "write-after-write" and "write-after-read" data hazards. After renaming, instructions enter shared "issue queues" waiting for the execution stage. Power5 simplifies instruction progress tracking by tracking each group. A Global Completion Table (GCT) maintains information about each dispatched instruction group. Groups which are dispatched enter GCT and as instructions finish, the entry for the entire group is deallocated when all of its instructions have finished and their changes have been committed to architected storage. GCT entries are allocated and deallocated in program order for each thread, but entries for different threads can be intermingled.
Load and store instructions in a group before dispatching require the allocation of entries in the Load Reorder Queue (LRQ) and Store Reorder Queue (SRQ), respectively. These queues are used to maintain the program order of load and store instructions.
When all input operands of an instruction are available, this instruction is placed in an issue queue (ISS stage). The processor then selects the oldest instruction in the ISS. Instructions can be chosen from different groups and SMT threads for issue without distinction. Up to 8 instructions can issue in each clock cycle simultaneously, one to each of the 8 functional units in the data-path (see later).
Upon issue, each instruction retrieves inputs from the physical registers (at RF stage) and executes at the proper functional unit. Specifically, the branch execution unit, the fixed-point execution units, and the logical condition register unit are used in the EX pipeline stage. The load and store units are used in the EA, DC, and Fmt pipeline stages. Floating-point execution units are sued in the F1 through F6 pipeline stages. The results are finally written back to the output physical register (at WB stage).
When all of the instructions in a group have finished (without generating any exception) and the group is the oldest group of a given thread, the group is committed (at CP stage). One group can commit per cycle from each thread.
Power5 Data-Path and Functional Units
A Power5 processor core includes the following Functional Units (FUs) and resources which are shared by both SMT h/w threads.
- FXU fixed point (2 integer units)
- ISU instruction sequencing
- IDU instruction decoding
- LSU load/store (2 units)
- IFU instruction fetch
- FPU floating point (2 units)
- MC memory controller (on-chip)
- L2 level-2 cache (on-chip)
- level-3 cache directory and controller (on-chip)
- L3 level-3 cache memory (off-chip but on same module (DCM or MCM, see later)
Floating-Point Processing and Exception Handling
Please review the following write-up on Floating Point Exception handling for Power5.Simultaneous Multi-Threading
A Power5+ core supports Simultaneous Multi-Threading (SMT), a design scheme which allows two hardware threads to execute simultaneously within each core and share its resources. Fig. A part (b) shows the different functional units within a Power5 core which the two SMT threads can share.
The objective of SMT is to allow the 2nd hardware thread to utilize functional units in a core which the 1st hardware thread leaves idle. If sufficient functional units are available, both threads can "simultaneously" (i.e., at the same clock period) utilize them, making progress. The alternative would be to let a thread run until it has to stall (e.g., waiting for a lengthy FP operation or a cache memory miss to be handled), at which point in time the OS dispatcher would have to carry out a costly context-switching operation with processor state swapping. SMT exploits concurrency at a very fine level and it obviates the need for a context-switching. The advantages of SMT are several, including among others, the increased utilization of functional units that would have remained idle, the overall increased throughput in instructions completed per clock cycle and the overhead savings from the lower number of thread switching operations.
When SMT is ON, each Power5 core appears to the Operating System as two logical processors. An SMT enabled p5-575 node appears as 32 logical processors to AIX. SMT enabled processors offer performance advantages to certain types of computation mixes. In our experiments, multi-threaded computations proceed faster when the average run-queue length on a node is > 16. Single-Threaded mode is much slower in this case. SMT is more beneficial when the contention for common resources shared by the two hardware threads is lower. These resources can be functional units within the core and L1, L2 or L3 cache memory blocks within the same Power5 chip. If used properly experiments have shown that SMT increases the completion rate for instructions within a system. Studies suggest that the benefit accrued for certain applications can reach up to 15% to 35% reduction in compute time.
The increased miniaturization doubles the number of transistors on a die approximately every 18 months ("Moore's Law"). This allows more functional units and memory to become available on-chip. This initially lead to the design of very large and complex, monolithic processor cores. These in general, are much harder to design and implement properly, with very long and costly design and debug cycles.
On another front the initial euphoria in modern processors employing super-scalar, wide-issue, super-pipelined, out of order completion pipelines, passed quickly when it became clear that these designs have passed the point of diminishing returns. Even in the most complex designs, most of the execution resources often remain idle simply because single thread computation in most code does not have the high Instruction-Level Parallelism (ILP) opportunities which are necessary to utilize them.
The higher chip density came at another increasing cost: the shrinking of the transistor gate sizes has increased leakage current, which lead to higher power consumption. With increased clock frequencies the on-chip interconnects have higher relative latencies and are becoming a major cost factor. In short, the higher power consumption and low resource utilization has made the designs of large monolithic processor cores with ever increasing clock frequencies undesirable.
SMT, along with Chip-MultiProcessor (CMP) technologies, appear to be the next step in the evolution of processor design. SMT benefits code which has reached its ILP scalability limit. CMP is a much more efficient form of SMP since data does not have to cross chip boundaries which is relatively very costly. It is expected that SMT will not only increase the utilization of idle functional units but also make the ratio of Flop/Watt more favorable as more functional units carry out useful work per clock cycle, thus requiring fewer clock cycles to finish a computation.
SMT has been shown to benefit code with instructions experiencing longer than average clocks-per-instruction rate. For instance, when one hardware thread stalls on a memory miss, the 2nd one can be using an FP or Int unit. However, SMT is not a panacea and may not benefit all types of codes. There have also been cases where code experiences a relatively mild slow-down. Code with heavy usage of the floating point units or the L1 and L2 caches most likely wont benefit from SMT. Compilers and operating systems must now take into consideration these new parameters if SMT is to be used beneficially. Even though SMT is not a new area, more investigation is necessary to make its deployment more readily usable.
The Power5 core provides a number of partitioned or replicated units to be shared between the two h/w SMT threads, including
- Instruction Fetch Unit: pre-fetch buffer (split, 2-entry per thread); BIQ (split, 8-entry per thread; branch prediction control (replicated); link stack (replicated)
- Instruction Decoding Unit: 6 entry IFB per thread
- Instruction Issue Unit: 20-entry Global Completion Table (GCT) linked-list; 120 General Purpose and 120 Floating Point Register mapper; 40-entry Condition Register (CR) mapper; 32-entry Fixed-point Exception Register (XER) mapper; 24-entry Floating-point issue Packed Queue (FPQ)
Power5 Processor-Memory Architecture
The Power5 architecture includes several improvements over the well-established Power4 one (see Fig. 2). A critical difference between Power4 and Power5 processors is that, the L3 cache and its control logic have been moved out of the critical path of communication between processor and main memory. This has decreased the latency in all processor-memory operations and has sped up coherence operations for the cache memories. In Power5, L3 acts as a "victim cache" for all cache blocks which need to be replaced in L2. See at the end of this page for some standard benchmark performance numbers.
p5-575 Node Architecture
Each of the 52 p5-575 nodes internally consists of a "compute planar" with 8 Dual-Core Power5+ Modules (DCMs) and an "I/O planar" providing extensive I/O capabilities. (IBM calls this particular p5-575 as an "I/O node configuration".)
Dual-Chip Module (DCM) Packaging
A DCM (see Fig. 3) packages a dual-core Power5+ chip and a 36MiB L3 cache memory at close proximity for reduced signal propagation delays and increased data exchange throughput.
p5-575 SMP Nodes Compute Plane
Eight DCMs connect together to form a 16-Way SMP via the Distributed Bus Fabric (DBF), a high-speed interconnect operating at 1/2 of the processor's speed (see Fig. 4). The DBF is a collection of buses connected in a distributed switch fabric fashion to provide cache-coherence and high-speed data exchange among the processor cores. The fact that all critical path logic (such as, memory controller, fabric bus controllers, L3 controller and directory) are on-chip within the same die as the two Power5 cores, allows the p5-575 to become a very tightly-coupled SMP. The fabric buses, which form a 2D interconnect for coherent SMP traffic, operate at 1/2 of the processor frequency and allow 8 bytes data transfer per direction. Another important advantage of the Power5+ architecture in the p5-575 implementation is the fact that the High-Performance Switch adapters directly attach via the GX+ bus on the processor-memory interconnect, bypassing expensive I/O bridging logic. The p5-575 SMP employs two such GX+ buses to support the high bandwidth that is needed by the two HPS switch ports. Each GX+ bus has more bandwidth than necessary to support one of the HPS links on the host.
Cluster Interconnect
The powerful Power5+ 575 nodes form high-performance clusters with an equally high-performance interconnect fabric, called the "High-Performance Switch" (HPS). At TAMU, 48 nodes of the hydra's cluster connect together through two HPS planes (see Fig. 5). The current HPS employs the fourth generation technology in host adapter (see Fig. 6), switch fabric (see Fig. 8), and transmission links.
Each p5-575 node connects to the two HPS fabric planes through a host-side switch adapter, called the Switch Network Interface (SNI), shown in Fig. 6. Each SNI has two full-duplex HPS transmission links, one for each switch plane. On the p5-575 side, each SNI attaches directly to two of its GX+ buses. Each GX+ is a full-duplex bus with 4 bytes per direction, which directly attaches to one Power5+ DCM (see Fig. 4), and runs at 1/3 of the processor clock. The nominal bi-directional throughput of a GX+ bus is rated at 5.067 Gibytes/sec (2.54Gibytes/sec per direction). Notice in Fig. 3, there is one GX+ bus per DCM. Each HPS link uses copper media and allows full-duplex communication, with a nominal throughput of 2 Gibytes/sec per direction. The raw throughput required by the two HPS link necessitated the connection of each SNI to two GX+ buses. The fact that the SNI adapter directly attaches to GX+ buses allows very low latency access to the node's cache and main memories. The SNI contains of a multi-threaded communications processor which supports several processor off-load capabilities to facilitate high-speed, low-latency, concurrent access to local and remote memories. One of the interesting design features of the SNI is that it can directly map user (application memory) for direct access, which can avoid expensive user to system memory intermediate data copies whenever possible.
Fig. 6 Logical view of the HPS host adapter (SNI), attaching directly to two GX+ bus connections on the host side and two Switch Ports for the two HPS links, one to each of the two HPS switch planes.
A HPS "switch-board" is a 16X16 switching fabric (see Fig. 8) which connects 16 Power5 host SNI ports together. A switch-board consists of 8 "Switch-Chips" (SCs) similar to the one shown in Fig. 9. Each 16X16 HPS switch-board is a Bi-directional Multi-Stage Interconnection Network ("BMIN") or a "Fat-Tree". At TAMU, 48 p5 575 nodes are connected together with three HPS switch-boards in a "triangular" topology as shown in Fig. 10. This particular connectivity is necessary to maintain the Fat-Tree topology for systems with more than 16 host ports. Note that Fig. 10 shows one of the two HPS planes installed at TAMU. Each transmission links is bi-directional and can carry data at 2 Gibytes / sec per direction.
Fig. 8 The 16X16 HPS Switchboard. Each one of the six TAMU switches is like this one.
Fig. 9 Logical view of the Switch Chip: 8X8 cross-bar with virtual output queuing and 32Kibytes central buffer.
Fig. 10 Schematic detail of the "triangular" HPS connection that makes up an HPS plane. Each of the three switches connects through 8 HPS links to each of the other two switches. The schematic shows only one of the two planes that TAMU has installed.
SCs implement "cut-through" switching with buffering at the "Central-Queue" only for packets whose destination output port is busy. When a destination output port is idle, packet data flow directly from the input port to this output port ("cut-through switching"). Routing in the HPS is source based and it is adaptive. The source SNI specifies up to four different routes per SC for the path portion reaching the "Least Common Ancestor" (LCA) SC. As a packet travels towards the LCA, the switching SC which receives it selects one out of a predetermined set of possible output ports to forward it. Fig. 11 presents some pertinent examples of this specific routing approach. Notice that there are four distinct paths connecting endpoints (A, B), one path for (D, E) and again four paths for endpoints (C, X).
The decision of the particular output port is based on its current load. If all output ports are busy, the packet is stored in the memory of the central buffer within the SC. A SC will use (or schedule to use) the least loaded output port among the four allowed to be be used at each switching point by the source route. When the packet reaches the LCA switch chip it "turns" and travels downward towards the destination SNI. The path from the LCA to the destination is always unique and there is no adaptivity. An input port has 8Kibytes of packet space for incoming packets. A packet is divided into "flits" (for flow-control digits) and transmitted from an output port to the downstream input port. When a flit is received, the input port sends out an Acknowledgment to the upstream output port. The output port has one flit "credit" for every flit space that is known to be available at the downstream input port. There are enough flit credits to maintain the transmission link pipeline full of flits in transit. An output port will never transmit more flits than it owns credits in order not to overflow the receiving input port. This is link-by-link "back-pressure" flow-control at the flit-level and it is common in high-speed links. Note that even though flow-control can protect the downstream input ports from overflowing, it is still possible that congestion may form within the HPS fabric. Simply speaking, if two ports request the same output port in a SC, the storage for this output Port in the central queue will immediately get full. When a SC gets congested, a so-called "tree-saturation" forms as the congestion propagates from that SC upstream towards all sending SNIs which have to use this path. The adaptive routing is one of the factors which mitigate the end-to-end congestion problem but it is well known that this is by no means a complete solution. It is LAPI which applies end-to-end congestion control so that the fabric will not get overly congested for lengthy periods of time. LAPI sends up to a number of un-acknowledged packets over the HPS but it will wait until these are eventually get acknowledged by the receiving LAPI sides. This is a "sliding-window" congestion control heuristic which throttles the rate of new packet injection into the HPS network until the older transient packets are removed and acknowledged by their destination LAPI sides. Source routes are pre-computed at HPS boot-up time and may change if any part of the network malfunctions. AIX continuously monitors the sanity of each SNI and HPS port and it re-generates routes in case a path stopped working properly.
The nominal throughput is 16Gibytes/sec per direction per switch-board pair. Note that Fig. 10 shows a fully connected single plane switch topology with 48 p5-575 nodes. TAMU has a total of 52 nodes currently available and 48 nodes are connected to the two switch planes. The above HPS connection topology requires the 48 nodes to be grouped together in three host groups, each with 16 nodes. Each group connects to another group with 8 full-duplex copper HPS links (for each HPS plane). The supports a nominal 16 Gibytes/sec inter-group communication capacity per direction per HPS plane.
Each SC has a low-latency (~59 nano-secs per packet), and it can administer automatic link-level retries, among others. There is a 0.4 micro-second latency across a switch-board. See a discussion on the HPS architecture and ways to efficiently use it in this PDF. Please note that it is a draft and under continuous development.
HPS Communication Stack
User POE (MPI) or LAPI applications can use the high-speed hardware capabilities of the SNIs and the HPS by invoking the HPS protocol stack. This is a multi-layer communications s/w which is shown in Fig. 7.
The underlying reliable communications transport protocol is called LAPI (for "Low-level Application Programming Interface"). LAPI is a "single-sided" communication protocol which executes in user-space (hence the designation "US"). LAPI executes in the same virtual memory address space and processor context as the application code invoking it. LAPI negotiates directly with the device-driver and the micro-code of the SNI to setup data transfers on behalf of the application. The advantage of protocol stacks executing in this way includes bypassing the AIX kernel, saving expensive processor mode switch and avoiding user to kernel and back data copies. LAPI can also take advantage of the cache memory to avoid DRAM access as much as possible. Specifically, when the user places data in application send buffers, and LAPI is immediately invoked to send the data out (by some MPI or LAPI call) the protocol can read this data directly off of the L2/L3 cache memory and send them off using RDMA through the SNI. On the receiving side, the SNI places ("injects") data arriving to the destination node directly into the L2 cache. MPI application code which can access this data directly off of the L2 cache which has a valid copy of them. See Fig. 12 and Fig. 13 for an illustration of how SNIs connect to the rest of the SMP nodes. This makes the notion of POE (and LoadLeveler) "SNI Affinity" important as it ensures that the sending and receiving MPI tasks runs on processors with the GX+ buses that the SNI directly attaches. See Fig. 4 for an illustration of the connection of SNIs to GX+ buses on the SMP nodes. Note that at TAMU, each SNI attaches to two separate GX+ buses simultaneously in order to have full support for the 4+4 GiBytes/sec bandwidth needed by the two HPS ports per node.
Please consult this PDF to gain a deeper insight in the inner workings of the underlying HPS stack.
LAPI supports three high-speed, user-space message transmission modes, which can be directly used by POE (MPI) or LAPI applications. The first one is called FIFO (or "Packet-Mode"), the second one, Remote Direct Memory Access (RDMA or "Bulk Transfer") mode and the third one is called Shared-Memory message passing. The last one avoids using any SNI resources at all and carries out the MPI communications by a common shared memory segment. The Shared-Memory communication is enabled by default among all POE/MPI tasks which are assigned to run within the same SMP node. LAPI also supports an interface with an underlying UDP transport layer such us those provided by any common networking stack. We will not be discussing any further these last two modes. However, performance results will be included later.
In FIFO packet-mode, MPI applications typically prepare messages into application send buffers, using LoadD and STore memory instructions. LAPI breaks these messages into 2KiB packets and stores them into SNI accessible memory called "send FIFOs" (sFIFO). This is illustrated in Fig. 12. Data in sFIFO are DMAed directly into the SNI adapter and then are injected into the HPS network, without the help of the processor. Similarly, packets arriving from the HPS are DMAed into "receive FIFOs"(rFIFO) which are then collected by LAPI to form the original message which is copied into the application receive buffers. Because the SNI can read and write directly to cache-memory coherently, data copies from application to sFIFO buffers can be avoided, if data is still in cache. The send requires at most one I/O (GX+) bus crossing but can avoid a memory access. Similarly, data from rFIFO buffers are written to application memory, but the very first cache-line. When the last FIFO packet is received, the first cache line is also written to signify the successful end of message reception. At this point, all application message contents is likely still in the cache memory and receiving application thread can be woken up to retrieve it and process it. Careful tuning of the communications protocol avoids extra data copies and crossing of the processor/memory bus, unless it is absolutely necessary. Note that Even though the FIFO mode can request to use both SNIs the benefits can be rarely realized. Please see below our POE/MPI performance results when using FIFO mode with sn_single (one SNI) vs. sn_all (all SNIs).
Using FIFO in Interactive Executions of POE Code
To invoke FIFO mode in Interactive POE, set
export MP_RESD=yes export MP_EUILIB=us export MP_EUIDEVICE=sn_single export MP_USE_BULK_XFER=no poe POEbinary [PoeOptions] [PoeBinaryOptions]Please use man poe to see a list of available options for the poe command.Using FIFO in Batch Execution of POE Code (with LoadLeveler)
To invoke FIFO mode in Load Leveler jobs, set
#@ bulkxfer=no #@ network.MPI = sn_single,usage,US,HIGH ... #@queue ... poe POEbinary [PoeOptions] [PoeBinaryOptions]In RDMA mode (see Fig. 13), LAPI programs the SNI hardware to DMA data from the sending application buffer directly to the receiving application's one, without requiring the processor to execute any protocol on the target side. However, RDMA incurs an initial LAPI "rendezvous"exchange to setup the transfer on the target side. This extends the message latency by at least one full Round-Trip time between the two end-points. Due to this initial overhead, RDMA is beneficial for larger messages, typically messages with size > 64Kibytes or size > 128 Kibytes. RDMA is not enabled by default and it requires the user to explicitly request it by the system. Note that applications with messages of size larger than 128 Kibytes could benefit from message striping, that is by requesting to use both SNIs simultaneously. RDMA transfers can achieve high throughput for large MPI (or LAPI messages) when both SNI adapters are requested. Local experiments has shown that 3Gibytes / sec per direction are possible when both SNIs are employed. On the other hand, if messages are relatively small, local performance experiments have shown that the FIFO mode is more efficient than the RDMA mode. Please see below our POE/MPI performance results when using RDMA mode and sn_single (one SNI) vs. sn_all (all SNIs).
Fig. 13 Remote Direct Memory Access (RDMA) User-Space message transmission mode of the HPS, with data-striping and fail-over across multiple SNIs.
Using RDMA in Interactive Executions of POE Code
To invoke RDMA mode in Interactive POE, set
export MP_RESD=yes export MP_EUILIB=us export MP_EUIDEVICE=sn_all export MP_USE_BULK_XFER=yes export MP_BULK_MIN_MSG_SIZE=Threshold_to_start_Using_RDMA poe POEbinary [PoeOptions] [PoeBinaryOptions]Please use man poe to see a list of available options for the poe command.Using RDMA in Batch Execution of POE Code (LoadLeveler)
To invoke RDMA mode in Load Leveler jobs, set
#@ bulkxfer=yes #@ network.MPI = sn_all,usage,US,HIGH export MP_BULK_MIN_MSG_SIZE=Threshold_to_start_Using_RDMA ... #@queue ... poe POEbinary [PoeOptions] [PoeBinaryOptions]When we specify as the HPS device "sn_single" we are requesting to use a single SNI adapter, and when we specify "sn_all" we are requesting to use all available SNI adapters. usage can be "shared" or "not_shared" for shared and not shared adapter use, respectively.Important Note 1: There is a threshold for MPI message sizes below which the system will not engage RDMA even if we have explicitly requested RDMA. By default LAPI will revert to the "FIFO packet" mode, when we attempt to send MPI (or LAPI) messages which are less than 150 KiB (153600 bytes). A user can change this threshold by assigning a different value to the MP_BULK_MIN_MSG_SIZE environment variable. Our experiments have shown RDMA benefits for messages with size as low as 64KiB. The improvement becomes more prominent as message sizes get larger than 128KiB. See the performance section below.
Important Note 2: Our experiments have shown that RDMA is beneficial only if used along with the "sn_all" specification in LL scripts or interactively with the use of environment variable MP_EUIDEVICE set to sn_all. The sn_all value instructs the HPS stack to allocate both HPS planes for use by the POE application.
Important Note 3: Our experiments have shown that FIFO packet mode and the "sn_all" declaration offers very minimal benefits even though the HPS stack will still set aside resources on both HPS planes. This increases the strain on the HPS resources on the nodes, and therefore, the SC Facility strongly discourages the use of FIFO mode with device "sn_all".
Important Note 4: There are actually two different ways to invoke POE interactively. One involves LoadLeveler and the other does not. There are easy to miss but differences which may play a big role in performance. The basic issue is that all system resources which are necessary for the User Space protocols can not be allocated directly by the users. One has to use the "Resource Manager" portion of the protocol stack to accomplish this. In our case this part is played by LoadLeveler. When we directly run POE code at the command line but the environment variable MP_RESD (or the POE option -resd) are set to"no" or not set at all, LoadLeveler is not involved. This means that even though we can launch MPI code interactively we do not have access to the high performance User Space protocols. In this situation, the protocol stack reverts to a plain socket mode of TCP/UDP or IP over HPS protocol. This is not that bad, but our experiments have shown that maximum performance is around 1Gibyte of throughput, which is a far cry from the 4Gibyte HPS's "wire speed". To be able to access the US protocol interactively we have to invoke LoadLever by setting environment variable MP_RESD=yes or using the POE option -resd yes at the command line. See below the results from our performance experiments.
Note 5: There are several performance advantages and optimizations in the HPS protocol stack which are only available to 64-bit executable code. We strongly encourage the use of 64-bit code on Hydra, POE or otherwise.
HPS Performance
We present below our preliminary performance results on the achievable "bandwidth" of MPI message passing communication over the HPS fabric. We examined the different communication modes which are available on the 1600 cluster. We present the performance of both blocking and non-blocking MPI calls using the different POE message transmission modes and HPS options. Note that there are several tuning options for POE and LAPI which we have used (e.g., single thread, polling, non-polling modes, interrupt enabled, etc.) whose detailed explanation is beyond the scope of the present write up. However, p namely
- Shared-Memory IPC for same node-tasks, with "shared" and "not_shared" node usage;
- FIFO (user space) packet mode across the HPS, using single or both SNIs, with shared and not_shared SNI adapter usage;
- RDMA (user space) bulk data transfer mode across the HPS, using single or both SNIs, with shared and not_shared SNI adapter usage; and finally
- IP (kernel space, UDP) mode which uses IP over HPS, for single or both SNIs, with shared and not_shared node usage.
The results are reported below in three PostScript files, each emphasizing a different message size region to illustrate better the performance curves. Note that this results are preliminary and under continuous development. At a later time the results will be expanded, explained and analyzed in a formal report. However, these preliminary results could already provide some good insight on the performance capabilities and used to provide rough guidelines as to how MPI code can set the various POE/LAPI options.
- PS PDF for messages up to 256KiB;
- PS PDF for messages up to 1MiB; and
- PS PDF for messages up to 32MiBs
DDN S2A9550 Cluster Storage and GPFS File Systems
hydra is directly attached to 20 Tera-bytes of disk space on a high-performance Data-Direct Network's DDN A29550 RAID array (see Fig. 14). The connection to the DDN RAID is through eight 4-Gigabit/sec fibre-channel links, with two FC links hosted by each of four I/O p5-575 server nodes. On the RAID side (see Fig. 15), each logical disk (LUN) is protected with two parity disks for increased recovery capabilities, far beyond the usual N+1 RAID configurations.
Fig. 14 The 48 p5-575 TAMU cluster with additional details for the connectivity of the four GPFS I/O server nodes to the DDN S2A9550.
Fig. 15 Details of the DDN S2A9550 RAID array.
The hydra cluster deploys the latest version of GPFS which is IBM's high-performing, highly-scalable clustered file system. Currently, /home, /usr/local, /work and /scratch are GPFS file systems available to the users of hydra and are configured with respect to different performance objectives. /work and /scratch are striped four-ways across the four GPFS I/O servers. The number of I/O servers as well as the topology of their connectivity with the rest of the systems was chosen with respect to performance. Four I/O servers can fully utilize the raw throughput capacity of the high-end DDN S2A9559 RAID. The topology of the four GPFS servers is meant to spread the file traffic evenly over the HPS paths to all nodes, as evenly as possible. One can notice in the diagram in Fig. 10 that each cluster node has the same path length distance to all four GPFS I/O servers to avoid the instance were traffic has to wait for the longest path when the four paths have unequal lengths.
I/O Performance Results for Hydra and DDN S2A9550
Performance experiments conducted after these modifications have demonstrated that now Hydra cluster is performing at (or better than) the highest published I/O throughput levels for this type of hardware and RAID Storage.
Specifically, aggregate I/O read throughput at raw device level in the I/O nodes, conducted when Hydra was under full production workload, is at ~ 2600.7 Megabytes / sec. This is very close to the "theoretical" maximum performance the vendor DDN claims that the storage array can ever achieve under ideal conditions.
Preliminary performance experiments on IBM's parallel file system (GPFS) running on hydra, have reached ~ 2323.89 Megabytes/sec sustained, which is better than the ~ 2000 Megabytes/sec. IBM has achieved with their top of the line systems under ideal conditions.
It is expected that the same experiments conducted under an idle system can reach higher levels.
Please inquire with the facility staff if you would like to have your code achieve the best possible performance with respect to GPFS I/O.
Standard Benchmark Performance Results for Power5+ and p5-575
Before the introduction of the Power6 processor, p5-575 servers using the Power5+ processor achieved top rankings in a number of standard industry and application benchmarks. The p5-575 achieved the highest SPECfp_rate2000 measurement and the highest SPECompM2001 measurement for any 8-core server. Furthermore, both the 8-core and 16-core p5-575 systems had higher memory bandwidths than other 8-core and 16-core high-density RISC-based systems, respectively . The table below shows some standard benchmarking numbers for individual p5-575 nodes. We could say that each one of the 52 p5-575 nodes is approximately 2.5 to 3 times as fast as our former Power4-based IBM p690 (agave) system.
Multi-user Performance (AIX V5.3)
| Processor(s) | GHz | L3 cache (MB) |
Proc Mem Bandwidth | SPEC (CPU2000) | SPEC web99 | SPEC web99 SSL | |||
| int rate | int rate base | fp rate | fp rate base | ||||||
| 16-core Power5+ | 1.9 | 288 | 202.7 GB/sec | 314 | 310 | 571 | 541 | ||
SPEC and LINPACK Performance (AIX V5.3)
| Processor(s) | GHz | L3 cache (MB) |
SPEC (CPU2000) | LINPACK | |||||
| int | int base | fp | fp base | DP | TPP | HPC | |||
| 1-core Power5+ | 1.9 | 36 | 1,526 | 1,473 | 3,042 | 2,830 | 1,315 | 7,140 | |
| 16-core Power5+ | 1.9 | 288 | 111,400 | ||||||
Cluster Configuration Summary
| Component | Specifications | |||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Total number of processors | 832 Power5+ at 1.9GHz | |||||||||||||||||||||||||||||
| Total physical memory | 1,760 Gibytes DDR2 DRAM at 533MHz | |||||||||||||||||||||||||||||
| Memory architecture within p5-575 node | Cache Coherent, Almost-Uniform Memory Access (cc-AUMA) | |||||||||||||||||||||||||||||
| Main Memory (DRAM) | 32 GB (49 p5-575 nodes), 64 GB (3 p5 575 nodes); 64 DIMMs at 533 MHz 2 SMI II (with 4 DIMMs per SMI) at 1066MHz per SMI per Power5+ chip; | |||||||||||||||||||||||||||||
| Memory architecture across p5-575 nodes | Distributed-Memory cluster supporting Message-Passing communications via the High-Performance Switch for MPI and LAPI code | |||||||||||||||||||||||||||||
| Operating system | AIX Version 5.3 with Parallel Environment, CSM, RSCT and GPFS. | |||||||||||||||||||||||||||||
| Processor type, ISA | 1.9 GHz IBM Power5+ ® μ-processor; 64-bit PPC Architecture; Big Endian | |||||||||||||||||||||||||||||
| Cache memories (on processor die) |
| |||||||||||||||||||||||||||||
| Number of physical processors / node | 16 [Fig. 1] | |||||||||||||||||||||||||||||
| Size of local memory / node | 32 Gigabytes DDR2 at 533MHz [Details in Fig. 4] | |||||||||||||||||||||||||||||
| Number of Processors per node | 16 (8 X DCMs) [DCM Fig. 4] | |||||||||||||||||||||||||||||
| Number of p5-575 nodes | 52 | |||||||||||||||||||||||||||||
| Cache coherence protocol within node | Enhanced-Distributed Switch (cc-AlmostUMA) | |||||||||||||||||||||||||||||
| Interconnection between p5-575 nodes for message passing | IBM® High-Performance SwitchTM
|
|||||||||||||||||||||||||||||
| Number of I/O nodes | 4 GPFS I/O server nodes; each with two 4Gb/sec Fibre Channel adapter | |||||||||||||||||||||||||||||
| I/O and PCI-X slots / node | Per p5-575 node:
| |||||||||||||||||||||||||||||
| Networking | Four 1-Gigabit (copper) ethernet ports per node | |||||||||||||||||||||||||||||
| System disks | Two 146.8 GB SCSI disks (40 nodes); Two 73.4 GB SCSI disks (12 nodes) | |||||||||||||||||||||||||||||
| Disk expansion unit | DDN S2A9550 20+ Terabyte RAID array (2 parity disks/LUN) |
References
Published Literature for IBM Cluster 1600 and Power5+
- B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner," POWER5 System Microarchitecture," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
- H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel, "Characterization of Simultaneous Multithreading (SMT) Efficiency in POWER5," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 555--564.
- Rajeev Sivaram, Rama K. Govindaraju Peter, Hochschild Robert, Blackmore, Piyush Chaudhary, "Breaking the Connection: RDMA Deconstructed," in Proc. of the 13th IEEE Symp. on High Performance Interconnects, (HOTI'05), 2005.
- D. W. Victor, J. M. Ludden, R. D. Peterson, B. S. Nelson, W. K. Sharp, J. K. Hsu, B.-L. Chu, M. L. Behm, R. M. Gott, A. D. Romonosky and S. R. Farago, "Functional Verification of the POWER5 Microprocessor and POWER5 Multiprocessor Systems," in IBM Journal of Research and Development, Vol. 46, Num. 1, 2005, pp. 541--553.
- R. M. Gott, J. R. Baumgartner, P. Roessler and S. I. Joe, "Functional Formal Verification on Designs of pSeries Microprocessors and Communication Subsystems," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
- R. K Govindaraju, P. Hochschild, D. Grice, K. Gildea, R. Blackmore, C. A. Bender, C. Kim, P. Chaudhary, J. Goscinski, J. Herring, S. Martin, J. Houston, "Architecture and Early Performance of the New IBM HPS Fabric and Adapter," in Lecture Notes in Computer Science, Volume 3296, pp. 156--165, Springer-Verlag, Dec 2004.
- J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le and B. Sinharoy, "POWER4 System Microarchitecture," in IBM Journal of Research and Development, Vol. 46, Num. 1, 2002, pp. 5--26.
- M. Banikazemi, R. K. Govindaraju, R. Blackmore, and D. K. Panda, "MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems," in IEEE Trans. Par. Distr. Systems, Vol. 12, Num. 10, pp. 1081--1093, 2001.
- Bulent Abali, Craig B. Stunkel, Jay Herring, Mohammad Banikazemi, Dhabaleswar K. Panda, Cevdet Aykanat and Yucel Aydogan, "Adaptive Routing on the New Switch Chip for IBM SP Systems", in Journal of Parallel and Distributed Computing, Vol. 61, Num. 9, pp. 1148--1179, Sep. 2001.
- G. Shah, J. Nieplocha, J. H. Mirza, C. Kim, R. J. Harrison, R. K. Govindaraju, K. J. Gildea, P. DiNicola, and C. A. Bender, "Performance and Experience with LAPI -- A New High-Performance Communication Library for the IBM RS/6000 SP," in Proc. of IEEE Combined IPPS/SPDP 1998, pp. 260--266.
- Y C. B. Stunkel, J. Herring, B. Abali and R. Sivaram, "A New Switch Chip for IBM RS/6000 SP Systems", in Proc. ACM/IEEE 1999 Supercomputing Conference, pp. 16, 13-18 Nov. 1999.
- Bulent Abali, "A Deadlock Avoidance Method for Computer Networks," in Proc. CANPC '97: First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, pp. 61--72, Springer-Verlag, 1997, London, UK.
- M. Snir, P. Hochschild, D. D. Frye, K. J. Gildea, "The communication software and parallel environment of the IBM SP2," IBM Systems Journal, Vol. 34, Issue 2, page 205, 1995.
- C. B. Stunkel, D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice, P. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, P. R. Varker, "The SP2 High-Performance Switch," IBM Systems Journal, Vol. 34, Issue 2, 1995.
- C. B. Stunkel, D. G. Shea, D.G. Grice, P. H. Hochschild and M. Tsao, "The SP1 High-Performance Switch," in Proc. of Scalable High-Performance Computing Conference, pp. 150--157, May, 1994.
- C. B. Stunkel, D. G. Shea, B. Abali, M. M. Denneau, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, M. Tsao and P. R. Varker, "Architecture and Implementation of Vulcan, in Proc. of the Inte'l Par. Proc. Symp." (IPPS94) , April 1994.
IBM RedBooks and White Papers
- IBM RedBook: An Introduction to the New IBM eServer pSeries High Performance Switch
- IBM WhitePaper: IBM High Performance Switch on System p5 575 Server - Performance, White Paper
- IBM WhitePaper: IBM eServer pSeries High Performance Switch - Tuning and Debug Guide
- IBM Reference Documentation: Switch Network Interface for eServer pSeries High Performance Switch Guide and Reference
- IBM Reference Documentation: RSCT LAPI Programming Guide
Collection of IBM Documentation, Technical Manuals and Libraries
here.Abbreviation Key
- KiB := 210 ("Kilo-binary-Byte")
- MiB := 220 ("Mega-binary-Byte")
- GiB := 230 ("Giga-binary-Byte")
- TiB := 230 ("Giga-binary-Byte")
- KB := 103 ("Kilo-Byte")
- MB := 106 ("Mega-Byte")
- GB := 109 ("Giga-Byte")
- TB := 109 ("Tera-Byte")
Notice
The discussions in this page relies on numerous technical sources, including, papers published in the research literature, IBM technical reports, manuals and personal communications with IBM developers and researchers. See References section above. The contents is responsibility of Michael E. Thomadakis and along with the original artwork remain (C) copyright of his and of Texas A & M University's. Figure 15 is courtesy of Data Direct Networks, Inc. Any of the contents of this page can be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.
Disclaimer: This page is Under Construction. Visit often for corrections and additions. Contact me at miket AT tamu.edu or at miket AT sc.tamu.edu for corrections and additions.