hydra.tamu.edu -- High-Performance
hydra.tamu.edu is a high-performance "IBM Cluster 1600" based on IBM's Power5+ processor. The cluster consists of 40 p5-575 nodes, each with 16 Power5+ processors running at 1.9 GHz and 32 GBytes of DDR2 DRAM. A p5-575 node is a high-performance shared-memory multi-processor (SMP) running the 64-bit version of AIX 5L (5.3) as a single system image.

Power5+ processors implement the 64-bit PowerPC instruction set architecture (ISA). A Power5+ chip contains two identical processor cores on a single die (see Fig. 1). A Power5+ processor-memory module consists of a dual-core processor chip (with both cores active, operating at 1.9 GHz), eight memory DIMM slots and a private, high-performance, custom 36 MiB Level 3 (L3) cache memory. Each processor chip contains two active processor cores, a shared 1.9 MiB Level 2 (L2) cache, the memory controller and the L3 cache directory. The L2 and L3 caches are each divided into three slices, which can be accessed independently by either core. All on-chip facilities operate at the 1.9 GHz processor frequency. The memory controller, however, operates asynchronously, with the memory interface running at 1066 MHz to connect with the 533 MHz DDR2 memory.

The high performance of the Power5+ architecture relies on the fact that the Fabric-Bus logic, the DRAM controller, the L3 controller and the L3 directory are housed on the same die as the two processor cores. This saves a significant amount of off-chip communication and makes possible a tightly coupled, low-latency, high-bandwidth SMP system. The Power5 architecture includes several improvements over the well-established Power4 architecture (see Fig. 2). A critical difference between the Power4 and Power5 processors is that the L3 cache and its control logic have been moved out of the critical path of communication between the processor and main memory. This has decreased the latency of all processor-memory operations and has sped up coherence operations for the cache memories.
In Power5, the L3 acts as a "victim cache" for all cache blocks which need to be replaced in L2. See the end of this page for some standard benchmark performance numbers.
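As a rough orientation, the node and clock figures above can be turned into aggregate numbers. The sketch below is illustrative only: the per-core rate of 4 floating-point operations per cycle (two fused multiply-add units) is an assumption that this page does not state, and the function name `peak_gflops` is hypothetical.

```python
# Figures taken from the text above.
NODES = 40
CORES_PER_NODE = 16
CLOCK_GHZ = 1.9
MEM_PER_NODE_GB = 32

# ASSUMPTION (not stated on this page): each core can complete two fused
# multiply-adds per cycle, i.e. 4 floating-point operations per cycle.
FLOPS_PER_CYCLE = 4


def peak_gflops(cores, clock_ghz=CLOCK_GHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOP/s for a given number of cores."""
    return cores * clock_ghz * flops_per_cycle


total_cores = NODES * CORES_PER_NODE
total_mem_gb = NODES * MEM_PER_NODE_GB
node_peak = peak_gflops(CORES_PER_NODE)
cluster_peak = peak_gflops(total_cores)

print(f"cores: {total_cores}, memory: {total_mem_gb} GB")
print(f"peak per node: {node_peak:.1f} GFLOP/s, cluster: {cluster_peak:.1f} GFLOP/s")
```

These are upper bounds under the stated assumption; sustained rates depend on the code and on memory behavior.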
A Power5 core supports Simultaneous Multi-Threading (SMT), under which two hardware threads can execute simultaneously within each core by sharing its resources. The objective of SMT is to allow the second h/w thread to utilize functional units in a core which the first h/w thread leaves idle. If sufficient functional units are available, both threads can utilize them "simultaneously" (i.e., in the same clock period) and make progress. The alternative would be to let a thread run until it has to stall (e.g., waiting for a lengthy FP operation or a cache miss to be handled), at which point the OS dispatcher would have to carry out a costly context switch with processor state swapping. SMT exploits concurrency at a very fine level and obviates the need for context switching. When SMT is ON, each Power5 core appears as two logical processors, so an SMT-enabled p5-575 node appears as 32 logical processors to AIX.

SMT-enabled processors offer performance advantages to certain types of computation mixes when the average run-queue length is greater than 16 and when the two h/w threads within a core do not contend for common functional units or for L1 and L2 cache blocks. If used wisely, SMT increases the completion rate for instructions within a system. Studies suggest that the benefit for certain applications can reach 15% to 35%. SMT currently appears to be the next step in the architectural evolution of super-scalar processors, now that processors have started reaching clock-frequency scalability limits. It is expected that SMT will not only increase the utilization of idle functional units but also make the Flop/Watt ratio more favorable. However, SMT is not a panacea and may not benefit all types of code. It has been shown to benefit code whose instructions experience a longer than average clocks-per-instruction rate. There have also been cases where code experiences a relatively mild slow-down.
Code with heavy usage of the floating-point units or of the L1 and L2 caches most likely won't benefit from SMT. Compilers and operating systems must now take these new parameters into consideration if SMT is to be used beneficially. Even though SMT is not a new area, more investigation is necessary to make its deployment more readily usable.
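The intuition that a second h/w thread fills cycles the first one leaves idle can be illustrated with a deliberately simple toy model. Everything in this sketch is an assumption made for illustration: a single shared issue slot, independent stalls, and no contention for caches or functional units. It is not a model of the actual Power5 pipeline, and the function names are hypothetical.

```python
def logical_processors(cores, threads_per_core=2):
    """Logical CPUs the OS sees when SMT is ON (two h/w threads per core)."""
    return cores * threads_per_core


def single_thread_utilization(stall):
    """Fraction of cycles in which one thread can use the issue slot."""
    return 1.0 - stall


def smt2_utilization(stall_a, stall_b):
    """TOY MODEL: the slot is idle only when both threads stall at once.

    Assumes stalls are independent and the threads never contend for
    the same resources, which real code does not guarantee.
    """
    return 1.0 - stall_a * stall_b


def relative_gain(stall_a, stall_b):
    """Throughput gain of the SMT toy model over running thread A alone."""
    return smt2_utilization(stall_a, stall_b) / single_thread_utilization(stall_a) - 1.0


print(logical_processors(16))
print(f"{relative_gain(0.25, 0.25):.0%}")
```

In this model the gain grows with the stall fraction and vanishes when a thread never stalls, which is consistent with the observation that code already saturating the floating-point units or caches has little to gain.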
Each of the 40 p5-575 nodes internally consists of a "compute planar" with 8 Dual-Core Power5+ Modules (DCMs) and an "I/O planar" providing extensive I/O capabilities. (IBM calls this particular p5-575 an "I/O node configuration".) A DCM (see Fig. 3) packages a dual-core Power5+ chip and a 36 MiB L3 cache memory in close proximity for reduced signal propagation delays and increased data exchange throughput. Eight DCMs connect together to form a 16-way SMP via the Distributed Bus Fabric (DBF), a high-speed interconnect operating at 1/2 of the processor's speed (see Fig. 4). The DBF is a collection of buses connected in a distributed switch fabric fashion to provide cache coherence and high-speed data exchange among the processor cores. The fact that all critical-path logic (such as the memory controller, the fabric bus controllers, and the L3 controller and directory) is on the same die as the two Power5 cores allows the p5-575 to be a very tightly coupled SMP. The fabric buses, which form a 2D interconnect for coherent SMP traffic, operate at 1/2 of the processor frequency and transfer 8 bytes per direction.

Another important advantage of the Power5+ architecture in the p5-575 implementation is that the High-Performance Switch adapters attach directly, via the GX+ bus, to the processor-memory interconnect, bypassing expensive I/O bridging logic. The p5-575 SMP employs two such GX+ buses to support the high bandwidth needed by the two HPS switch ports. Each GX+ bus has more bandwidth than necessary to support one of the HPS links on the host.
The powerful Power5+ 575 nodes form high-performance clusters with an equally high-performance interconnect fabric, called the "High-Performance Switch" (HPS). At TAMU, the 40 nodes of the hydra cluster connect together through two HPS planes (see Fig. 5). The current HPS employs fourth-generation technology in the host adapter (see Fig. 7), the switch fabric (see Fig. 8), and the transmission links.
Each p5-575 node connects to the two HPS fabric planes through a host-side switch adapter, called the Switch Network Interface (SNI), shown in Fig. 7. Each SNI has two full-duplex HPS transmission links, one for each of the switch planes. On the p5-575 side, each SNI attaches directly to two of the node's GX+ buses. Each GX+ is a full-duplex bus with 4 bytes per direction, which attaches directly to one Power5+ DCM (see Fig. 4) and runs at 1/3 of the processor clock. The nominal bi-directional throughput of a GX+ bus is rated at 5.067 Gibytes/sec (2.53 Gibytes/sec per direction). Notice in Fig. 3 that there is one GX+ bus per DCM. Each HPS link uses copper media and allows full-duplex communication, with a nominal throughput of 2 Gibytes/sec per direction. The raw throughput required by the two HPS links necessitates the connection of each SNI to two GX+ buses. Because the SNI adapter attaches directly to the GX+ buses, it has very low latency access to the node's cache and main memories. The SNI contains a multi-threaded communications processor which supports several processor off-load capabilities to facilitate high-speed, low-latency, concurrent access to local and remote memories. One interesting design feature of the SNI is that it can map user (application) memory for direct access, which avoids expensive intermediate data copies between user and system memory whenever possible.
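The bus ratings quoted above follow from clock rate, clock divisor and bus width. The short sketch below reproduces that arithmetic; the helper name `bus_gbytes_per_dir` is hypothetical, and the result is in units of 10^9 bytes/sec, which is an assumption about how the nominal figures on this page were derived.

```python
CLOCK_GHZ = 1.9  # processor clock, from the text


def bus_gbytes_per_dir(clock_ghz, divisor, width_bytes):
    """Nominal throughput per direction: (clock / divisor) * bytes per transfer."""
    return clock_ghz / divisor * width_bytes


# GX+ bus: 1/3 of the processor clock, 4 bytes per direction.
gx_per_dir = bus_gbytes_per_dir(CLOCK_GHZ, 3, 4)
gx_bidir = 2 * gx_per_dir

# Fabric bus: 1/2 of the processor clock, 8 bytes per direction.
fabric_per_dir = bus_gbytes_per_dir(CLOCK_GHZ, 2, 8)

# One HPS link is rated at 2 per direction, so each GX+ bus has headroom.
HPS_LINK_PER_DIR = 2.0
headroom = gx_per_dir - HPS_LINK_PER_DIR

print(f"GX+: {gx_per_dir:.3f} per direction, {gx_bidir:.3f} bi-directional")
print(f"fabric bus: {fabric_per_dir:.1f} per direction")
print(f"GX+ headroom over one HPS link: {headroom:.3f}")
```

The bi-directional GX+ value rounds to the 5.067 rating quoted in the text, and the positive headroom is the sense in which one GX+ bus has more bandwidth than a single HPS link needs.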
The h/w capabilities of the SNIs and the HPS are utilized by the HPS communications protocol stack, which is shown in Fig. 6. The underlying reliable communications transport protocol is called LAPI (for "Low-Level Application Programming Interface"). LAPI is a "single-sided" communication protocol which executes in "User-Space", that is, in the same processor context as the application code invoking it. LAPI negotiates directly with the device driver and the micro-code of the SNI to set up data transfers on behalf of the application.
LAPI supports three high-speed, user-space message transmission modes, which can be used directly by POE (MPI or LAPI) applications. The first is FIFO (or "Packet-Mode"), the second is Remote Direct Memory Access (RDMA, or "Bulk Transfer") mode, and the third is Shared-Memory message passing, which is available to POE/MPI tasks that all run within the same SMP node.
An HPS "switch-board" is a 16X16 switching fabric (see Fig. 8) which connects 16 Power5 host SNI ports together. A switch-board consists of 8 "Switch-Chips" (SCs) similar to the one shown in Fig. 9. Each 16X16 HPS switch-board is a Bi-directional Multi-Stage Interconnection Network ("BMIN") or "Fat-Tree". At TAMU the 40 p5-575 hosts are connected together with three HPS switch-boards in a "triangular" topology, as shown in Fig. 10. This particular connectivity is necessary to maintain the Fat-Tree topology for systems with more than 16 host ports. Note that Fig. 10 shows one of the two HPS planes installed at TAMU. Each transmission link is bi-directional and can carry data at 2 Gibytes/sec per direction. SCs implement "cut-through" switching, with buffering at the "Central-Queue" only for packets whose destination output port is busy. When a destination output port is idle, packet data flow directly from the input port to this output port.

Routing in the HPS is source based and adaptive. The source SNI specifies up to four different routes per SC for the portion of the path that reaches the "Least Common Ancestor" (LCA) SC. As a packet travels towards the LCA, the SC which receives it selects one out of a predetermined set of possible output ports on which to forward it. Fig. 11 presents some pertinent examples of this routing approach. Notice that there are four distinct paths connecting endpoints (A, B), one path for (D, E) and again four paths for endpoints (C, X). The choice of output port is based on the current load of the candidate ports. If all output ports are busy, the packet is stored in the memory of the central buffer within the SC. An SC will use (or schedule to use) the least loaded output port among the four that the source route allows at each switching point. When the packet reaches the LCA switch chip it "turns" and travels downward towards the destination SNI.
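The upward port selection described above can be sketched as follows. This is a minimal illustration, not the HPS micro-code: the port numbers, the `load` dictionary (packets already queued per output port) and the function name `route_upward` are all hypothetical.

```python
def route_upward(allowed_ports, load):
    """Pick an output port for a packet travelling towards the LCA.

    allowed_ports: ports the source route permits at this switch chip.
    load: mapping port -> number of packets already waiting on that port
          (0 means the port is idle).
    Returns (port, mode). Ties are broken by the lowest port number,
    which is an arbitrary choice made for this sketch.
    """
    port = min(allowed_ports, key=lambda p: (load[p], p))
    # An idle port lets the packet cut through; otherwise it waits in
    # the central queue until the chosen port becomes free.
    mode = "cut-through" if load[port] == 0 else "central-queue"
    return port, mode


# Hypothetical switch chip with four upward ports allowed by the source route.
print(route_upward([4, 5, 6, 7], {4: 3, 5: 0, 6: 1, 7: 2}))
print(route_upward([4, 5, 6, 7], {4: 3, 5: 2, 6: 1, 7: 2}))
```

Only the upward half of the path is adaptive in this way; once the packet turns at the LCA there is a single downward path and nothing to choose.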
The path from the LCA to the destination is always unique and there is no adaptivity. An input port has 8 Kibytes of space for incoming packets. A packet is divided into "flits" (for flow-control digits) and transmitted from an output port to the downstream input port. When a flit is received, the input port sends an acknowledgment to the upstream output port. The output port holds one flit "credit" for every flit space that is known to be available at the downstream input port. There are enough flit credits to keep the transmission link pipeline full of flits in transit. An output port will never transmit more flits than it holds credits for, so that it cannot overflow the receiving input port. This is link-by-link "back-pressure" flow control at the flit level, and it is common in high-speed links.

Note that even though flow control protects the downstream input ports from overflowing, congestion may still form within the HPS fabric. Simply put, if two ports request the same output port in an SC, the storage for this output port in the central queue will quickly fill up. When an SC gets congested, a so-called "tree saturation" forms as the congestion propagates from that SC upstream towards all sending SNIs which have to use this path. Adaptive routing is one of the factors which mitigate the end-to-end congestion problem, but it is well known that it is by no means a complete solution. It is LAPI which applies end-to-end congestion control so that the fabric does not stay overly congested for lengthy periods of time. LAPI sends up to a certain number of unacknowledged packets over the HPS and then waits until these are acknowledged by the receiving LAPI side. This is a "sliding-window" congestion control heuristic which throttles the rate of new packet injection into the HPS network until the older packets in transit have been removed and acknowledged by their destination LAPI sides.
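The credit mechanism described above can be sketched in a few lines. This is an illustrative model only: the class name `CreditLink`, the tiny buffer size and the simplification that a credit returns at the moment the downstream port frees a flit slot are assumptions of the sketch, not details taken from the HPS hardware.

```python
from collections import deque


class CreditLink:
    """One output port feeding one downstream input port."""

    def __init__(self, buffer_flits):
        # One credit per flit slot known to be free downstream.
        self.credits = buffer_flits
        self.downstream = deque()

    def try_send(self, flit):
        """Send a flit only if a credit is available (back-pressure)."""
        if self.credits == 0:
            return False  # sender must wait; the receiver cannot overflow
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_one(self):
        """Downstream forwards one flit, frees its slot, returns a credit."""
        if not self.downstream:
            return None
        flit = self.downstream.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_flits=2)
print([link.try_send(f) for f in ("f0", "f1", "f2")])  # third send is refused
print(link.drain_one())                                # frees one slot
print(link.try_send("f2"))                             # now accepted
```

The sender's credit count never exceeds the free space downstream, which is what prevents overflow; it does nothing about contention for an output port further along, which is why an end-to-end window is still needed.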
Source routes are pre-computed at HPS boot-up time and may change if any part of the network malfunctions. AIX continuously monitors the sanity of each SNI and HPS port and re-generates routes if a path stops working properly. The nominal throughput is 16 Gibytes/sec per direction per switch-board pair. Note that Fig. 10 shows a fully connected single-plane switch topology with 48 p5-575 nodes. TAMU has a total of 40 nodes currently available and two switch planes. The above HPS connection topology requires the 40 nodes to be grouped together in three host groups, which at TAMU currently consist of 14, 12 and 12 nodes, respectively. Each group connects to each other group with 8 full-duplex copper HPS links (for each HPS plane). This supports a nominal 16 Gibytes/sec of inter-group communication capacity per direction per HPS plane. Each SC has low latency (~59 nanoseconds per packet) and can administer automatic link-level retries, among other features. There is a 0.4 microsecond latency across a switch-board.

See a discussion on the HPS architecture and ways to efficiently use it in this PDF. Please note that it is a draft and under continuous development.
We present below our preliminary performance results on the communication "bandwidth" of MPI over the HPS fabric. We present the performance of both blocking and non-blocking MPI calls using the different POE message transmission modes and HPS options.

The results are reported below in three PostScript files, each emphasizing a different message size region to better illustrate the performance curves. Note that these results are preliminary and under continuous development. At a later time the results will be expanded, explained and analyzed in a formal report. However, these preliminary results already provide some good insight into the performance capabilities and can be used as rough guidelines for how MPI code should set the various POE/LAPI options.
hydra is directly attached to 20+ Terabytes of disk space on a high-performance Data Direct Networks (DDN) S2A9550 RAID array (see Fig. 14). The connection to the DDN RAID is through eight 4-Gigabit/sec fibre-channel (FC) links, with two FC links hosted by each of four I/O p5-575 server nodes. On the RAID side (see Fig. 15), each logical disk (LUN) is protected with two parity disks for increased recovery capabilities, far beyond the usual N+1 RAID configurations. The hydra cluster deploys the latest version of GPFS, which is IBM's high-performing, highly scalable clustered file system. Currently, /home, /usr/local, /work and /scratch are GPFS file systems available to the users of hydra, and they are configured with respect to different performance objectives. /work and /scratch are striped four ways across the four GPFS I/O servers. The number of I/O servers, as well as the topology of their connectivity with the rest of the system, was chosen with respect to performance. Four I/O servers can fully utilize the raw throughput capacity of the high-end DDN S2A9550 RAID. The topology of the four GPFS servers is meant to spread the file traffic as evenly as possible over the HPS paths to all nodes. One can notice in the diagram in Fig. 10 that each cluster node has the same path length to all four GPFS I/O servers, which avoids the situation where traffic has to wait for the longest path when the four paths have unequal lengths.

I/O Performance Results for Hydra and DDN S2A9550

During the last months, Hydra, our IBM 1600 Cluster, and our Data Direct Networks S2A9550 storage went through a series of h/w I/O reconfigurations and system tunings. Performance experiments conducted after these modifications have demonstrated that the Hydra cluster now performs at (or better than) the highest published I/O throughput levels for this type of h/w and RAID storage.
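For orientation, the link counts given above set a nominal ceiling on the fibre-channel side of the I/O path. The sketch below takes the quoted 4 Gigabit/sec per link at face value and ignores encoding and protocol overhead, which is an assumption; usable payload rates are lower. The variable names are hypothetical.

```python
# Figures taken from the text above.
IO_SERVERS = 4
FC_LINKS_PER_SERVER = 2
FC_LINK_GBIT = 4  # nominal Gigabit/sec per fibre-channel link

total_links = IO_SERVERS * FC_LINKS_PER_SERVER
aggregate_gbit = total_links * FC_LINK_GBIT

# ASSUMPTION: 1 Gbit = 1000 Mbit and no encoding/protocol overhead.
aggregate_mbytes = aggregate_gbit * 1000 / 8
per_server_mbytes = aggregate_mbytes / IO_SERVERS

print(f"{total_links} links, {aggregate_gbit} Gbit/s nominal")
print(f"{aggregate_mbytes:.0f} MB/s aggregate, {per_server_mbytes:.0f} MB/s per I/O server")
```

Measured throughput should be read against this kind of ceiling and against the storage array's own limits, whichever is lower.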
Specifically, the aggregate I/O read throughput at the raw device level in the I/O nodes, measured while Hydra was under full production workload, is ~2600.7 Megabytes/sec. This is very close to the "theoretical" maximum performance that the vendor, DDN, claims the storage array can achieve under ideal conditions. Preliminary performance experiments on IBM's parallel file system (GPFS) running on hydra have reached ~2323.89 Megabytes/sec sustained, which is better than the ~2000 Megabytes/sec IBM has achieved with their top-of-the-line systems under ideal conditions. It is expected that the same experiments conducted on an idle system can reach higher levels. Please inquire with the facility staff if you would like your code to achieve the best possible performance with respect to GPFS I/O.

p5-575 servers have achieved top rankings in a number of standard industry and application benchmarks. The p5-575 has achieved the highest SPECfp_rate2000 measurement and the highest SPECompM2001 measurement for any 8-core server. Furthermore, both the 8-core and 16-core p5-575 systems have higher memory bandwidths than other 8-core and 16-core high-density RISC-based systems, respectively. See the end of this page for some standard benchmark numbers for individual p5-575 nodes. We could say that each one of the 40 p5-575 nodes is approximately 2.5 to 3 times as fast as our Power4-based IBM p690 (agave) system.

(**) Notice: The discussion on this page relies on numerous technical sources, including papers published in the research literature, IBM technical reports, manuals and personal communications with IBM developers and researchers. See the References section below. The contents are the responsibility of Michael E. Thomadakis and, along with the original artwork, remain (C) copyright of him and of Texas A&M University. Figure 15 is courtesy of Data Direct Networks, Inc.
Any of the contents of this page can be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.

Disclaimer: This page is Under Construction. Visit often for corrections and additions. Contact me at miket AT tamu.edu or at miket AT sc.tamu.edu for corrections and additions. See also hydra's user guide for information on compiling and running interactive and batch jobs.

Table 1. Power5+ and p5-575 Configuration -- Summary

| Component | Specifications |
|---|---|
| Total number of processors | 640 Power5+ at 1.9 GHz |
| Total physical memory | 1,280 Gibytes DDR2 DRAM at 533 MHz |
| Memory architecture within p5-575 node | Cache Coherent, Almost-Uniform Memory Access (cc-AUMA) |
| Main Memory (DRAM) | 32 GB per p5-575; 64 DIMMs at 533 MHz; 2 SMI II (with 4 DIMMs per SMI) at 1066 MHz per SMI per Power5+ chip |
| Memory architecture across p5-575 nodes | Distributed-Memory cluster supporting Message-Passing communications via the High-Performance Switch for MPI and LAPI code |
| Operating system | AIX Version 5.3 with Parallel Environment, CSM, RSCT and GPFS |
| Processor type, ISA | 1.9 GHz IBM Power5+® μ-processor; 64-bit PPC Architecture; Big Endian |
| Cache memories (on processor die) | |
| Number of physical processors / node | 16 [Fig. 1] |
| Size of local memory / node | 32 Gigabytes DDR2 at 533 MHz [Details in Fig. 4] |
| Number of Processors per node | 16 (8 X DCMs) [DCM Fig. 4] |
| Number of p5-575 nodes | 40 |
| Cache coherence protocol within node | Enhanced-Distributed Switch (cc-AlmostUMA) |
| Interconnection between p5-575 nodes for message passing | IBM® High-Performance Switch |
| Number of I/O nodes | 4 GPFS I/O server nodes; each with two 4 Gb/sec Fibre Channel adapters |
| I/O and PCI-X slots / node | Per p5-575 node |
| Networking | Four 1-Gigabit (copper) Ethernet ports per node |
| System disks | Two 146.8 GB SCSI disks per node |
| Disk expansion unit | DDN S2A9550 20+ Terabyte RAID array (2 parity disks/LUN) |

Fig. 8 The 16X16 HPS Switchboard. Each one of the six TAMU switches is like this one.

Fig. 9 Logical view of the Switch Chip: 8X8 cross-bar with virtual output queuing and 32 Kibytes central buffer.

Fig. 11 Adaptive Least-Common Ancestor Routing examples as they apply to HPS.

Fig. 12 Packet-Mode "FIFO" User-Space message transmission mode of the HPS.

Fig. 13 Remote Direct Memory Access (RDMA) User-Space message transmission mode of the HPS, with data-striping and fail-over across multiple SNIs.

Fig. 14 The 40 p5-575 TAMU cluster with additional details for the connectivity of the four GPFS I/O server nodes to the DDN S2A9550.

Fig. 15 Details of the DDN S2A9550 RAID array.

References

Abbreviation Key

Multi-user Performance (AIX V5.3)

| Processor | GHz | L3 cache (MB) | Proc Mem BW | SPEC CPU2000 int rate | SPEC CPU2000 int rate base | SPEC CPU2000 fp rate | SPEC CPU2000 fp rate base | SPECweb99 | SPECweb99 SSL |
|---|---|---|---|---|---|---|---|---|---|
| 16-core POWER5+ | 1.9 | 288 | 202.7 GB/sec | 314 | 310 | 571 | 541 | -- | -- |

SPEC and LINPACK Performance (AIX V5.3)

| Processor | GHz | L3 cache (MB) | SPEC CPU2000 int | SPEC CPU2000 int base | SPEC CPU2000 fp | SPEC CPU2000 fp base | LINPACK DP | LINPACK TPP | LINPACK HPC |
|---|---|---|---|---|---|---|---|---|---|
| 1-core POWER5+ | 1.9 | 36 | 1,526 | 1,473 | 3,042 | 2,830 | 1,315 | -- | 7,140 |
| 16-core POWER5+ | 1.9 | 288 | -- | -- | -- | -- | -- | -- | 111,400 |

Document last modified: Monday March 31, 2008. Site maintained by webmaster@sc.tamu.edu