Texas A&M Supercomputing Facility Home Texas A&M

hydra.tamu.edu -- High-Performance
IBM p5-575 Cluster 1600
and
DataDirect Networks S2A9550 Storage
by Michael E. Thomadakis, Ph.D., SC staff (**)

hydra.tamu.edu is a high-performance "IBM Cluster 1600" based on IBM's Power5+ processor. The cluster consists of 40 p5-575 nodes, each with 16 Power5+ processors running at 1.9 GHz and 32 GBytes of DDR2 DRAM. A p5-575 node is a high-performance shared-memory multiprocessor (SMP) running the 64-bit version of AIX 5L (5.3) as a single system image.
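The cluster-wide totals follow directly from the per-node figures above. A minimal sketch of the arithmetic (the variable names are illustrative only and memory is kept in the units the text quotes):

```python
# Per-node figures quoted in the paragraph above.
NODES = 40
PROCESSORS_PER_NODE = 16
MEMORY_PER_NODE = 32          # "GBytes" as quoted in the text

total_processors = NODES * PROCESSORS_PER_NODE
total_memory = NODES * MEMORY_PER_NODE

print(f"processors: {total_processors}, memory: {total_memory}")
```

These are the same totals that Table 1 lists for the whole cluster.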

Power5+ processors implement the 64-bit PowerPC instruction set architecture (ISA). A Power5+ chip contains two identical processor cores within a single die (see Fig. 1). A Power5+ processor-memory module consists of a dual-core processor chip (with both cores active, operating at 1.9 GHz), eight memory DIMM slots and a private, high-performance, custom 36 MiB Level 3 (L3) cache memory. Each processor chip contains the two active processor cores, a shared 1.875 MiB Level 2 (L2) cache, the memory controller and the L3 cache directory. The L2 and L3 are each divided into three slices which can be accessed independently by either core. All on-chip facilities operate at the 1.9 GHz processor frequency. The memory controller, however, operates asynchronously, with the memory interface running at 1066 MHz to connect to the 533 MHz DDR2 memory. The high performance of the Power5+ architecture relies on the fact that the Fabric-Bus logic, the DRAM controller, and the L3 controller and directory are housed within the same die as the two processor cores. This saves a significant amount of off-chip communication and makes possible a tightly coupled, low-latency, high-bandwidth SMP system.

The Power5 architecture includes several improvements over the well-established Power4 architecture (see Fig. 2). A critical difference between the Power4 and Power5 processors is that the L3 cache and its control logic have been moved out of the critical path of communication between processor and main memory. This has decreased the latency of all processor-memory operations and has sped up coherence operations for the cache memories. In Power5, the L3 acts as a "victim cache" for all cache blocks that need to be replaced in L2. See the end of this page for some standard benchmark performance numbers.


Fig. 1 A Power5+ chip with two cores, L3 cache and DRAM controller, and Enhanced Distributed Switch for coherent SMP traffic.

A Power5 core supports Simultaneous Multi-Threading (SMT), under which two hardware threads can execute simultaneously within each core by sharing its resources. The objective of SMT is to allow the second h/w thread to utilize functional units in a core that the first h/w thread leaves idle. If sufficient functional units are available, both threads can use them "simultaneously" (i.e., in the same clock period) and make progress. The alternative would be to let a thread run until it has to stall (e.g., waiting for a lengthy FP operation or a cache miss to be handled), at which point the OS dispatcher would have to carry out a costly context switch with processor state swapping. SMT exploits concurrency at a very fine level and obviates the need for such a context switch.

When SMT is ON, each Power5 core appears as two logical processors, so an SMT ON p5-575 node appears to AIX as 32 logical processors. SMT-enabled processors offer performance advantages to certain types of computation mixes when the average run-queue length exceeds 16 and when the two h/w threads within a core do not contend for common functional units or for L1 and L2 cache memory blocks. If used wisely, SMT increases the instruction completion rate within a system. Studies suggest that the benefit for certain applications can reach 15% to 35%.
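The two quantitative points above, the logical processor count and the run-queue rule of thumb, can be written down as a small sketch. The function names are hypothetical, and the check deliberately ignores the second condition (contention for functional units and caches), which cannot be judged from the run-queue length alone:

```python
CORES_PER_NODE = 16
HW_THREADS_PER_CORE = 2   # two hardware threads per Power5 core under SMT


def logical_processors(cores: int, smt_on: bool) -> int:
    """Logical processors the OS sees on one node."""
    return cores * (HW_THREADS_PER_CORE if smt_on else 1)


def smt_worth_trying(avg_run_queue: float, cores: int = CORES_PER_NODE) -> bool:
    """Rule of thumb from the text: SMT can only help when, on average,
    more threads are runnable than there are physical cores."""
    return avg_run_queue > cores


print(logical_processors(CORES_PER_NODE, smt_on=True))
print(smt_worth_trying(24.0))
```

A positive answer only says that SMT has idle-thread demand to work with; whether a given code actually speeds up still has to be measured.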

SMT currently appears to be the next step in the architectural evolution of super-scalar processors, now that processors have started reaching clock frequency scalability limits. It is expected that SMT will not only increase the utilization of idle functional units but also make the Flop/Watt ratio more favorable. However, SMT is not a panacea and may not benefit all types of code. It has been shown to benefit code whose instructions have a higher than average clocks-per-instruction rate. There have also been cases where code experiences a relatively mild slow-down. Code with heavy usage of the floating point units or of the L1 and L2 caches most likely won't benefit from SMT. Compilers and operating systems must now take these new parameters into consideration if SMT is to be used beneficially. Even though SMT is not a new area, more investigation is necessary to make its deployment more readily usable.

Fig. 2 Architectural differences between Power4 and Power5 based systems.

Fig. 3 A Dual-Core Module (DCM) with 2 Power5+ cores and a shared 36MiB L3 cache.

Each of the 40 p5-575 nodes internally consists of a "compute planar" with 8 Dual-Core Power5+ Modules (DCMs) and an "I/O planar" providing extensive I/O capabilities. (IBM calls this particular p5-575 configuration an "I/O node configuration".)

A DCM (see Fig. 3) packages a dual-core Power5+ chip and a 36 MiB L3 cache memory in close proximity for reduced signal propagation delays and increased data exchange throughput. Eight DCMs connect together to form a 16-way SMP via the Distributed Bus Fabric (DBF), a high-speed interconnect (see Fig. 4). The DBF is a collection of buses connected in a distributed switch fabric fashion to provide cache coherence and high-speed data exchange among the processor cores. The fabric buses, which form a 2D interconnect for coherent SMP traffic, operate at 1/2 of the processor frequency and transfer 8 bytes per direction. Because all critical-path logic (such as the memory controller, fabric bus controllers, and L3 controller and directory) is on the same die as the two Power5+ cores, the p5-575 is a very tightly coupled SMP. Another important advantage of the Power5+ architecture in the p5-575 implementation is that the High-Performance Switch adapters attach directly, via the GX+ bus, to the processor-memory interconnect, bypassing expensive I/O bridging logic. The p5-575 SMP employs two such GX+ buses to support the high bandwidth needed by the two HPS switch ports. Each GX+ bus has more bandwidth than necessary to support one of the HPS links on the host.
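As a rough illustration of what "1/2 of the processor frequency, 8 bytes per direction" means in bandwidth terms, the sketch below computes a nominal figure for a single fabric bus. It assumes one 8-byte transfer per bus clock in each direction, which the text does not state explicitly, so the result is an upper bound under that assumption and not a measured or vendor-quoted number:

```python
PROCESSOR_HZ = 1.9e9          # Power5+ core clock
FABRIC_CLOCK_DIVISOR = 2      # fabric buses run at 1/2 of the processor clock
FABRIC_BYTES_PER_CLOCK = 8    # 8 bytes per direction (assumed: one transfer per bus clock)

fabric_per_direction = PROCESSOR_HZ / FABRIC_CLOCK_DIVISOR * FABRIC_BYTES_PER_CLOCK
fabric_both_directions = 2 * fabric_per_direction

print(f"{fabric_per_direction / 1e9:.1f} x 10^9 bytes/sec per direction")
print(f"{fabric_both_directions / 1e9:.1f} x 10^9 bytes/sec both directions")
```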


Fig. 4 A 16-way p5-575 node (8 DCMs with 2 Power5+ cores per DCM).

The powerful p5-575 nodes form high-performance clusters through an equally high-performance interconnect fabric, called the "High-Performance Switch" (HPS). At TAMU, the 40 nodes of the hydra cluster connect together through two HPS planes (see Fig. 5). The current HPS employs fourth-generation technology in the host adapter (see Fig. 7), the switch fabric (see Fig. 8) and the transmission links.

Fig. 5 The 40 node hydra p5-575 cluster with only one of the HPS planes shown.

Each p5-575 node connects to the two HPS fabric planes through a host-side switch adapter, called the Switch Network Interface (SNI), shown in Fig. 7. Each SNI has two full-duplex HPS transmission links, one for each of the switch planes. On the p5-575 side, each SNI attaches directly to two GX+ buses. Each GX+ is a full-duplex bus, 4 bytes wide per direction, which attaches directly to one Power5+ DCM (see Fig. 4) and runs at 1/3 of the processor clock. The nominal bi-directional throughput of a GX+ bus is rated at 5.067 Gibytes/sec (2.53 Gibytes/sec per direction). Notice in Fig. 3 that there is one GX+ bus per DCM. Each HPS link uses copper media and allows full-duplex communication, with a nominal throughput of 2 Gibytes/sec per direction. The raw throughput required by the two HPS links necessitated the connection of each SNI to two GX+ buses. Because the SNI adapter attaches directly to GX+ buses, it has very low latency access to the node's cache and main memories. The SNI contains a multi-threaded communications processor which supports several processor off-load capabilities to facilitate high-speed, low-latency, concurrent access to local and remote memories. One of the interesting design features of the SNI is that it can directly map user (application) memory for direct access, which avoids expensive intermediate data copies between user and system memory whenever possible.
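The GX+ figures above, and the reason one SNI needs two GX+ buses, can be reproduced with a few lines of arithmetic. This is a sketch under two assumptions: one 4-byte transfer per GX+ clock per direction, and rates expressed in units of 10^9 bytes/sec, which is what reproduces the quoted 5.067 figure:

```python
import math

PROCESSOR_HZ = 1.9e9        # Power5+ core clock
GX_CLOCK_DIVISOR = 3        # GX+ runs at 1/3 of the processor clock
GX_BYTES_PER_CLOCK = 4      # 4 bytes per direction (assumed: one transfer per bus clock)
HPS_LINK_PER_DIR = 2.0e9    # nominal HPS link rate per direction, as quoted
HPS_LINKS_PER_SNI = 2       # one link to each of the two switch planes

gx_per_dir = PROCESSOR_HZ / GX_CLOCK_DIVISOR * GX_BYTES_PER_CLOCK
gx_bidir = 2 * gx_per_dir

# Both HPS links together must be fed from the host side.
sni_demand_per_dir = HPS_LINKS_PER_SNI * HPS_LINK_PER_DIR
buses_needed = math.ceil(sni_demand_per_dir / gx_per_dir)

print(f"GX+ per direction: {gx_per_dir / 1e9:.3f}")
print(f"GX+ bi-directional: {gx_bidir / 1e9:.3f}")
print(f"GX+ buses needed for one SNI: {buses_needed}")
```

A single GX+ bus exceeds the rate of one HPS link but not of two, which is consistent with attaching each SNI to two buses.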

Fig. 6 The HPS software protocol stack.

The h/w capabilities of the SNIs and the HPS are utilized by the HPS communications protocol stack, shown in Fig. 6. The underlying reliable communications transport protocol is called LAPI (for "Low-Level Application Programming Interface"). LAPI is a "single-sided" communication protocol which executes in "User-Space", that is, in the same processor context as the application code invoking it. LAPI negotiates directly with the device driver and the micro-code of the SNI to set up data transfers on behalf of the application.

Fig. 7 Logical view of the HPS host adapter (SNI), attaching directly to two GX+ bus connections on the host side and two Switch Ports for the two HPS links, one to each of the two HPS switch planes.

LAPI supports three high-speed, user-space message transmission modes, which can be used directly by POE (MPI or LAPI) applications. The first is called FIFO (or "Packet-Mode"), the second is Remote Direct Memory Access (RDMA, or "Bulk Transfer") mode, and the third is Shared-Memory message passing, used when all POE/MPI tasks run within the same SMP node.

  • In FIFO packet-mode (see Fig. 12), MPI applications typically prepare messages in application send buffers, using Load and Store memory instructions. LAPI breaks these messages into 2 KiB packets and stores them into SNI-accessible memory called "send FIFOs" (sFIFO). Data in the sFIFO are DMAed directly into the SNI adapter and then injected into the HPS network, without the help of the processor. Similarly, packets arriving from the HPS are DMAed into "receive FIFOs" (rFIFO); LAPI collects them to re-assemble the original message, which is copied into the application receive buffers. Because the SNI can read and write cache memory coherently, data copies from application to sFIFO buffers can be avoided if the data is still in cache. The send requires at most one I/O (GX+) bus crossing but can avoid a memory access. Similarly, data from rFIFO buffers are written to application memory, except for the very first cache line. When the last FIFO packet is received, the first cache line is also written, to signify the successful end of message reception. At this point, all of the message contents are likely still in cache memory, and the receiving application thread can be woken up to retrieve and process them. Careful tuning of the communications protocol avoids extra data copies and crossings of the processor/memory bus unless they are absolutely necessary. Note that even though FIFO mode can request the use of both SNIs, the benefits are rarely realized. Please see below our POE/MPI performance results when using FIFO mode with sn_single (one SNI) vs. sn_all (all SNIs).

  • In RDMA mode (see Fig. 13), LAPI programs the SNI h/w to DMA data from the sending application's buffer directly to the receiving application's buffer, without requiring the processor to execute any protocol on the target side. However, RDMA incurs an initial LAPI "rendezvous" exchange to set up the transfer on the target side. This extends the message latency by at least one full round-trip time between the two end-points. Due to this initial overhead, RDMA is beneficial for larger messages, typically messages larger than 64 Kibytes or 128 Kibytes. RDMA is not enabled by default; the user must explicitly request it from the system. Note that applications with messages larger than 128 Kibytes could benefit from message striping, that is, from requesting the use of both SNIs simultaneously. RDMA transfers can achieve high throughput for large MPI (or LAPI) messages when both SNI adapters are requested. Local experiments have shown that 3 Gibytes/sec per direction is possible when both SNIs are employed. On the other hand, if messages are relatively small, local performance experiments have shown that FIFO mode is more efficient than RDMA mode. Please see below our POE/MPI performance results when using RDMA mode with sn_single (one SNI) vs. sn_all (all SNIs).

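    Using RDMA in Interactive Executions of POE Code
    To invoke RDMA mode in interactive POE, set

        export MP_EUILIB=us
        export MP_EUIDEVICE=euidevice
        export MP_USE_BULK_XFER=yes
        export MP_BULK_MIN_MSG_SIZE=Threshold_to_start_Using_RDMA

    where euidevice is sn_single to use a single SNI adapter or sn_all to use all SNI adapters, and the threshold is a message size in bytes.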

    Using RDMA in Batch Execution of POE Code (LoadLeveler)
    To invoke RDMA mode in LoadLeveler jobs, set

        #@ bulkxfer=yes
        #@ network.MPI = device,usage,US,HIGH
        export MP_BULK_MIN_MSG_SIZE=Threshold_to_start_Using_RDMA

    where device can be sn_single to use a single SNI adapter or sn_all to use all SNI adapters, and usage can be shared or not_shared for shared and non-shared adapter use, respectively.

    Important Note 1: There is a threshold for MPI message size below which the system will not engage RDMA, even if it has been requested, but will revert to the default "FIFO packet" mode. The default value for this threshold is currently set to 150 KiB (153600 bytes). A user can change this threshold by assigning a different value to the MP_BULK_MIN_MSG_SIZE environment variable. Our experiments have shown RDMA benefits for messages as small as 64 KiB. The improvement becomes more prominent as message sizes grow beyond 128 KiB. See the performance section below.

    Important Note 2: Our experiments have shown that RDMA is beneficial only if used along with the "sn_all" specification in LL scripts, or interactively with the environment variable MP_EUIDEVICE set to sn_all. The sn_all value instructs the HPS stack to allocate both HPS planes for use by the POE application.

    Important Note 3: Our experiments have shown that FIFO packet mode combined with the "sn_all" declaration offers only minimal benefits, even though the HPS stack will still set aside resources on both HPS planes. This increases the strain on the HPS resources on the nodes, and therefore the SC Facility strongly discourages the use of FIFO mode with sn_all.
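The rules described in the two mode descriptions and the three notes above can be summarized in a short sketch. The function names are hypothetical and nothing here is part of POE or LAPI; the packet count ignores any per-packet header overhead, and the device suggestion simply encodes the facility's recommendations from Notes 2 and 3:

```python
import math

FIFO_PACKET_BYTES = 2 * 1024            # LAPI FIFO packet size (2 KiB)
DEFAULT_BULK_MIN_MSG_SIZE = 150 * 1024  # default MP_BULK_MIN_MSG_SIZE (153600 bytes)


def fifo_packets(msg_bytes: int) -> int:
    """Number of 2 KiB FIFO packets needed for a message (headers ignored)."""
    return math.ceil(msg_bytes / FIFO_PACKET_BYTES)


def transfer_mode(msg_bytes: int, bulk_xfer: bool = False,
                  threshold: int = DEFAULT_BULK_MIN_MSG_SIZE) -> str:
    """RDMA is used only when requested AND the message is not below the threshold."""
    if bulk_xfer and msg_bytes >= threshold:
        return "RDMA"
    return "FIFO"


def suggested_device(mode: str) -> str:
    """Notes 2 and 3: sn_all pays off with RDMA, sn_single is preferred for FIFO."""
    return "sn_all" if mode == "RDMA" else "sn_single"


for size in (4 * 1024, 64 * 1024, 1024 * 1024):
    mode = transfer_mode(size, bulk_xfer=True)
    print(size, mode, suggested_device(mode), fifo_packets(size))
```

With the default threshold, a 64 KiB message still travels in FIFO mode even when bulk transfer is requested; lowering MP_BULK_MIN_MSG_SIZE to 65536 corresponds to calling `transfer_mode(size, True, threshold=64 * 1024)` in this sketch.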

An HPS "switch-board" is a 16X16 switching fabric (see Fig. 8) which connects 16 Power5 host SNI ports together. A switch-board consists of 8 "Switch-Chips" (SCs) similar to the one shown in Fig. 9. Each 16X16 HPS switch-board is a Bi-directional Multi-Stage Interconnection Network ("BMIN") or "Fat-Tree". At TAMU the 40 p5-575 hosts are connected together with three HPS switch-boards in a "triangular" topology, as shown in Fig. 10. This particular connectivity is necessary to maintain the Fat-Tree topology for systems with more than 16 host ports. Note that Fig. 10 shows one of the two HPS planes installed at TAMU. Each transmission link is bi-directional and can carry data at 2 Gibytes/sec per direction.

SCs implement "cut-through" switching with buffering at the "Central-Queue" only for packets whose destination output port is busy. When a destination output port is idle, packet data flow directly from the input port to this output port ("cut-through switching"). Routing in the HPS is source-based and adaptive. The source SNI specifies up to four different routes per SC for the portion of the path reaching the "Least Common Ancestor" (LCA) SC. As a packet travels towards the LCA, each switching SC which receives it selects one out of a predetermined set of possible output ports to forward it. Fig. 11 presents some pertinent examples of this routing approach. Notice that there are four distinct paths connecting endpoints (A, B), one path for (D, E) and again four paths for endpoints (C, X).

The choice of a particular output port is based on its current load. If all output ports are busy, the packet is stored in the memory of the central buffer within the SC. An SC will use (or schedule to use) the least loaded output port among the four that the source route allows at each switching point. When the packet reaches the LCA switch chip it "turns" and travels downward towards the destination SNI. The path from the LCA to the destination is always unique and there is no adaptivity. An input port has 8 Kibytes of space for incoming packets. A packet is divided into "flits" (for flow-control digits) and transmitted from an output port to the downstream input port. When a flit is received, the input port sends an acknowledgment to the upstream output port. The output port has one flit "credit" for every flit space that is known to be available at the downstream input port. There are enough flit credits to keep the transmission link pipeline full of flits in transit. An output port will never transmit more flits than it owns credits for, so as not to overflow the receiving input port. This is link-by-link "back-pressure" flow control at the flit level, and it is common in high-speed links.

Note that even though flow control can protect the downstream input ports from overflowing, it is still possible for congestion to form within the HPS fabric. Simply put, if two ports request the same output port in an SC, the central-queue storage for this output port will fill up almost immediately. When an SC gets congested, a so-called "tree saturation" forms as the congestion propagates from that SC upstream towards all sending SNIs which have to use this path. Adaptive routing is one of the factors which mitigate the end-to-end congestion problem, but it is well known that it is by no means a complete solution.
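The link-level credit scheme described above can be illustrated with a toy model of one output port feeding one downstream input port. This is a sketch of generic credit-based back-pressure, not of the actual HPS hardware: the class name is hypothetical, the buffer capacity is given in flits because the flit size is not stated here, and a credit is returned when the downstream port frees a flit slot:

```python
class CreditedLink:
    """Toy model: one output port and the input port directly downstream of it."""

    def __init__(self, input_buffer_flits: int):
        self.credits = input_buffer_flits   # one credit per free flit slot downstream
        self.in_buffer = 0                  # flits currently held by the input port

    def try_send(self) -> bool:
        """Transmit one flit only if a credit is available."""
        if self.credits == 0:
            return False                    # back-pressure: the sender has to wait
        self.credits -= 1
        self.in_buffer += 1
        return True

    def drain_and_ack(self) -> None:
        """Downstream forwards one flit and returns its credit upstream."""
        if self.in_buffer > 0:
            self.in_buffer -= 1
            self.credits += 1


link = CreditedLink(input_buffer_flits=4)
results = [link.try_send() for _ in range(5)]
print(results)          # the fifth attempt is refused: no credits left
link.drain_and_ack()
print(link.try_send())  # one slot was freed, so one more flit may go
```

Because the sender can never hold more outstanding flits than the receiver has slots, the input buffer cannot overflow; the same mechanism is what lets a blocked port push congestion back upstream.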
It is LAPI which applies end-to-end congestion control, so that the fabric will not stay overly congested for lengthy periods of time. LAPI sends up to a certain number of unacknowledged packets over the HPS and then waits until these are acknowledged by the receiving LAPI side. This is a "sliding-window" congestion control heuristic which throttles the rate of new packet injection into the HPS network until the older packets in transit are removed and acknowledged by their destination LAPI sides. Source routes are pre-computed at HPS boot-up time and may change if any part of the network malfunctions. AIX continuously monitors the health of each SNI and HPS port, and it re-generates routes if a path stops working properly.
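A sliding window of this kind can be sketched in a few lines. The class below is a generic illustration, not LAPI's implementation: the window size, the sequence numbering and the use of cumulative acknowledgements are all assumptions made for the example:

```python
from collections import deque


class SlidingWindowSender:
    """Allows at most `window` unacknowledged packets in flight."""

    def __init__(self, window: int):
        self.window = window
        self.next_seq = 0
        self.unacked = deque()      # sequence numbers sent but not yet acknowledged

    def can_send(self) -> bool:
        return self.window > len(self.unacked)

    def send(self) -> int:
        if not self.can_send():
            raise RuntimeError("window full: wait for acknowledgements")
        seq = self.next_seq
        self.unacked.append(seq)
        self.next_seq += 1
        return seq

    def ack(self, seq: int) -> None:
        """Cumulative ack: everything up to and including `seq` was received."""
        while self.unacked and seq >= self.unacked[0]:
            self.unacked.popleft()


sender = SlidingWindowSender(window=3)
sent = [sender.send() for _ in range(3)]
print(sent, sender.can_send())   # window is full, injection is throttled
sender.ack(1)                    # receiver confirms packets 0 and 1
print(sender.can_send(), list(sender.unacked))
```

New packets enter the network only as fast as old ones are confirmed, which is how the injection rate adapts to congestion without any explicit signal from the switches.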

The nominal throughput is 16 Gibytes/sec per direction per switch-board pair. Note that Fig. 10 shows a fully connected single-plane switch topology with 48 p5-575 nodes. TAMU currently has a total of 40 nodes and two switch planes. The above HPS connection topology requires the 40 nodes to be grouped together in three host groups, which at TAMU currently consist of 14, 12 and 12 nodes, respectively. Each group connects to each other group with 8 full-duplex copper HPS links (for each HPS plane). This supports a nominal 16 Gibytes/sec inter-group communication capacity per direction per HPS plane.
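The capacity and port counts of the triangular topology can be checked with simple arithmetic. The sketch below uses only the link rate, link count and switch count quoted in the text (rates are in the text's own units); the two-plane aggregate is an extrapolation that assumes traffic is striped evenly over both planes:

```python
LINK_RATE_PER_DIR = 2        # per HPS link, per direction, as quoted
LINKS_PER_SWITCH_PAIR = 8    # full-duplex links between any two switch-boards
SWITCHES_PER_PLANE = 3
PLANES = 2
HOST_PORTS_PER_SWITCH = 16

inter_group_per_plane = LINKS_PER_SWITCH_PAIR * LINK_RATE_PER_DIR
inter_group_both_planes = inter_group_per_plane * PLANES        # assumes even striping

# Each switch-board talks to the two others, 8 links each.
inter_switch_ports_per_board = (SWITCHES_PER_PLANE - 1) * LINKS_PER_SWITCH_PAIR
host_ports_per_plane = SWITCHES_PER_PLANE * HOST_PORTS_PER_SWITCH

print(inter_group_per_plane, inter_group_both_planes)
print(inter_switch_ports_per_board, host_ports_per_plane)
```

The results line up with the 16 inter-switch ports per switch-board listed in Table 1 and with the 48 host ports of the fully populated topology in Fig. 10.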

Each SC has low latency (~59 nanoseconds per packet) and can administer automatic link-level retries, among other features. There is a 0.4 microsecond latency across a switch-board.

See a discussion on the HPS architecture and ways to efficiently use it in this PDF. Please note that it is a draft and under continuous development.
We present below our preliminary performance results on the communication "bandwidth" of MPI over the HPS fabric. We present the performance of both blocking and non-blocking MPI calls using the different POE message transmission modes and HPS options, namely
  • Shared-Memory IPC for same node tasks, with shared and not_shared node usage;
  • FIFO (user space) packet mode across the HPS, using single or both SNIs, with shared and not_shared SNI adapter usage;
  • RDMA (user space) bulk data transfer mode across the HPS, using single or both SNIs, with shared and not_shared SNI adapter usage; and finally
  • IP (kernel space, UDP) mode which uses IP over HPS, for single or both SNIs, with shared and not_shared node usage.

The results are reported below in three PostScript files, each emphasizing a different message size region to better illustrate the performance curves. Note that these results are preliminary and under continuous development. At a later time the results will be expanded, explained and analyzed in a formal report. However, these preliminary results already provide some good insight into the performance capabilities and can be used as rough guidelines for how MPI code should set the various POE/LAPI options.

  • PS PDF for messages up to 256 KiB;
  • PS PDF for messages up to 1 GiB; and
  • PS PDF for messages up to 32 GiB.

hydra is directly attached to 20+ Terabytes of disk space on a high-performance DataDirect Networks (DDN) S2A9550 RAID array (see Fig. 14). The connection to the DDN RAID is through eight 4-Gigabit/sec Fibre Channel (FC) links, with two FC links hosted by each of four I/O p5-575 server nodes. On the RAID side (see Fig. 15), each logical disk (LUN) is protected with two parity disks for increased recovery capabilities, far beyond the usual N+1 RAID configurations.

The hydra cluster deploys the latest version of GPFS, which is IBM's high-performing, highly scalable clustered file system. Currently, /home, /usr/local, /work and /scratch are GPFS file systems available to the users of hydra and are configured with respect to different performance objectives. /work and /scratch are striped four ways across the four GPFS I/O servers. The number of I/O servers, as well as the topology of their connectivity with the rest of the system, was chosen with respect to performance. Four I/O servers can fully utilize the raw throughput capacity of the high-end DDN S2A9550 RAID. The topology of the four GPFS servers is meant to spread the file traffic as evenly as possible over the HPS paths to all nodes. One can notice in the diagram in Fig. 10 that each cluster node has the same path length to all four GPFS I/O servers, to avoid the case where traffic has to wait for the longest path when the four paths have unequal lengths.

I/O Performance Results for Hydra and DDN S2A9550
During the last months, Hydra, our IBM Cluster 1600, and our DataDirect Networks S2A9550 storage went through a series of h/w I/O reconfigurations and system tunings. Performance experiments conducted after these modifications have demonstrated that the Hydra cluster is now performing at (or better than) the highest published I/O throughput levels for this type of h/w and RAID storage.

Specifically, the aggregate I/O read throughput at the raw device level in the I/O nodes, measured while Hydra was under full production workload, is ~ 2600.7 Megabytes/sec. This is very close to the "theoretical" maximum performance that the vendor, DDN, claims the storage array can achieve under ideal conditions.

Preliminary performance experiments on IBM's parallel file system (GPFS) running on hydra have reached ~ 2323.89 Megabytes/sec sustained, which is better than the ~ 2000 Megabytes/sec IBM has achieved with their top-of-the-line systems under ideal conditions.

It is expected that the same experiments conducted on an idle system can reach higher levels.
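To put the measured figures in proportion, the sketch below derives a few ratios from the numbers quoted above. It assumes that "Megabytes" means 10^6 bytes and that a 4-Gigabit/sec Fibre Channel link carries at most 4 x 10^9 bits/sec of raw data; link encoding overhead is ignored, so the nominal per-link figure is an optimistic bound:

```python
RAW_READ_MB_S = 2600.7     # aggregate raw-device read throughput quoted above
GPFS_MB_S = 2323.89        # sustained GPFS throughput quoted above
IO_SERVERS = 4
FC_LINKS = 8
FC_LINK_GBIT_S = 4         # nominal Fibre Channel link rate

gpfs_fraction_of_raw = GPFS_MB_S / RAW_READ_MB_S
raw_per_server = RAW_READ_MB_S / IO_SERVERS
raw_per_fc_link = RAW_READ_MB_S / FC_LINKS
nominal_fc_link_mb_s = FC_LINK_GBIT_S * 1000 / 8    # bits to bytes, no encoding overhead

print(f"GPFS reaches {gpfs_fraction_of_raw:.1%} of the raw-device rate")
print(f"raw rate per I/O server: {raw_per_server:.1f} MB/sec")
print(f"raw rate per FC link: {raw_per_fc_link:.1f} of {nominal_fc_link_mb_s:.0f} MB/sec nominal")
```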

Please inquire with the facility staff if you would like to have your code achieve the best possible performance with respect to GPFS I/O.

p5-575 servers have achieved top rankings in a number of standard industry and application benchmarks. The p5-575 has achieved the highest SPECfp_rate2000 measurement and the highest SPECompM2001 measurement for any 8-core server. Furthermore, both the 8-core and 16-core p5-575 systems have higher memory bandwidths than other 8-core and 16-core high-density RISC-based systems, respectively. Take a look at the end of this page to get some standard benchmarking numbers for individual p5-575 nodes. We could say that each one of the 40 p5-575 nodes is approximately 2.5 to 3 times as fast as our Power4-based IBM p690 (agave) system.

(**) Notice: The discussion on this page relies on numerous technical sources, including papers published in the research literature, IBM technical reports, manuals and personal communications with IBM developers and researchers. See the References section below. The contents are the responsibility of Michael E. Thomadakis and, along with the original artwork, remain (C) copyright of the author and of Texas A&M University. Figure 15 is courtesy of Data Direct Networks, Inc. Any of the contents of this page can be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.

Disclaimer: This page is Under Construction. Visit often for corrections and additions. Contact me at miket AT tamu.edu or at miket AT sc.tamu.edu for corrections and additions.

See also hydra's user guide for information on compiling and running interactive and batch jobs.

Table 1. Power5+ and p5-575 Configuration -- Summary

Component: Specifications

Total number of processors: 640 Power5+ at 1.9 GHz
Total physical memory: 1,280 Gibytes DDR2 DRAM at 533 MHz
Memory architecture within p5-575 node: Cache Coherent, Almost-Uniform Memory Access (cc-AUMA)
Main Memory (DRAM): 32 GB per p5-575; 64 DIMMs at 533 MHz; 2 SMI II (with 4 DIMMs per SMI) at 1066 MHz per SMI per Power5+ chip
Memory architecture across p5-575 nodes: Distributed-Memory cluster supporting Message-Passing communications via the High-Performance Switch for MPI and LAPI code
Operating system: AIX Version 5.3 with Parallel Environment, CSM, RSCT and GPFS
Processor type, ISA: 1.9 GHz IBM Power5+ ® μ-processor; 64-bit PPC Architecture; Big Endian
Cache memories (on processor die), listed as total size; block size; associativity; write policy; access latency in clocks:
  • L1 instruction (one per core): 64 KiB; 128 bytes; 2; N/A; 2
  • L1 data (one per core): 32 KiB; 128 bytes; 4; write-through; 2
  • L2 (shared by both cores): 1.875 MiB unified, 3 independent slices; 128 bytes; 10 (512 GC); write-back; 12
  • L3 (shared by both cores): 36 MiB unified, 3 independent slices; 512 bytes (2X128 Byte blocks); 12; write-back, victim cache for L2; 80
Number of physical processors / node: 16 [Fig. 1]
Size of local memory / node: 32 Gigabytes DDR2 at 533 MHz [Details in Fig. 4]
Number of processors per node: 16 (8 X DCMs) [DCM Fig. 4]
Number of p5-575 nodes: 40
Cache coherence protocol within node: Enhanced-Distributed Switch (cc-AlmostUMA)
Interconnection between p5-575 nodes for message passing: IBM® High-Performance Switch™
  • Dual-plane, HPS triangular topology [Fig. 10]
  • Three HPS switches connected together per plane
  • Eight HPS links per switch-pair per plane
  • Host HPS adapter: SNI with two bi-directional ports [Fig. 7]
  • Each HPS is 16X16 with 16 host and 16 inter-switch HPS ports [Fig. 8]
  • Each HPS switch-board consists of 8 Switch Chips [Fig. 9], which are 8X8 cross-bar switching elements with virtual output queuing and a 32 KiB central queue for blocked packets
  • 2 Gibytes/sec per direction per link
  • 59 nanoseconds latency per switch chip
  • Path length is 1, 2 or 4 switch chips, depending on the number of switch chips the path crosses between the origin and destination nodes
Number of I/O nodes: 4 GPFS I/O server nodes, each with two 4 Gb/sec Fibre Channel adapters
I/O and PCI-X slots / node: per p5-575 node
  • 3 X PCI-X buses (64-bit, 133 MHz)
  • 1 dual SNI port HPS adapter attached to two GX+ buses [Fig. 7]
  • 2 X 146.8 GB Ultra320 SCSI disk drives
Networking: Four 1-Gigabit (copper) Ethernet ports per node
System disks: Two 146.8 GB SCSI disks per node
Disk expansion unit: DDN S2A9550 20+ Terabyte RAID array (2 parity disks/LUN)


    Fig. 8 The 16X16 HPS Switchboard. Each one of the six TAMU switches is like this one.

    Fig. 9 Logical view of the Switch Chip: 8X8 cross-bar with virtual output queuing and 32Kibytes central buffer.

    Fig. 10 Schematic detail of the "triangular" HPS connection that makes up an HPS plane. Each of the three switches connects through 8 HPS links to each of the other two switches. The schematic shows only one of the two planes that TAMU has installed.

    Fig. 11 Adaptive Least-Common Ancestor Routing examples as they apply to HPS.

    Fig. 12 Packet-Mode "FIFO" User-Space message transmission mode of the HPS.

    Fig. 13 Remote Direct Memory Access (RDMA) User-Space message transmission mode of the HPS, with data-striping and fail-over across multiple SNIs.

    Fig. 14 The 40 p5-575 TAMU cluster with additional details for the connectivity of the four GPFS I/O server nodes to the DDN S2A9550.

    Fig. 15 Details of the DDN S2A9550 RAID array.
    References
    1. Main Documentation for Power5 Cluster 1600
    2. RSCT/LAPI (2.4.4.x) for AIX Cluster 1600
    3. Parallel Environment (POE/mpi, 4.3.1.x) for AIX Cluster 1600
    4. LoadLeveler batch scheduler (3.4.1.x) for AIX Cluster 1600
    5. AIX Information Center and Documentation Library
    6. AIX Compiler Documentation Library
    7. IBM RedBooks An Introduction to the New IBM eServer pSeries High Performance Switch
    8. IBM WhitePaper IBM High Performance Switch on System p5 575 Server - Performance, White Paper
    9. IBM WhitePaper IBM eServer pSeries High Performance Switch - Tuning and Debug Guide
    10. IBM Reference Documentation Switch Network Interface for eServer pSeries High Performance Switch Guide and Reference
    11. B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner, "POWER5 System Microarchitecture," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
    12. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel, "Characterization of Simultaneous Multithreading (SMT) Efficiency in POWER5," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 555--564.
    13. J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le and B. Sinharoy, "POWER4 System Microarchitecture," in IBM Journal of Research and Development, Vol. 46, Num. 1, 2002, pp. 5--26.
    14. D. W. Victor, J. M. Ludden, R. D. Peterson, B. S. Nelson, W. K. Sharp, J. K. Hsu, B.-L. Chu, M. L. Behm, R. M. Gott, A. D. Romonosky and S. R. Farago, "Functional Verification of the POWER5 Microprocessor and POWER5 Multiprocessor Systems," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 541--553.
    15. R. M. Gott, J. R. Baumgartner, P. Roessler and S. I. Joe, "Functional Formal Verification on Designs of pSeries Microprocessors and Communication Subsystems," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
    16. R. K Govindaraju, P. Hochschild, D. Grice, K. Gildea, R. Blackmore, C. A. Bender, C. Kim, P. Chaudhary, J. Goscinski, J. Herring, S. Martin, J. Houston, "Architecture and Early Performance of the New IBM HPS Fabric and Adapter," in LNCS, Volume 3296, Springer-Verlag, Dec 2004, pp. 156--165.
    17. G. Shah, J. Nieplocha, J. H. Mirza, C. Kim, R. J. Harrison, R. K. Govindaraju, K. J. Gildea, P. DiNicola, and C. A. Bender, "Performance and Experience with LAPI - a New High-Performance Communication Library for the IBM RS/6000 SP," in Proc. of IEEE Combined IPPS/SPDP 1998, pp. 260--266.
    18. M. Banikazemi, R. K. Govindaraju, R. Blackmore, and D. K. Panda, "MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems," in IEEE Trans. Par. Distr. Systems, Vol. 12, Num. 10, pp. 1081--1093, 2001.
    19. R. Sivaram, R. K. Govindaraju, P. Hochschild, R. Blackmore, and P. Chaudhary, "Breaking the Connection: RDMA Deconstructed," in Proc. of the 13th IEEE Symp. on High Performance Interconnects (HOTI'05), 2005.
    20. C. B. Stunkel, D. G. Shea, B. Abali, M. M. Denneau, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, M. Tsao and P. R. Varker, "Architecture and Implementation of Vulcan," in Proc. of the Int'l Par. Proc. Symp. (IPPS94), April 1994.
    Abbreviation Key
    • KiB := 2^10 bytes ("Kilo-Binary-Byte")
    • MiB := 2^20 bytes ("Mega-Binary-Byte")
    • GiB := 2^30 bytes ("Giga-Binary-Byte")
    • KB := 10^3 bytes ("Kilo-Byte")
    • MB := 10^6 bytes ("Mega-Byte")
    • GB := 10^9 bytes ("Giga-Byte")
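The difference between the binary and decimal prefixes is small for KiB but grows to about 7% at the GiB level, which matters when comparing the bandwidth and capacity figures on this page. A minimal sketch of the conversions (the helper name is illustrative):

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30     # binary prefixes
KB, MB, GB = 10**3, 10**6, 10**9        # decimal prefixes


def gib_to_gb(gib: float) -> float:
    """Convert a size in GiB to GB."""
    return gib * GIB / GB


print(150 * KIB)                 # the default RDMA threshold in bytes
print(f"{gib_to_gb(32):.2f}")    # 32 GiB expressed in GB
print(f"{GIB / GB:.4f}")         # ratio between the two "giga" units
```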

    IBM p5-575 Cluster 1600
    Standard Performance Numbers
    (Feel free to contribute your own benchmarking results)

    Multi-user Performance (AIX V5.3)
    (The int rate and fp rate columns are SPEC CPU2000 results.)

    Processor        GHz  L3 cache (MB)  Proc Mem BW  int rate  int rate base  fp rate  fp rate base  SPECweb99  SPECweb99 SSL
    16-core POWER5+  1.9  288            202.7GB/sec  314       310            571      541           --         --

    SPEC and LINPACK Performance (AIX V5.3)
    (The int and fp columns are SPEC CPU2000 results.)

    Processor        GHz  L3 cache (MB)  int    int base  fp     fp base  LINPACK DP  LINPACK TPP  LINPACK HPC
    1-core POWER5+   1.9  36             1,526  1,473     3,042  2,830    1,315       --           7,140
    16-core POWER5+  1.9  288            --     --        --     --       --          --           111,400
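One way to read the LINPACK HPC column is as a scaling check from 1 core to 16 cores. The sketch below uses only the numbers listed above and compares the 16-core result with 16 times the 1-core result; it also confirms that the 288 MB of L3 per node is simply 8 DCMs with 36 MB each:

```python
LINPACK_HPC_1_CORE = 7_140
LINPACK_HPC_16_CORE = 111_400
CORES_PER_NODE = 16
L3_PER_DCM_MB = 36
DCMS_PER_NODE = 8

ideal_16_core = CORES_PER_NODE * LINPACK_HPC_1_CORE
scaling_efficiency = LINPACK_HPC_16_CORE / ideal_16_core
l3_per_node_mb = L3_PER_DCM_MB * DCMS_PER_NODE

print(ideal_16_core)
print(f"{scaling_efficiency:.1%}")
print(l3_per_node_mb)
```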

    Document last modified: [Monday March 31, 2008]
    Site maintained by webmaster@sc.tamu.edu