A High-Performance IBM
Power5+ p5-575 Cluster 1600
DDN S2A9550 Storage
for Texas A&M University (**)

by Michael E. Thomadakis, Ph.D.,2007−2010 (C)
Supercomputing Facility
Also see High-Performance Switch MPI Application Tuning Options

All material remains copyright © 2007−2010 of Michael E. Thomadakis and of Texas A&M University. The contents of this article may be used free of charge for educational purposes, provided that this Copyright Notice remains visible.

Table of Contents

Hydra is a high-performance IBM "Cluster 1600", with nodes based on IBM's highly successful, 64-bit Power5+ RISC microprocessor. The cluster currently consists of a total of 52 model p5-575 nodes, each having 16 Power5+ processors running at 1.9GHz and 32 GiBytes1 of DDR2 DRAM. Each p5-575 node is a high-performance, Shared-memory Multi-Processor (SMP), running the 64-bit version of AIX 5L (5.3) as a single system image. Out of the 52 nodes, 48 are connected together by a two-plane High-Performance Switch (HPS) fabric. The same 48 nodes connect to a DataDirect Networks S2A9550 high-performance disk storage. The normal mode of problem-solving on Hydra is running distributed or shared-memory computations under the control of the LoadLeveler batch scheduler. See the Beginner's Guide to "Hydra" High-Performance IBM 1600 Cluster for Texas A&M University for a quick introduction to Hydra's HPC environment and the Hydra's User Guide for information, including compiling and running interactive and batch jobs.

Abbreviation Key

This article refers to different common quantities to measure memory capacities and data transfer rates. The notation we will be using is defined in Table 1.

Table 1 Abbreviations of Quantities

Powers of 2 Powers of 10
KiB := 210 ("Kilo-binary-Byte") KB := 103 ("Kilo-Byte")
MiB := 220 ("Mega-binary-Byte") MB := 106 ("Mega-Byte")
GiB := 230 ("Giga-binary-Byte") GB := 109 ("Giga-Byte")
TiB := 240 ("Tera-binary-Byte") TB := 1012 ("Tera-Byte")
PiB := 250 ("Peta-binary-Byte") PB := 1015 ("Peta-Byte")

Usually, rates, such as data transfer or floating point operations per second, are expressed in powers of 10, while storage sizes in powers of 2. See this reference for a discussion on international units.


This write up presents a rather detailed discussion of the IBM "Cluster 1600" and underlying h/w and s/w technologies. Power5+ systems and 1600 Clusters have a balanced, overall, design where no resource becomes a significant bottleneck. The high performance of individual components rely on several cutting-edge innovations of the 2006--2007 time-frame. These technologies make Hydra one of the top high-performance platforms in the same time-frame. The discussion here may serve as background material for recent and future clusters in the Power series which currently are based on Power7 IBM microprocessors.

Our target audience falls into two broad categories. The first one includes those requiring intimate understanding of inner workings of the Power5+/Cluster 1600 platform, in order to develop scalar, parallel or distributed computations that fully utilize the underlying capabilities, while avoiding unnecessary bottlenecks of this platform.

The second one includes those who study parallel and distributed systems and need accurate and detailed account of the underlying architecture in order to improve them. The Power5+/Cluster 1600 architecture represents an important stage in the evolution of high-performance systems based on the Power Instruction Set Architecture.

Information presented here is not readily available elsewhere. Technical material on the p5-575 and the HPS internals are very hard to come by. The author of this article had to delve into a long list of sources from the research literature along with the standard onerous IBM technical documentation. All material remains copyright © 2007--2011 of Michael E. Thomadakis, of Texas A&M University and of the original sources. This work is dedicated to all those who have struggled to obtain the right information so they can use Power5+ and 1600 clusters efficiently.

In the Sections which follow, we focus on the Power5+ micro-architecture, the p5-575 multi-processor node, the High-Performance Switch, the associated software and protocol stacks which underlie message passing protocols such as MPI. Finally, we briefly introduce Load Leveler scheduler as a resource manager for the cluster and the General Parallel File System. References are given at the end for further investigation.

The Power5+ Processor

Power5+ processors implement the 64-bit PowerPC instruction set architecture, version 2.02 (PPC64 ISA V2.02). In the literature "Power5" and "Power5+", refer, respectively, to the 120 and 90 nano-meter lithography implementations of the chip. TAMU has the Power5+ version and we would be using Power5 and Power5+ interchangeably. Power5 builds on the well known and proven design points of Power4, which implemented version PPC64 ISA V2.01.

Overview of the Power5 Processor Chip

A Power5+ chip is a "Chip-Multi Processor" (CMP) as it contains two identical processor cores and shared Level 2 Cache memory within a single die Fig. 1 illustrates a Power5+ CMP chip. Theoretically each core can achieve 7.6 GFlops/sec and, unsurprisingly, a 16 core system, 121.6 GFlops/sec.

Power5+ chip

Fig. 1 A Power5+ Processor and Memory module. The processor chip contains two cores, L3 cache and DRAM controllers, and Enhanced Distributed Switch for coherent SMP traffic.

Power5+ processor-memory modules consists of a dual-core processor chip (with both cores active, each operating at 1.9 GHz), DRAM with up to eight DIMM slots and a private, high-performance, custom m36MiB Level 3 (L3) cache memory. Each processor chip contains a shared 1.9MB Level 2 (L2) cache, the memory controller and L3 cache directory. The L2 and L3 are each divided into three slices which can be accessed independently by any of the cores. The L2 slices can be accessed by the "Core-Interface Unit" (CUI) cross-bar.

All processor on-board resources, the CIU and the L2 operate at the 1.9 GHz frequency. However, the memory controller operates asynchronously, with the memory interface running at 1066 MHz to connect with the 533 MHz DDR2 memory. Notice that there is an "asymmetry" in the data bus widths connecting the parts together. Each Power5 core has two 32 byte input buses and only one 8 byte output bus. One input bus goes to the Level 1 instruction cache and the other to the L1 data cache. The CIU connects the two cores to each one of the three L2 memory cache slices. Each slice can be accessed independently by the cores. Each L2 cache memory slice has an 8 byte input and a 32 byte output bus connecting to the interface with the CIU. Each L2 slice has a 32 byte input and a 32 byte output bus attaching "downstream" to the "Enhanced Distributed Switch" (EDS). The EDS is a switch which routes data into and out of the Power5 processor to other neighboring Power5 chips. It is responsible to exchange data, commands and coherence traffic with neighboring EDS. The Layer 3 cache memory is off-chip and it is shared by both cores. There is one 16 byte input and a 16 byte output data bus connecting the L3 cache with the EDS. These buses and the L3 cache operate at 1/2 of the processor clock frequency, at 950MHz. The interface to the L3 cache is "elastic" which means that there is adequate buffering at both ends to allow for the signal timing variations that are possible at these high clock speeds. The L3 cache is shared between the two cores and it is also segmented into tree slices. Two of the coherent point-to-point connections with neighboring processor-memory modules are called the "Vertical" and the "SMP" fabric buses. Each is 8 bytes wide per direction and all operate at 1/2 the processor clock frequency, at 950MHz.

The high performance of the Power5+ architecture relies on the fact that the Fabric-Bus logic, the DRAM controller, the L3 controller and directory are housed within the same die as the two processor cores. This saves a significant amount of off-chip communications and makes possible a tightly coupled low-latency, high bandwidth CMP system. We will be discussing the cache and main memory subsystems to a much greater detail later in the memory hierarchy section of this article.

A Power5 processor chip consists of the following resources which are shared by both cores. Referring to Fig. 1, there are

Outside the Power5 chip, but at close physical proximity, we find the three-segment L3 cache memory. Physical proximity is accomplished by placing the L3 and the Power5 chips within the same carrier module. As we will see later this packaging comes into at least three variations, DCM, QCM or MCM.

Instruction Pipeline in the Core

Power5 processor cores are super-scalar microprocessors, with speculative, out-of-order execution data-paths, coupled with a multilevel storage hierarchy. Power5 has extensive support for data pre-fetching, branch prediction and speculative instruction execution. Fig. 2 illustrates in the part (a) the instruction pipeline and in part (b) the different functional units within a Power5 core.

Instruction pipeline of a Power5 core

Fig. 2 a.The instruction and execution pipelines of a Power5 processor.
b.The functional units in the Power5 processor data-path and their multiplicity for sharing between the two hardware SMT threads.

Power5 Pipeline Structure

The pipeline structure, illustrated in Fig. 2, part a, consists of a master pipeline (MP) and several execution (functional) unit pipelines (EPs), all of which can progress independently from each other. Each block in Fig. 2 part (a) is one pipeline stage which requires one clock cycle. The meaning of the various parts associated with this figure is as follows.

The MP presents speculative and in-order instructions to the mapping, sequencing, and dispatch functions of the core. It also ensures the orderly completion of the actually executed path and discards all speculative results associated with mispredicted branches. The execution unit pipelines allow out-of-order issuing of speculative and non-speculative instructions. The Power5 pipeline structure is very similar to that of Power4. Even the pipeline latencies including penalties for mispredicted branches and load-to-use latencies for L1 data cache hits remain the same.

Power5 Core Data-Path and Execution (Functional) Units

Each block in Fig. 2 part (b) is a Functional Units in the core and resources which are shared by both SMT h/w threads. FUs which are used during the execution of instructions are called "execution units" (EUs).

Instruction Flow through the Power5 Core Pipeline

In IF stage, the processor retrieves up to 8 instructions from the L1 (instruction) cache, in the same clock-cycle and feeds them into the "Instruction Fetch Buffers" . The L1 instruction cache is a 2-way, set-associative 64KiB on-chip memory. Instructions are scanned for branches and if they are found the direction is predicted (at BP stage) using 3 branch history tables. If all fetched instructions are branches, all can be predicted simultaneously. Power5 can predict the direction and the target of the branch. Information on branches are maintained in a special BIQ unit at fetch time. After fetch, instructions are put into buffers each capable of storing 24 instructions.

At D0 stage, up to 5 instructions are retrieved and sent through the three-stage instruction decode pipeline to form a dispatch group (at stage D1), all from the same thread. Complex instructions are cracked into internal operations (IOPs) to allow for simpler inner core data-flow. Instruction groups are then decoded in parallel. When all data-path resources are available the entire group is dispatched (at GD stage), with each instruction in program order.

From dispatch to completion, instructions in the Power5 pipeline are tracked in groups. Each group has an entry in the "Global Completion Table" (GCT). Each group can contain up to five "internal operations" (IOPs). The Global Completion Table (GCT) maintains information about each dispatched instruction group. Groups which are dispatched enter GCT and as instructions finish, the entry for the entire group is deallocated when all of its instructions have finished and their changes have been committed to architected storage. GCT entries are allocated and deallocated in program order for each thread, but entries for different threads can be intermingled.

After dispatch, each instruction goes through the "register renaming" stage (in MP stage) where "logical" (or "architected") registers referenced in the instruction are mapped to "physical" registers within the data-path. Power5 ISA prescribes that there are 32 64-bit general purpose registers (GPRs) available for general "integer" computation. Register renaming allows instructions referencing the same register but which have no true data dependency on each other, to proceed concurrently. This mechanism resolves the so called "write-after-write" and "write-after-read" data hazards and is one of the mechanisms commonly used to allow modern pipelines exploit available ILP within the instruction stream. Power5 cores provide 32 visible registers for general use to each SMT core, but internally the core offers 120 physical registers, available for register renaming. The Power5 ISA also prescribes that there are visible 32 64-bit floating-point registers, but the core also offers 120 physical registers which can be used for renaming in FP instructions.

After renaming, instructions enter shared "issue queues" waiting for the execution stage. When all input operands of an instruction are available, this instruction is placed in an issue queue (ISS stage). The processor then selects the oldest instruction in the ISS. Instructions can be chosen from different groups and SMT threads for issue without distinction. Up to 8 instructions can issue in each clock cycle simultaneously, one to each of the 8 functional units in the data-path (see later).

Power5 cores include the following instruction issue queues

Load and store instructions in a group before dispatching require the allocation of entries in the Load Reorder Queue (LRQ) and Store Reorder Queue (SRQ), respectively. These queues are used to maintain the program order of load and store instructions.

Upon issue, each instruction retrieves inputs from the physical registers (at RF stage) and executes at the proper functional unit. Specifically, the branch execution unit, the fixed-point execution units, and the logical condition register unit are used in the EX pipeline stage. The load and store units are used in the EA, DC, and Fmt pipeline stages. Floating-point execution units are used in the F1 through F6 pipeline stages.

Each Power5 core contains the following pipelined execution units:

The results are finally written back to the output physical register (at WB stage).

When all of the instructions in a group have finished (without generating any exception) and the group is the oldest group of a given thread, the group is committed (at CP stage). One group can commit per cycle from each thread.

Power5 instructions take

and Power5 cores can issue out-of-order up to 8 instructions in the execution pipelines

In summary, each Power5 processor core has eight execution units, each of which can issue instructions out of order, with a preference towards oldest instructions. Each execution unit can issue and complete an instruction at each clock cycle. Note that, the total latency of an instruction depends on the number and nature of each pipeline stage of execution and the latencies to generate results these instructions depend on. At any given clock cycle, a typical pipelined instruction workload may consist of 20 groups past dispatch (with five instructions per group), 32 outstanding loads, 16 outstanding misses, each for the two independent SMT threads.

Floating-Point Processing and Exception Handling

Power5+ processors implement, with the assistance of appropriate system software, a floating-point system compliant with the ANSI/IEEE Standard 754-1985, "IEEE Standard for Binary Floating-Point Arithmetic". This standard prescribes the binary representation of floating and fixed point quantities, required arithmetic operations (addition, subtraction, sqrt, etc.) and conditions which render machine arithmetic valid or invalid. When the result of an arithmetic operation cannot be considered valid or when precision is lost, the h/w signals a Floating-Point Exception (FPE).

Power5 Floating-Point Exceptions The following floating-point exceptions are detected by the processor:

  1. Invalid Operation Exception
    • Signaling NaN
    • Infinity - Infinity
    • Infinity Infinity
    • Zero Zero
    • Infinity Zero
    • Invalid Compare
    • Software-Defined Condition
    • Invalid Square Root
    • Invalid Integer Convert
  2. Zero Divide Exception
  3. Overflow Exception
  4. Underflow Exception
  5. Inexact Exception
Each floating-point exception has well defined conditions which raises it and the IEEE 754 standard determines specific procedures to handle it. A user application has a number of choices in how to respond, if necessary, to these exceptions. However, detailed treatment of FPEs is far beyond the scope of this write up.

Please review the following presentation on IEEE Floating-Point Standard and Floating Point Exception handling which has a section dedicated to Power5.

Simultaneous Multi-Threading

A Power5+ core supports Simultaneous Multi-Threading (SMT), a design scheme which allows two hardware threads to execute simultaneously within each core and share its resources. Fig. 2 part (b) shows the different functional units within a Power5 core which the two SMT threads can share.

Basic SMT Principles

The objective of SMT is to allow the 2nd hardware thread to utilize functional units in a core which the 1st hardware thread leaves idle. Fig. 3, left side, demonstrates the occupancy of functional units by the two SMT threads on a core over a small number of consecutive clock cycles. The vertical dimension shows the functional units of a core and the horizontal one, consecutive clock cycles. As you can see, both SMT threads may "simultaneously" (i.e., at the same clock period) utilize these units, making progress. See this subsection above and Fig. 2.b for the available functional units on a Power5 core.

The alternative to SMT would be to let a thread run until it has to stall (e.g., waiting for a lengthy FP operation to finish or a cache memory miss to be handled), at which point in time the OS dispatcher would have to carry out a costly context-switching operation with processor state swapping. SMT exploits concurrency at a very fine level and it obviates the need for a context-switching. The advantages of SMT are several, including among others, the increased utilization of functional units that would have remained idle, the overall increased throughput in instructions completed per clock cycle and the overhead savings from the lower number of thread switching operations.

SMT in Power5 cores

Fig. 3 Simultaneous Multi-Treading (SMT) concept on Power5 chips.

When SMT is ON, each Power5+ core appears to the Operating System as two logical processors. An SMT enabled p5-575 node appears as 32 logical processors to AIX.

Resources on Power5+ Shared Among SMT Threads

The Power5 core provides a number of partitioned or replicated units to be shared between the two h/w SMT threads, including

Practical SMT Benefits

SMT enabled processors offer performance advantages to certain types of computation mixes. In our experiments, parallel computations with tasks running on the same p5-575 node, proceed faster when the average run-queue length on a node is > 16. This is clearly shown on Fig. 4 which plots the wall-clock time of the same parallel code when run on a p5-575 node with SMT=OFF and when run with SMT=ON. As you can see, when the code has > 16 computation tasks (threads or processes), the SMT=OFF case starts taking much longer time to finish. It appears that the increase in the wall-clock time with SMT=OFF is exponential.

Performance Improvements with SMT=ON

Fig. 4 Performance Improvement with SMT=ON on a Dense Matrix Jacobi Problem with a 2000x2000 matrix size.

This is expected as there are more compute threads / processes than computation cores competing for resources. However, with SMT=ON, even though we request more cores than physically available, the wall-clock time is only affected slightly. Nevertheless, this demonstrates that regardless of the types of computation, when a node has to handle more compute threads than available cores, it will always perform better with SMT=ON.

One interesting note for those who need to benchmark their code to determine the number of processors to use or if SMT ON is beneficial, relates to the problem size. Please see the red colored curve on Fig. 4. The blue and the green curves where observed when the problem size was worth of a 2000X2000 dense matrix. However, when a smaller problem size of 1000X1000 is used, the code shows neither speedup nor slow-down. This tells us that a 1000X1000 problem size, is too small and the code cannot scale as we increase the number of computation tasks. However, it is worthwhile to notice that with SMT=ON even a non-scalable problem size, does not suffer from any significant slow-down as the case would be with SMT=OFF.

One may believe that if they request exactly the total number of processors available on a node for their parallel computation,they would never encounter this problem. However, in every parallel executable (MPI, OMP or Pthread based), the system creates auxiliary threads which support the parallelism of the computation threads. For instance, MPI under AIX ("POE") generates at least three additional threads for each MPI task. It is often the case, that compute threads waiting for message transfers to finish, continuously poll for the completion of the communication. This places a rather high compute load on the system, even though there is no actual progress in the user computation. Another situation that has been observed is that the Operating System, from time to time, has to carry out intense processing in bursts. One example is when it scans the page list looking for free pages to allocate to an application, or when, it has to carry out I/O operations. This happens when for instance, a user reads or writes to GPFS files. On a p5-575, when RDMA is not used for MPI communication (please see later discussion on HPS), the SNI has to execute device driver code to prepare message transmission or handle message reception. In short, even when one requests for computation exactly the number of available cores, it is very likely that their code will experience a slow down. SMT=ON is worthwhile exploring for code that consists of multiple threads which exceed in practice the number of physical cores. Single-Threaded mode is much slower in this case, exponentially so.

SMT is more beneficial when the contention for common resources shared by the two hardware threads is lower. These resources can be functional units within the core and L1, L2 or L3 cache memory blocks within the same Power5 chip. If used properly experiments have shown that SMT increases the completion rate for instructions within a system. Studies suggest that the benefit accrued for certain applications can reach up to 15% to 35% reduction in compute time.

The Future of SMT in Processors

The increased miniaturization doubles the number of transistors on a die approximately every 18 months ("Moore's Law"). This allows more functional units and memory to become available on-chip. It also implies an increase in the processor frequency of the processor clock. Unfortunately, even though digital electronics allow higher memory capacities, the access time has not been scaling down at the same pace as the processor clock periods. Increased transistor densities initially lead to the design of very large and complex, monolithic processor cores with a large number of pipeline stages. These chips are, in general, much harder to design and have very long and costly design / debug cycles.

The initial euphoria in modern processors employing super-scalar, wide-issue, super-pipelined, out of order completion pipelines, passed quickly when it became clear that these designs have passed the point of diminishing returns. Even in the most complex designs, most of the execution resources often remain idle because the pipelines stall waiting long memory accesses to finish, or requiring flushing due to mispredicted branch executions. Simply speaking, single thread computation in most codes does not have the high Instruction-Level Parallelism (ILP) opportunities which are necessary to keep the deep pipelines busy. One approach is to carry out an expensive multi-cycle thread-switching operation to dispatch a ready thread with resources available for execution. However, thread-switching requires draining the pipeline from current pending instructions and then experiencing a pipeline start-up overhead.

The higher chip density came at another increasing cost: the shrinking of the transistor gate sizes has increased leakage current, which lead to higher power consumption. With increased clock frequencies the on-chip interconnects have higher relative latencies and are becoming a major cost factor. In short, the higher power consumption and low resource utilization has made the designs of large monolithic processor cores with ever increasing clock frequencies undesirable.

SMT, along with Chip-MultiProcessor (CMP) technologies, appear to be the next step in the evolution of processor design. SMT allows more than one thread to be in the processor using any left-over resources. It thus benefits code which has reached its ILP scalability limits. CMP is a much more efficient form of SMP since data does not have to cross chip boundaries, something which is relatively very costly compared to accessing on-chip memories. It is expected that SMT will not only increase the utilization of idle functional units but also make the ratio of Flop/Watt more favorable as more functional units carry out useful work per clock cycle, thus requiring fewer clock cycles to finish a computation. It is not surprising that all major microprocessor vendors have already produced successful SMT data-paths and proceed along this path by increasing the degree of SMT concurrency further and further. For instance, the upcoming Power7 processor has 8 cores on a die, each capable of four-way SMT. Intel's Nehalem EP processor is four-core with two-way SMT and the upcoming EX is anticipated to be 8-core system with SMT.

In summary, SMT benefits code with instructions experiencing longer than average clocks-per-instruction rates. For instance, when one hardware thread stalls on a memory miss, the 2nd one can start using an idle FP or Int unit. However, SMT is not a panacea and may not benefit all types of codes. There have been cases where code experiences a relatively mild slow-down. Code with heavy usage of the floating point units or the L1 and L2 caches most likely wont benefit from SMT. Compilers and operating systems must now take into consideration these new parameters if SMT is to be used beneficially. Even though SMT is not a new area, more investigation is necessary to make its deployment more readily usable.

Memory Hierarchy in Power5 Processors

Power5+ Address Translation Process

Power5+ chips support translation from a 64-bit "effective address" (EA) into a 65-bit virtual address (VA) and then to a 50-bit "real address" (RA). EAs are logical addresses which applications use as their code executes, whereas, RAs are actual addresses the memory hardware uses to identify specific memory locations.

ERATs Once translated, the (EA, RA) pair is stored in one of the two first-level translation tables called Effective-to-Real Address Translation (ERAT) tables. There are two ERAT caches: the I-ERAT (for instruction address translation) and the D-ERAT (for data address translation). The I-ERAT is a 128-entry, 2-way set associative translation cache that uses a FIFO-based replacement algorithm. The D-ERAT is a 128-entry, fully associative translation cache that uses a binary LRU replacement algorithm. ERATs are limited resources and when the system cannot find an address translation in an ERAT (ERAT miss) it tries to locate it in a TLB.

TLBs and SLBs The processor architecture specifies a Translation Lookaside Buffer (TLB) and a Segment Lookaside Buffer (SLB) to translate the EA used by software to a RA (physical address) used by the hardware. Each Power5+ processor core contains a unified, 2048 entry, 4-way set associative TLB. The TLB is a cache of recently accessed page table entries that describe the pages of memory. The SLB and TLB are only used if the ERATs fail to find the needed mapping. If both SLB and TLB fail as well, the in-memory page tables are being walked to get the mapping. This is a very expensive operation, and if it happens frequently it may highly inflate the average number of cycles per instruction (CPI) for the applications.

VM Page Sizes Supported For Power5+, each entry in the ERATs provides EA to RE translation for a virtual memory page. VM page sizes supported are 4 KiB, 64 KiB, 16 MiB and 16 GiB. 4KiB page sizes are the default for user code and data. 64KiB may be freely used when an executable chooses to do so at launch time. 16MiB and 16GiB pages can on be accessed by specially authorized applications and require pre-allocation at system boot time. Experience by users in various supercomputer centers with different page sizes recommends that 64KiB page sizes is used for data and stack segments. The 64KiB page size provides the best trade-off between memory access times and internal memory fragmentation. SC strongly recommends that code running on the system leverages 64KiB size pages for data areas, especially if the code experiences high spatial locality in memory access.

It should be clear that TLB misses are very costly in terms of clock cycles and applications which minimize TLB miss rates are more efficient. One easy way to minimize the TLB miss rates is by employing 64 KiB or even 16 MiB pages whenever appropriate. For instance, if only 4KiB pages are used the "TLB coverage" is 8MiBs. With all 64KiB pages used the TLB can cover up to 128MiB. However, modern scientific and engineering workloads have a "Working Set" size which is usually much larger than that, and thus could easily suffer from high TLB miss rates. For those instances the 16MiB pages would yield better performance results as they could cover up-to 32GiBs of memory.

Unfortunately, with AIX using 16MiB pages is not all that flexible and experience using different page sizes strongly suggests that 64KiB size pages provide the best compromise between performance and flexibility. For a good discussion on how to use multiple page sizes in AIX code see the following IBM WhitePaper " Guide to Multiple Page Size Support on AIX 5L Version 5.3"

Processor-Memory Architecture

The Power5+ architecture includes a number of improvements over the well-established Power4 one. Fig. 5 illustrates the processor memory interface of the Power4 on the left and Power5 on the right side. An important difference between Power4 and Power5 processors is that on Power5, the L3 cache and its control logic have been moved out of the critical path of communication between processor and main memory. This has decreased the latency in all processor-memory operations and has sped up coherence operations for the cache memories. In Power5, L3 acts as a "victim cache" for all cache blocks which need to be replaced in L2 and it is not inclusive of the L2 cache blocks. With large L3 victim cache memories, "past" working sets of executing threads can temporarily migrate to L3 and then brought back into L2 and L1 when the thread starts reusing them. This is much faster than letting these cache blocks get written out to DRAM and then read back in at a later time. Under Power4 all blocks have to first get written to L3 and then to DRAM which created in some cases a bottleneck for memory access.

Power4 Processor/Memory Interface Power5 Processor/Memory Interface

Fig. 5 Architectural differences in Processor-Memory interface between Power4 and Power5 based systems.

Cache-Memory Structures

Power5+ microprocessors contain up to three levels of cache hierarchy. Fig. 1 shows the interconnections among the various cache levels and main memory. Refer to Fig. 2 part (a), for the L1 caches on the core.

L1 Cache At Level 1 (L1), separate instruction and data caches are part of the Power5+ core. This style is called a "Harvard" cache organization. The instruction and the data cache are 64 KiB and 32 KiB, respectively, in size and both have a block size of 128 bytes. In SMT mode, the caches are shared by the two hardware threads running in the core. The instruction and the data caches have 2-way and 4-way set associative organization, respectively, and employ an LRU cache-replacement policy. For both caches, the bus width from the L2 cache is 256 bits (32 bytes). It thus takes a minimum of four "beats" (clock cycles) to retrieve a 128-byte cache line from L2 cache. The data cache has a "write-through" policy, where updates to an L1 block are also propagated to L2 cache, by a separate 8 byte data bus. The assumption is that a core has to read more bytes for instructions and data than has to write out to memory. Access time for both L1-I and L1-D caches is 2 clock cycles.

L2 Cache The Level 2 (L2) cache is a 10-way set associative, unified cache (for both instructions and data) and it is shared by both cores on the Power5 chip. L2 cache consists of 128 byte cache blocks and maintains full hardware memory coherency within the SMP system. It is able to supply modified data to cores on other Power5 processor chips and I/O devices. The L2 is a write-back ("copy-back") cache, as it does not propagate immediately changes to the next levels in the memory hierarchy such as L3 or system memory. The L2 cache responds to other processors and I/O devices requesting any modified data that it currently has. The L2 cache is fully inclusive of the L1 instruction and data caches located in the two processor cores on one Power5 chip. The L2 is a total of 1.9 MB and is physically partitioned into three symmetrical slices with each slice holding 640 KiB of instructions or data. Each slice consists of 512 associative sets and has a separate L2 cache controller.

Either processor core of the chip can independently access each L2 controller and slice. Each slice has a 16 byte wide "castout", "intervention" and "push" bus to the fabric bus controller which operates at 950 MHz (at half of the core's frequency). Access time for L2 cache is 12 clock cycles.

L3 Cache The Level 3 (L3) cache is a unified, 12-way set associative, 36 MiB cache shared by both cores on the Power5+ processor chip. In the Power5 architecture, the L3 is a non-inclusive, and a victim cache of the L2. Non-inclusive means that the same cache line will never reside in both L2 and L3 caches simultaneously. L3 is a victim cache because it receives valid but modified cache lines, which are cast out from the L2, after being replaced by the L2 LRU replacement algorithm. The L3 has a write-back policy and also implements LRU block replacement.

The victim cache design permits the Power5 to satisfy L2 cache misses more efficiently with hits on the off-chip L3 cache, thus avoiding costly coherent traffic on the inter-chip fabric. A L2 cache miss lets the system check the L3 cache before sending requests onto the inter-chip fabric. The L3 operates "on the side", with separate 128-bit (16-byte) data buses for reads and writes that operate at 950MHz (1/2 of the processor clock).

The L3 cache is implemented off-chip as a separate "Merged Logic DRAM" (MLD) cache chip and like L2, it consists of three symmetric slices of size 12MiB each. The L3 cache block size is 256 bytes. Note However, that the L3 cache directory and control reside on the Power5 processor chip itself. This minimizes the delays in the processor to check the L3 directory after an L2 miss, avoiding off-chip access times. The L3 maintains full memory coherency with the rest of the system and can supply "intervention" data to cores on other Power5 processor chips. Access time for the L3 cache is 80 clock cycles. Peak bi-directional bandwidth between each Power5+ chip and its L3 cache is 30.4 GiB/sec, or 243.2 GiB/sec aggregate for the 8 DCM p5-575.

Memory Access Enhancements

Data Pre-fetching Power5-based systems use hardware (or software) to pre-fetch data into the L1 data cache. Load instructions missing ascending or descending sequential cache lines trigger the pre-fetch engine which initiates accesses to the following cache lines. Pre-fetching done at the "right time" may reduce substantially the degree of pipeline stalls caused by data not currently cached "close" to the processor at their request time. The L1 data cache pre-fetch is initiated when a load instruction references data from a new cache line. At the same time, the transfer of a line into L2 from memory is requested. Since the latency for retrieving a line of data from memory into the L2 is longer than that for moving it from L2 to the L1, the pre-fetch engine requests data from memory twelve lines ahead of the line being referenced by the load instruction. Power5 supports Eight such streams. The pre-fetch engine "ramps up" pre-fetching slowly, waiting an additional two sequential cache misses to occur before reaching the steady-state pre-fetch rate.

The Data Cache Block Touch ("dcbt") instruction, which hints the hardware that a pre-fetch stream should be installed immediately, without waiting for confirmation, supports s/w initiated pre-fetching. It now includes a field to indicate to the hardware, how many lines to pre-fetch. When the pre-specified number of lines has been pre-fetched, using the same mechanism as with hardware-initiated pre-fetching, the stream is terminated. Software-directed pre-fetching improves the performance for short streams by eliminating the initial ramp-up, and it eliminates pre-fetching unnecessary lines by specifying the last line to pre-fetch.

Pre-fetching may actually harm performance if the memory to processor path is already congested or when pre-fetching would cast-out data which are actually needed in the near future.

p5-575 Node Architecture

Each of the 52 p5-575 nodes is a cache-coherent Non-Uniform Memory Access (ccNUMA) Shared-memory Multi-Processor (SMP). A p5-575 node consists of a "compute planar" with 8 Dual-Core Power5+ Modules (DCMs) and an "I/O planar" providing the I/O capabilities and connections to the High-Performance Switch interconnect of the cluster.

Dual-Chip Module (DCM) Packaging

Dual Core Module with Power5+ cores

Fig. 6 A Dual-Core Module (DCM) with 2 Power5+ cores and a shared 36MiB L3 cache. p5-575 nodes are DCM-based.

A DCM (see Fig. 6) packages a dual-core Power5+ chip and the 36 MiB L3 cache memory at close proximity. The DCM packaging is a high performance ceramic substrate with several layers of metal interconnects. The module is strictly passive, and supports reduced signal propagation delays and increased data exchange throughput.

A DCM connects to the main memory by three independent buses, namely a 16 byte read bus, a 4 byte write and an address and control bus. These buses connect to the two "Synchronous Memory Interface" (SMI) chips which are the interface to the DDR2 memory of the SDRAM. The SMI chips contain buffers to match the differences in interface widths between the controller side and the DIMMs. Each SMI has two DDR2 8-byte channel ports and up to two DDR2 533MHz chip can be attached to each channel. The four DDR2 chips per SMI are called a "quad". The memory controller logic straddles two clock domains which operate asynchronously to each other. The buses from the SMIs to the memory controller operate at twice the speed of the DDR2 channels. We say that for each DCM the "ideal" total, read and write memory bandwidth is 21.12 GB/s, 16.896 GB/s and 4.224 GB/s, respectively. This per core translates to approximately 10.56 GB/s which is not bad at all for a system which was implemented in the early 2006 time-frame.

Each DCM connects to other DCMs through two "fabric buses", the "SMP Fabric Bus" and the "Vertical Fabric Bus" as shown in Fig. 6. Fabric buses carry coherent data traffic among DCMs. Finally, there is one I/O bus called GX+ bus which carries data traffic from the processor memory to the I/O subsystem. All buses, including the data channels from the memory and L3 controllers attach attach to the "Enhanced Distributed Switch" (EDS) which manages the traffic within the DCM and between DCM. More on this shortly.

p5-575 SMP Nodes Compute Plane

A typical "cache coherent, Non-Uniform Memory Access (cc-NUMA)" Shared Memory Multiprocessor (SMP) is shown in Fig. 7 (a). In cc-NUMA SMPs main memory is partitioned into physically distinct modules with independent access paths to each module. The processors directly attached to a memory module experience lower latency and usually higher bandwidth than processors attached to "remote" memory modules. This architecture is sometimes called "distributed shared memory". Ideally, each processors is "mostly" processing data found in its directly attached memory module and only occasionally accesses data located in remote modules. "Dense Parallel" algorithms are amenable to this type of computation. Memory itself remains cache-coherent at the memory cache block levels.

The main advantage of a cc-NUMA is scalability in memory bandwidth when data access is evenly distributed across memory modules. Unfortunately, with cc-NUMA cache coherence and memory consistency becomes a very challenging problem. When processors access memory blocks in remote modules, access latency and bandwidth are bounded by that of the high-speed cache coherent interconnect. This non-uniformity in local vs. remote memory access has lead these architectures to be called "cc-NUMA".

Another disadvantage is that cache coherence mechanisms are very difficult to scale with system size. Processors which update the same memory block force the memory system to shuttle this block back and forth among modules. Keeping a memory block consistent with the "latest" written data, creates basically serialization in the memory access and it may easily force parallel code to lose a substantial portion of its available concurrency. Application and system developers are now working extra hard to ensure that their code and the system avoids serialization phenomena which basically defeat the objectives of parallelism.

Typical ccNUMA SMP Node View

Fig. 7 (a) A high-level view of a cache coherent Non-Uniform Access (ccNUMA) SMP node. p5-575 Node View

Fig. 7 (b) A 16-way p5-575 node (8 DCMs with 2 Power5+ cores per DCM).

Eight DCMs connect together via the Distributed Bus Fabric (DBF), to form a 16-Way ccNUMA SMP (see Fig. 7 (b)).

We can see in Fig. 7 (b) that each DCM through its associated Integrated Memory Controller (IMC), directly attaches and manages a physically separate DRAM module. As expected, the latency to access data belonging to a DRAM attached to the same DCM as the core which makes the request is shorter than the latency to access data from a remote DRAM module. Latency numbers reported indicate that local memory access requires approximately 90 nano-secs, whereas remote memory access requires approximately 200+ nano-secs.

Note that memory is allocated to applications at the virtual page level. p5-575 supports currently two page allocation modes: round-robin which allocates the next page from the "next" DRAM module and first-touch which attempts to allocate the page from the DRAM which is attached on the same DCM as the core which makes the request. There are advantages and disadvantages with either method. AIX kernel supports the concept of "resource sets" which may include processors and/or memory modules. One may define arbitrary resource sets which can draw memory from associated DRAM modules only. However, this approach requires careful thinking as performance can quickly deteriorate when the wrong choices are made.

Distributed Bus Fabric

The DBF consists of the collection of SMP and Fabric point-to-point buses which functions as a distributed switch interconnect. The DBF provides cache-coherence and high-speed data exchange among the processor cores. Each bus is 8-bytes wide and operates at 950 MHz, that is at 1/2 of the processor's clock. All inter-DCM buses attach to the EDS, which is also called "Bus Fabric Controller" (BFC). The fabric buses form a 2D interconnect. Each point-to-point bus can transfer data at 7.6 GB/sec per direction. Referring to Fig. 7 (b), the collection of SMP buses (blue color) form two loops: one running through the top and the other through the bottom four DCMs. The vertical fabric buses connect directly DCM pairs together and shorten the "hop-count" distance the messages have to travel. Nevertheless, the fabric bus throughput (7.6 GB/sec) is an absolute upper bound on the data transfer rates from remote memory DRAM modules to a core. Note that h/w and s/w pre-fetching, done at the right time can alleviate this bandwidth limitation.

Fabric Bus Controller (FBC)

The FBC component primarily buffers and sequences operations among the L2 and L3 caches, the functional units of the memory subsystem , the fabric buses that interconnect multiple DCMs and the data transfer to the I/O subsystem through the GX+ bus. The fabric bus has separate address and data buses that run independently to allow "split transactions", which are tagged to allow out-of-order replies and asynchronous operations.

Cache Coherence Protocol and Memory Transactions

The p5-575 is employing a non-blocking, broadcast-based variation of the MESI cache coherence protocol to maintain memory consistency. Performance of Power5 processor cache coherence protocol lies in the combination of the weakly ordered IBM PowerPC processor storage architecture, high-bandwidth SMP interfaces, low-latency intervention, and optimized locking primitives. The protocol consists of a set of request, response, and notification types of messages.

Coherence requests are broadcast to such units as caches as well as memory and I/O controllers. Upon snooping the requests, the units provide responses, which are aggregated to form the basis for a coherence transfer decision. The notification of the decision is subsequently broadcast to the units and the actual data transfer is effected.

Address Bus

Initially, an L3 cache miss causes the Bus Fabric Controller (BFC) to broadcast the address of the target data block to the entire SMP system, over the "Address Bus" (AB). Every address transfer uses the AB for 4 processor cycles. Address messages include a 50-bit real address, an address tag, transfer type, and other relevant data that uniquely identifies each address as it appears on the bus. The DBF broadcasts addresses from DCM to DCM using a ring structure. The address propagates serially through all eight DCMs in a 64-processor system. Each chip on the ring that initiates the address (or receives it from another upstream DCM) is responsible for forwarding the address to the next in the ring DCM. Once the originating DCM receives the address it transmitted, it does not continue propagating it.

Response Bus

The inter-DCM Response Buses (RBs) run at half the processor frequency. A RB is essentially a delayed version of the AB. It carries information related to the cache coherency actions that must be taken after the memory subsystem units on each processor chip have snooped (checked the state of) the address with their data queues and directories. The response phase follows the address phase after a fixed number of cycles on the same set of buses. That is, once a caching location snoops an address, it must return a response to the source chip or the chip on the source ring in a fixed number of cycles, where the actual number of cycles depends on system size. Once a DCM receives a response from the upstream DCM, it combines these responses with its own response and forwards this collective response the next DCM in the ring for further aggregation.

When the originating DCM receives the aggregated snoop response for the address it initiated, it generates a combined response that it then broadcasts throughout the system just like with an address phase. The combined response prescribes the ultimate action to be taken on the corresponding address. For example, what state the originator can now go to, which DCM should supply the data to the originator for a read operation, or cache block invalidations to be performed in the background.

Data Intervention

To reduce cache-to-cache transfer latency, Power5 systems support the notion of an early combined response mechanism, which allows a remote DCM to send a block from its L2 (or L3) cache in shared state to the requesting processor soon after it receives the address. The early combined response is determined by looking at one owns DCM response with the aggregated snoop responses of all previous DCMs in the ring, which chooses the first coherent snooper that can supply the read data. This part of the protocol allows the initiation of the data-transfer phase by a coherent snooper soon after it receives the address. When such a coherent snooper is found, the early combined snoop response sent to the downstream snoopers also indicates that the source of data has already been found.

Data Bus and Phase

Like the address bus, the data buses on the DCM also run at 1/2 of the processor frequency. As Fig. 7 shows, DCM pairs connect by the vertical (green) data buses. On a p5-575 there are four processor chips connecting together via four separate rings. These vertical busts shorten the hop count distance of the data transfers among DCMs.

The DB services

Details of the Power5 Cache Coherence (CC) Protocol

The strength of the Power5 CC protocol lies in its asynchronous nature which permits distributed operation and control. The protocol itself is an extension of the widely known MESI protocol enhanced with a total of nine associated cache states.

However, broadcast-based snooping protocols suffer from a basic disadvantage. Coherence traffic and the associated bandwidth requirements grow by ON2 ), where N is the system size. Alternatives to broadcast "snoopy" protocols include "directory base" cache coherence. These schemes are popular because they localize broadcasts to smaller subset of the nodes which maintain directories indicating when regions of memory owned by a given node are cached by other nodes. This can greatly reduce coherence traffic flow outside the node.

High-Performance Switch Cluster Interconnect

The powerful Power5+ 575 nodes form high-performance clusters with an equally high-performance interconnect fabric, called the "High-Performance Switch" (HPS). At TAMU, 48 nodes of the Hydra's cluster connect together through two HPS planes (see Fig. 8). The current HPS employs the fourth generation technology in host adapter (see Fig. 6), switch fabric (see Fig. 8), and transmission links.

HPS Hydra Illustration

Fig. 8 The 48 node Hydra p5-575 cluster with only one of the HPS planes shown.

One significant advantage of the Power5+ architecture in the p5-575 implementation is the fact that the High-Performance Switch adapters directly attach via the GX+ bus on the processor-memory interconnect, bypassing expensive I/O bridging logic (see Fig. 7). The p5-575 SMP employs two such GX+ buses to support the high bandwidth that is needed by the two HPS switch ports. Each GX+ bus has more bandwidth than necessary to support one of the HPS links on the host.

SNI the Host Adapter to HPS

SNI adapter

Fig. 9 Logical view of the HPS host adapter (SNI), attaching directly to two GX+ bus connections on the host side and two Switch Ports for the two HPS links, one to each of the two HPS switch planes.

Each p5-575 node connects to the two HPS fabric planes through a host-side switch adapter, called the Switch Network Interface (SNI), shown in Fig. 9. Each SNI has two full-duplex HPS transmission links, one for each switch plane installed at TAMU. On the p5-575 side, each SNI attaches directly to two of its GX+ buses. Each GX+ is a full-duplex I/O bus with 4 bytes per direction, which directly attaches to one Power5+ DCM (see Fig. 7), and runs at 1/3 of the processor clock. The nominal bi-directional throughput of a GX+ bus is rated at 5.067 Gibytes/sec (2.54Gibytes/sec per direction). Notice in Fig. 6, there is one GX+ bus per DCM. For the dual=plane HPS case, the GX+ buses of two DCMs participate in the transmission of HPS messages. Each SNI consists of special ASICs which "off-load" the processors from handling message transmission and reception logic. These ASICs contain specialized multi-threaded communications micro-processor (called Inter-partition Communications processors) which supports several processor off-load capabilities to facilitate high-speed, low-latency, concurrent access to local and remote memories.

One of the interesting design features of the SNI is that it can directly map user (application memory) for direct access, which can avoid expensive user to system memory intermediate data copies whenever possible. Each SNI supports native Remote Direct Memory Access (RDMA) functionality which allows access to memory belonging to remote nodes with little participation by the target node. One-sided messaging (remote put or get) are also supported by h/w.

Each HPS link uses copper media and allows full-duplex communication, with a nominal throughput of 2 Gibytes/sec per direction. The raw throughput required by the two HPS link necessitated the connection of each SNI to two GX+ buses. The fact that the SNI adapter directly attaches to GX+ buses allows low latency access to the node's cache and main memories.

HPS Switch-Board 16X16 Fabric

Internal diagram of HPS switchboard

Fig. 10 The 16X16 HPS Switchboard. Each one of the six TAMU switches is like this one.

A HPS "switch-board" is a bi-directional, 16X16 non-blocking switching fabric (see Fig. 10) which connects 16 Power5 host SNI ports together. A switch-board consists of 8 "Switch-Chips" (SCs) similar to the one shown in Fig. 11.

Switch chip for HPS

Fig. 11 Logical view of the Switch Chip: 8X8 cross-bar with virtual output queuing and 32Kibytes central buffer.

Each 16X16 HPS switch-board is a Bi-directional Multi-Stage Interconnection Network ("BMIN") or a "Fat-Tree". At TAMU, 48 p5-575 nodes are connected together with three HPS switch-boards in a "triangular" topology as shown in Fig. 12. This particular connectivity is necessary to maintain the Fat-Tree topology for systems with more than 16 host ports. Note that Fig. 12 shows one of the two HPS planes installed at TAMU. Each transmission links is bi-directional and can carry data at 2 Gibytes / sec per direction.

HPS inter-switch connectivity

Fig. 12 (a) Schematic detail of the "triangular" HPS connection that makes up one HPS plane. Each one of the three switch-boards connects through 8 HPS links to each of the other two switch-boards. The schematic shows only one of the two planes that TAMU has installed.

HPS node-to-switch connectivity

Fig. 12 (b) Schematic detail of the p5-575 node to switch connectivity. Note that at TAMU special queues (such as, 'cs_group' or 'geo_group') have been assigned to groups of nodes in a way which minimizes the number of "hops" to the 2nd Switch Chip on a switchboard. The mapping on the 2nd plane is identical.

It is clear from Fig. 12 that, even in the absence of path contention, there is different latency between different endpoint node pairs. At TAMU we have made special effort to carve out groups of nodes which are related "close" together with respect to SC hops and bandwidth. For instance, cs_group and geo_group LL job classes are allocated to nodes f2n2, f2n4, f2n6, f2n8, f2n5, f2n7, f2n1, f2n3, f2n10 and f4n6. It can bee seen that MPI code traverses a minimal number of "hops" (or links) among possible end-point nodes. Had we allocated these LL class node groups in a hap-hazard fashion MPI code could suffer unnecessary delays from paths with different lengths and un-even available bandwidth due to contention with irrelevant applications. Another interesting point is that even though there are 16 host end-points attached to the host-side of a switch-board, the cross-section bandwidth between any pair of switchboards is only that of 8 HPS links. An MPI application allocated across different switchboards (say SB1 and SB2) may suffer from this bandwidth bottleneck. Ideally one has to balance the demand and available bandwidth across physical connections among communicating MPI processes.

Adaptive Least-Common Ancestor Routing Algorithms for HPS

SCs implement "cut-through" switching with buffering at the "Central-Queue" only for packets whose destination output port is busy. When a destination output port is idle, packet data flow directly from the input port to this output port ("cut-through switching"). Routing in the HPS is source based and it is adaptive. The source SNI specifies up to four different routes per SC for the path portion reaching the "Least Common Ancestor" (LCA) SC. As a packet travels towards the LCA, the switching SC which receives it selects one out of a predetermined set of possible output ports to forward it. Fig. 13 presents some pertinent examples of this specific routing approach. Notice that there are four distinct paths connecting endpoints (A, B), one path for (D, E) and again four paths for endpoints (C, X).

HPS routing

Fig. 13 Adaptive Least-Common Ancestor Routing examples as they apply to HPS.

At each SC (see Fig. 11), the choice of the particular output port is based on the current load of all output ports which are in the source to destination route. If all output ports are busy, the packet is stored in the SRAM of the central buffer within the SC. A SC will use (or schedule to use) the least loaded output port among the four allowed to be be used at each switching point by the source route. When the packet reaches the LCA switch chip it "turns" and travels downward towards the destination SNI. The path from the LCA to the destination is always unique and there is no adaptivity. An input port has 8Kibytes of packet space for incoming packets. A packet is divided into "flits" (for flow-control digits) and transmitted from an output port to the downstream input port. When a flit is received, the input port sends out an ACKnowledgment to the upstream output port. The output port has one flit "credit" for every flit space that is known to be available at the downstream input port. There are enough flit credits to maintain the transmission link pipeline full of flits in transit. An output port will never transmit more flits than it owns credits in order not to overflow the receiving input port. This is link-by-link "back-pressure" flow-control at the flit-level and it is common in high-speed links. Note that even though flow-control can protect the downstream input ports from overflowing, it is still possible that congestion may form within the HPS fabric. Simply speaking, if two ports request the same output port in a SC, the storage for this output Port in the central queue will immediately get full. When a SC gets congested, a so-called "tree-saturation" forms as the congestion propagates from that SC upstream towards all sending SNIs which have to use this path. The adaptive routing is one of the factors which mitigate the end-to-end congestion problem but it is well known that this is by no means a complete solution. It is LAPI which applies end-to-end congestion control so that the fabric will not get overly congested for lengthy periods of time. LAPI sends up to a number of un-acknowledged packets over the HPS but it will wait until these are eventually get acknowledged by the receiving LAPI sides. This is a "sliding-window" congestion control heuristic which throttles the rate of new packet injection into the HPS network until the older transient packets are removed and acknowledged by their destination LAPI sides. Source routes are pre-computed at HPS boot-up time and may change if any part of the network malfunctions. AIX continuously monitors the sanity of each SNI and HPS port and it re-generates routes in case a path stopped working properly.

The nominal throughput is 16Gibytes/sec per direction per switch-board pair. Note that Fig. 12 (a) shows a fully connected single plane switch topology with 48 p5-575 nodes. TAMU has a total of 52 nodes currently available and 48 nodes are connected to the two switch planes. The above HPS connection topology requires the 48 nodes to be grouped together in three host groups, each with 16 nodes. Each group connects to another group with 8 full-duplex copper HPS links (for each HPS plane). The supports a nominal 16 Gibytes/sec inter-group communication capacity per direction per HPS plane.

Each SC has a low-latency (59 nano-secs per packet), and it can administer automatic link-level retries, among others. There is a 0.4 micro-second latency across a switch-board. Fig. 12 (b) presents details of the p5-575 node to switch connectivity. Note that at TAMU special queues (such as, 'cs_group' or 'geo_group') have been assigned to groups of nodes in a way which minimizes the number of "hops" to the 2nd Switch Chip on a switchboard. The mapping on the 2nd plane is identical.

See also this discussion on the HPS architecture and ways to efficiently use it in this PDF.

HPS Communication Stack and POE (MPI) Code

Power5 clusters with HPS fabric, employ a complex communications protocol stack. User POE (MPI) or LAPI applications can directly use the hardware capabilities of the high-speed SNIs and the HPS by invoking the upper layer of this stack. This is a multi-layer communications s/w which is illustrated in Fig. 14. As with any complex collection of h/w and s/w protocols, discussion of the stack can take place by explaining the functions of each one of the layers. However, a deeper discussion of the HPS stack is very technically involved and far beyond the scope of this write-up. Interested readers may consult this PDF to gain a deeper insight in the inner workings of the underlying HPS stack protocols.

HPS software protocol stack

Fig. 14 The HPS software protocol stack.

HPS Stack Layering

HPS transmission links and the Link-Driver Chips

The transmission links and the Link-Driver Chips on the SNIs and on the switch side adapters, correspond to the physical layer. This layer is responsible for the conversion of binary information to signals capable for high speed digital transmission and reception by the physical links.

Hardware Abstraction Layer (HAL) and SNI Firmware

These two parts roughly correspond to a data-link layer which create, process and consume packets of a maximum size for transmission between two SNI end-points. This layer is not reliable in the sense that packets at this layer are not guaranteed to arrive in order or even arrive at all. Higher layers ensure correct packet sequencing. HAL hand-shakes directly with the SNI's device driver in AIX and with the firmware running on it. It uses pinned-down memory for network buffering.

Note that each point-to-point HPS link is capable of automatic packet retransmission of garbled packets and of credit-based flow-control. An upstream output port cannot transmit a packet unless it has a credit unit sent to it by the downstream input port. As packets are removed from the storage buffer of the input ports, credits are generated and send upstream to the output buffer. The amount of credits available on a down-stream input port is determined by the actual time it takes for signals to propagate over the physical link. Longer links require more credits to be available so that under error-free transmission, the link is always busy. Determination of the amount of available credit per input port takes place during HPS initialization.

Routing Layer

As explained previously, routing for HPS packets is source-based, and dynamic over a fixed collection of multiple paths between each pair of end-points. The route determination logic executes once at HPS initialization time, during which the HPS identifies the connected components and determines the multiple shortest paths among all connected end-points. These multi-path routes are then downloaded to the various switch-board processors and SNIs. During regular packet switching, each SC consults the routing tables to determine the candidate set of output ports to use, and then selects the output port which is "least-loaded" at the time. This spreads the traffic load over time across the four output ports of a switch-board. One undesirable effect of this logic, is that HPS packets may arrive out of order or in duplicates, in which case a higher layer has to make sure that unique packets are sent to receiver tasks in transmission order. A h/w failure signals that certain parts on the HPS network will have to be circumvented. At this stage, alternative fail-over paths are chosen as is the case with the two plane HPS fabric at TAMU.

Low-layer Application Programming Interface (LAPI)

The underlying reliable Transport Layer protocol is provided by the LAPI layer. LAPI is a "single-sided" communication protocol layer which executes mostly in user-space (hence the designation "US"). Considerable effort has been put into making LAPI an efficient communication layer that exposes all hardware capabilities of HPS to applications. LAPI principles of operations and mechanisms lie on the idea of Active Messages (AM). AM was developed in the late 1990s by several University communities as an approach to fashion communications stacks that users' applications can use with minimal overhead. The AM approach is an alternative to heavy weight stacks, such as TCP/IP code running within the kernel.

LAPI executes in the same virtual memory address space and processor context as the application code invoking it. LAPI negotiates directly with the device-driver and the micro-code of the SNI to setup data transfers on behalf of the application. The advantage of protocol stacks executing in this way includes bypassing the AIX kernel, saving expensive processor mode switch and avoiding user to kernel and back data copies. LAPI can also take advantage of the cache memory to avoid DRAM access as much as possible. Specifically, when the user places data in application send buffers, and LAPI is immediately invoked to send the data out (by some MPI or LAPI call) the protocol can read this data directly off of the L2/L3 cache memory and send them off using RDMA through the SNI. On the receiving side, the SNI places ("injects") data arriving to the destination node directly into the L2 cache. MPI application code which can access this data directly off of the L2 cache which has a valid copy of them. See Fig. 7 and Fig. 9 for an illustration of the connection of SNIs to GX+ buses on the SMP nodes. Note that at TAMU, each SNI attaches to two separate GX+ buses simultaneously in order to have full support for the 4+4 GiBytes/sec bandwidth needed by the two HPS ports per node. The notion of POE (and LoadLeveler) "SNI Affinity" can play an important role in the performance of POE code. SNI affinity ensures that the sending and receiving MPI tasks runs on processors with the GX+ buses that the SNI directly attaches.

LAPI supports three high-speed, user-space message transmission modes, which can be directly used by POE (MPI) or LAPI applications. The first one is called FIFO (or "Packet-Mode"), the second one, Remote Direct Memory Access (RDMA or "Bulk Transfer") mode and the third one is called Shared-Memory message passing. The Shared-Memory communication is enabled by default among all POE/MPI tasks which are assigned to run within the same SMP node. This mode avoids using any SNI resources at all and carries out the MPI communications using shared memory segments which are mapped to all communicating tasks.

LAPI also supports an interface with an underlying UDP transport layer such us those provided by any common networking stack. We will not be further discussing these latter two modes. However, performance results for them will also be included later for comparison purposes.

MPCI Layer

The MPCI correspond to a lower Application Layer (with respect to the communications stack). MPCI provides point-to-point communication with message matching, buffering, early arrival message handling, transfers user data to and from system buffers. This is a historical layer whose functions has been largely subsumed by LAPI.

MPI Layer

The MPI corresponds to an upper Application Layer (with respect to the communications stack). MPI enforces semantics of the standard, it decomposes collective into point-to-point communications and implements group communication semantics. It invokes MPCI.

MPI Blocking Communication Mode MPI supports blocking and non-blocking communications.

MPI Message Passing Transfer Mode Implementation MPI standard specifies four inter-task message transfer modes, for point-to-point communications, across different process address spaces, as follows.

In the "standard" send/receive data flow, which is illustrated in Fig. 18, MPI

MPI Message Flow

Fig. 18 POE/MPI Message Transmission Model.

  1. Copies user data into MPI/LAPI managed buffers.
  2. Underlying communications transport transfers data to system buffer in receiving MPI process.
Blocking send returns successfully after step (1) or (2).

MPI Message Buffering Mode ImplementationMPI uses two basic high-level protocols to implement message buffering at the end-points.

MPI's communications modes are internally implemented as follows

Load Leveler

LoadLeveler (LL) is the "Resource Manager" layer for the cluster. LL receives requests by the users to execute computation jobs on the cluster. All resources a job requires are listed in detail, along with the logic to invoke executables, within a LL job script file which a user provides as input to LL. Resources may be processors and DRAM memory on one or more SMP nodes, or HPS communications resources and specifications of amount of time these resources will be allocated to this job. LL Jobs can be "serial" or "parallel".

Upon a job submission, LL examines its resource requirements and schedules it for execution, as soon as, all of the requested resources become available. If these are presently available, LL launches the job immediately. If not all are available at submission time, the job remains in the scheduler queue in the pending state. Jobs in this queue are reordered "continuously" (i.e., upon every job submission, termination and every 300 seconds) by LL, based on an attribute called SysPriority. In short, SysPriority is a function of the cluster resources which this user has consumed in "the recent past". This is called job scheduling by Fair-Share (FS). The objectives behind FS are simple. If a user has recently used more than "his fair share" of system resources, the system throttles back the rate that resources may be consumed by him. In an ideal environment with say K users, cluster resources have to be consumed at the rate of 1/K over time. If user A's consumption over time, is > 1/K and user's B is < 1/K, LL will increase proportionally the SysPriority of B and decrease the SysPriority of A.

With LL, usage is based on total CPU computation time (that is NOT on Wall Clock time) and at TAMU, this running accumulated time is continuously "decayed" exponentially, so that the present usage diminishes to the %5 value after a period of two weeks time. This means that when a user stops consuming any CPU time after having used a total of, say, 1000 minutes, after 14 days LL will only see 0.05X1000 = 50 minutes of usage. This has the following implication. Users who submit many jobs and consume quickly their fair-share, will start receiving lower priority in the ordering of the scheduler queue. Conversely, users who have not used recently any CPU will be given much higher priority so they can get to the resources they require ahead of heavy users.

Since usually some leftover resources are available, LL will try to schedule smaller jobs to use them possibly before other jobs in the scheduler queue. This is called "BackFill" scheduling and it increases the overall system utilization.

HPS Fabric and MPI Performance

We present below our preliminary performance results on the achievable "bandwidth" of MPI message passing communication over the HPS fabric. We examined the different communication modes which are available on the 1600 cluster. We present the performance of both blocking and non-blocking MPI calls using the different POE message transmission modes and HPS options. Note that there are several tuning options for POE and LAPI which we have used (e.g., single thread, polling, non-polling modes, interrupt enabled, etc.) whose detailed explanation is beyond the scope of the present write up. Specifically, we bench-marked the following message transmission modes.

The results are reported below in three PostScript files, each emphasizing a different message size region to illustrate better the performance curves. Note that this results are preliminary and under continuous development. At a later time the results will be expanded, explained and analyzed in a formal report. However, these preliminary results could already provide some good insight on the performance capabilities and used to provide rough guidelines as to how MPI code can set the various POE/LAPI options.

DDN S2A9550 Cluster Storage and GPFS File Systems

Hydra is directly attached to 20 Tera-bytes of disk space on a high-performance Data-Direct Network's DDN A29550 RAID array (see Fig. 21). The connection to the DDN RAID is through eight 4-Gigabit/sec fibre-channel links, with two FC links hosted by each of four I/O p5-575 server nodes. On the RAID side (see Fig. 22), each logical disk (LUN) is protected with two parity disks for increased recovery capabilities, far beyond the usual N+1 RAID configurations.

HPS DDN layout

Fig. 21 The 48 p5-575 TAMU cluster with additional details for the connectivity of the four GPFS I/O server nodes to the DDN S2A9550.

DDN internal diagram

Fig. 22 Details of the DDN S2A9550 RAID array.

The Hydra cluster deploys one of the latest version of GPFS which is IBM's high-performing, highly-scalable clustered file system. Currently, /home, /usr/local, /work and /scratch are GPFS file systems available to the users of hydra and are configured with respect to different performance objectives. /work and /scratch are striped four-ways across the four GPFS I/O servers. The number of I/O servers as well as the topology of their connectivity with the rest of the systems was chosen with respect to performance. Four I/O servers can fully utilize the raw throughput capacity of the high-end DDN S2A9559 RAID. The topology of the four GPFS servers is meant to spread the file traffic evenly over the HPS paths to all nodes, as evenly as possible. One can notice in the diagram in Fig. 21 that each cluster node has the same path length distance to all four GPFS I/O servers to avoid the instance were traffic has to wait for the longest path when the four paths have unequal lengths.

I/O Performance Results for Hydra and DDN S2A9550

Performance experiments conducted after these modifications have demonstrated that now Hydra cluster is performing at (or better than) the highest published I/O throughput levels for this type of hardware and RAID Storage.

Specifically, aggregate I/O read throughput at raw device level in the I/O nodes, conducted when Hydra was under full production workload, is at ~ 2600.7 Megabytes / sec. This is very close to the "theoretical" maximum performance the vendor DDN claims that the storage array can ever achieve under ideal conditions.

Preliminary performance experiments on IBM's parallel file system (GPFS) running on hydra, have reached ~ 2323.89 Megabytes/sec sustained, which is better than the ~ 2000 Megabytes/sec. IBM has achieved with their top of the line systems under ideal conditions.

It is expected that the same experiments conducted under an idle system can reach higher levels.

Please inquire with the facility staff if you would like to have your code achieve the best possible performance with respect to GPFS I/O.

Standard Benchmark Performance Results for Power5+ and p5-575

Before the introduction of the Power6 or Power7 processor, p5-575 servers using the Power5+ processor achieved top rankings in a number of standard industry and application benchmarks. The p5-575 achieved the highest SPECfp_rate2000 measurement and the highest SPECompM2001 measurement for any 8-core server. Furthermore, both the 8-core and 16-core p5-575 systems had higher memory bandwidths than other 8-core and 16-core high-density RISC-based systems, respectively . The table below shows some standard benchmarking numbers for individual p5-575 nodes. We could say that each one of the 52 p5-575 nodes is approximately 2.5 to 3 times as fast as our former Power4-based IBM p690 (agave) system.

Multi-user Performance (AIX V5.3)

Processor(s) GHz L3 cache
Proc Mem Bandwidth SPEC (CPU2000) SPEC web99 SPEC web99 SSL
int rate int rate base fp rate fp rate base
16-core Power5+ 1.9 288 202.7 GB/sec 314 310 571 541

SPEC and LINPACK Performance (AIX V5.3)

Processor(s) GHz L3 cache
int int base fp fp base DP TPP HPC
1-core Power5+ 1.9 36 1,526 1,473 3,042 2,830 1,315 7,140
16-core Power5+ 1.9 288 111,400

Cluster Configuration Summary

Component Specifications
Total number of processors 832 Power5+ at 1.9GHz
Total physical memory 1,760 Gibytes DDR2 DRAM at 533MHz
Memory architecture within p5-575 node Cache Coherent, Almost-Uniform Memory Access (cc-AUMA)
Main Memory (DRAM) 32 GB (49 p5-575 nodes), 64 GB (3 p5 575 nodes); 64 DIMMs at 533 MHz
2 SMI II (with 4 DIMMs per SMI) at 1066MHz per SMI per Power5+ chip;
Memory architecture across p5-575 nodes Distributed-Memory cluster supporting Message-Passing communications via the High-Performance Switch for MPI and LAPI code
Operating system AIX Version 5.3 with Parallel Environment, CSM, RSCT and GPFS.
Processor type, ISA 1.9 GHz IBM Power5+ ® μ-processor; 64-bit PPC Architecture; Big Endian
Cache memories (on processor die
or on DCM/MCM package)
Level Total size Block Size Associativity Write policy Access latency
L1 (one per core) 64KiB instr 128 bytes 2 N/A 2
32KiB data 128 bytes 4 write-through 2
L2; shared by both cores 1.875MiB unified; 3 independent slices 128 bytes 10 (512 GC) write-back 12
L3; shared by both cores 39 MiB unified; 3 independent slices 512 bytes (2X128 Byte blocks) 12 write-back, victim cache for L2 80
Number of physical processors / node 16 [Fig. 1]
Size of local memory / node 32 Gigabytes DDR2 at 533MHz [Details in Fig. 7]
Number of Processors per node 16 (8 X DCMs) [DCM Fig. 7]
Number of p5-575 nodes 52
Cache coherence protocol within node Enhanced-Distributed Switch (cc-AlmostUMA)
Interconnection between p5-575 nodes for message passing IBM® High-Performance SwitchTM
  • Dual-plane, HPS triangular topology [Fig. 12]
  • Three HPS switches connected together per plane;
  • Eight HPS links per switch-pair per plane;
  • 48 p5-575 nodes connected to the HPS switches
  • Host HPS Adapter: SNI with two bi-directional ports [Fig. 9];
  • Each HPS is 16X16 with 16 host and 16 inter-switch HPS ports [Fig. 10]
  • Each HPS switchboard consists of 8 Switch Chips [Fig. 11] which are 8X8 cross-bar switching elements, with virtual output queuing and 32KiB central queue for blocked packets
  • 2 Gibytes/sec per direction per link
  • 59 nano-seconds latency per switch chip
  • path length is 1, 2 or 4 switch-chips depending on the number of switch chips the path crosses, in the origin-destination node pair.
Number of I/O nodes 4 GPFS I/O server nodes; each with two 4Gb/sec Fibre Channel adapter
I/O and PCI-X slots / node Per p5-575 node:
  • 3 X PCI-X buses (64-bit, 133 MHz),
  • 1 dual SNI port HPS adapter attached to two GX+ buses [Fig. 9]
  • 2 X 146.8 Gibytes Ultra320 SCSI disk drives
Networking Four 1-Gigabit (copper) ethernet ports per node
System disks Two 146.8 GB SCSI disks (40 nodes); Two 73.4 GB SCSI disks (12 nodes)
Disk expansion unit DDN S2A9550 20+ Terabyte RAID array (2 parity disks/LUN)


Published Literature for IBM Cluster 1600 and Power5+

  1. B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner," POWER5 System Microarchitecture," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
  2. H. M. Mathis, A. E. Mericas, J. D. McCalpin, R. J. Eickemeyer, and S. R. Kunkel, "Characterization of Simultaneous Multithreading (SMT) Efficiency in POWER5," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 555--564.
  3. H. Q. Le, W. J. Starke, J. S. Fields, F. P. O. Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, M. T. Vaden, "IBM POWER6 Microarchitecture," in IBM Journal of Research and Development, Vol. 51, Num. 6, 2007, pp. 639--662.
  4. J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le and B. Sinharoy, "POWER4 System Microarchitecture," in IBM Journal of Research and Development, Vol. 46, Num. 1, 2002, pp. 5--26.
  5. Rajeev Sivaram, Rama K. Govindaraju Peter, Hochschild Robert, Blackmore, Piyush Chaudhary, "Breaking the Connection: RDMA Deconstructed," in Proc. of the 13th IEEE Symp. on High Performance Interconnects, (HOTI'05), 2005.
  6. D. W. Victor, J. M. Ludden, R. D. Peterson, B. S. Nelson, W. K. Sharp, J. K. Hsu, B.-L. Chu, M. L. Behm, R. M. Gott, A. D. Romonosky and S. R. Farago, "Functional Verification of the POWER5 Microprocessor and POWER5 Multiprocessor Systems," in IBM Journal of Research and Development, Vol. 46, Num. 1, 2005, pp. 541--553.
  7. R. M. Gott, J. R. Baumgartner, P. Roessler and S. I. Joe, "Functional Formal Verification on Designs of pSeries Microprocessors and Communication Subsystems," in IBM Journal of Research and Development, Vol. 49, Num. 4/5, 2005, pp. 505--521.
  8. R. K Govindaraju, P. Hochschild, D. Grice, K. Gildea, R. Blackmore, C. A. Bender, C. Kim, P. Chaudhary, J. Goscinski, J. Herring, S. Martin, J. Houston, "Architecture and Early Performance of the New IBM HPS Fabric and Adapter," in Lecture Notes in Computer Science, Volume 3296, pp. 156--165, Springer-Verlag, Dec 2004.
  9. M. Banikazemi, R. K. Govindaraju, R. Blackmore, and D. K. Panda, "MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems," in IEEE Trans. Par. Distr. Systems, Vol. 12, Num. 10, pp. 1081--1093, 2001.
  10. Bulent Abali, Craig B. Stunkel, Jay Herring, Mohammad Banikazemi, Dhabaleswar K. Panda, Cevdet Aykanat and Yucel Aydogan, "Adaptive Routing on the New Switch Chip for IBM SP Systems", in Journal of Parallel and Distributed Computing, Vol. 61, Num. 9, pp. 1148--1179, Sep. 2001.
  11. G. Shah, J. Nieplocha, J. H. Mirza, C. Kim, R. J. Harrison, R. K. Govindaraju, K. J. Gildea, P. DiNicola, and C. A. Bender, "Performance and Experience with LAPI -- A New High-Performance Communication Library for the IBM RS/6000 SP," in Proc. of IEEE Combined IPPS/SPDP 1998, pp. 260--266.
  12. Y C. B. Stunkel, J. Herring, B. Abali and R. Sivaram, "A New Switch Chip for IBM RS/6000 SP Systems", in Proc. ACM/IEEE 1999 Supercomputing Conference, pp. 16, 13-18 Nov. 1999.
  13. Bulent Abali, "A Deadlock Avoidance Method for Computer Networks," in Proc. CANPC '97: First International Workshop on Communication and Architectural Support for Network-Based Parallel Computing, pp. 61--72, Springer-Verlag, 1997, London, UK.
  14. M. Snir, P. Hochschild, D. D. Frye, K. J. Gildea, "The communication software and parallel environment of the IBM SP2," IBM Systems Journal, Vol. 34, Issue 2, page 205, 1995.
  15. C. B. Stunkel, D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice, P. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao, P. R. Varker, "The SP2 High-Performance Switch," IBM Systems Journal, Vol. 34, Issue 2, 1995.
  16. C. B. Stunkel, D. G. Shea, D.G. Grice, P. H. Hochschild and M. Tsao, "The SP1 High-Performance Switch," in Proc. of Scalable High-Performance Computing Conference, pp. 150--157, May, 1994.
  17. C. B. Stunkel, D. G. Shea, B. Abali, M. M. Denneau, P. H. Hochschild, D. J. Joseph, B. J. Nathanson, M. Tsao and P. R. Varker, "Architecture and Implementation of Vulcan, in Proc. of the Inte'l Par. Proc. Symp." (IPPS94) , April 1994.

IBM RedBooks, White Papers and Technical Documents

  1. IBM WhitePaper: IBM System p5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction
  2. IBM WhitePaper: Guide to Multiple Page Size Support on AIX 5L Version 5.3
  3. IBM RedBook: An Introduction to the New IBM eServer pSeries High Performance Switch
  4. IBM WhitePaper: IBM High Performance Switch on System p5 575 Server - Performance, White Paper
  5. IBM WhitePaper: IBM eServer pSeries High Performance Switch - Tuning and Debug Guide
  6. IBM Reference Documentation, Switch Network Interface for eServer pSeries High Performance Switch Guide and Reference
  7. IBM Reference Documentation, Parallel Environment for AIX, Operation and Use, Vol. 1, Copyright IBM Corporation 1990, 2007.
  8. IBM Reference Documentation, Parallel Environment for AIX, Operation and Use, Vol. 2, Copyright IBM Corporation 1990, 2007.
  9. IBM Reference Documentation, MPI Programming Guide, Copyright IBM Corporation 1990, 2007.
  10. IBM Reference Documentation, MPI Subroutine Reference,, Copyright IBM Corporation 1990, 2007.
  11. IBM Reference Documentation: RSCT LAPI Programming Guide

Collection of IBM Documentation, Technical Manuals and Libraries



The discussions in this page relies on numerous technical sources, including, papers published in the research literature, IBM technical reports, manuals and personal communications with IBM developers and researchers. See References section above. The contents is responsibility of Michael E. Thomadakis and along with the original artwork remain (C) copyright of his and of Texas A & M University's. Figure 22 is courtesy of Data Direct Networks, Inc. Any of the contents of this page can be freely used for educational purposes, as long as the copyright notice remains visible and the original author is cited.

Disclaimer: This page is Under Construction. Visit often for corrections and additions. Contact me at miket(at)tamu.edu or at miket(at)sc.tamu.edu for corrections and additions.