Texas A&M Supercomputing Facility Texas A&M University

Emergency Maintenance for Hydra node (hydra2) UPDATE

Hydra node f1n10 (hydra2.tamu.edu) suffered H/W failure(s) in its "Power and Cooling" subsystem this past Sunday, 08/05/2012. Error messages from this node indicate "power supply failures", which may also point to a more pervasive I/O back-plane failure.

Node f1n10 is critical for the GPFS operation of the entire Hydra cluster. It is one of the remaining three GPFS I/O nodes serving out /scratch and /work. It is also, along with f1n9 (hydra1.tamu.edu), an interactive log-on node.

We attempted to isolate node f1n10 from the Disk RAID I/O subsystem (DDN9550) to allow a backup node to take over its GPFS I/O operations. However, as of now the DDN9550 has not allowed this transition to complete successfully.

We are working with DDN to evaluate a possible H/W malfunction along the I/O path from the cluster to the DDN storage. At this stage, the GPFS file systems served from the DDN9550 are unstable on Hydra and have been shut down.

We will make further announcements as things develop.

Posted on: 7:22 PM, August 6, 2012