Texas A&M Supercomputing Facility Texas A&M University

Hydra Cluster H/W State

Dear Hydra Users,

In recent days the Hydra cluster has suffered several permanent H/W failures. Because Hydra has been out of H/W and S/W maintenance since December 2009, we are able to carry out only a small portion of the necessary preventive or corrective H/W maintenance.

Currently the following nodes have failed permanently (and would require full replacement of their motherboards or I/O backplane):

f1n1, f1n7, f1n8, f2n10, f3n10.

The following nodes have had parts of their main memory deactivated by the system itself after those parts failed:

f2n3, f3n2, f4n5, f4n8, f5n1, f5n8.

Recently we have been receiving messages about failing components in the "Power Distribution and Control" infrastructure of Frame 1, which houses all important GPFS and cluster nodes: f1n2, f1n9 and f1n10. We have also been receiving warning messages about individual components in the nodes and the switch that will likely fail in the future unless maintained.

If the Frame 1 supporting infrastructure fails completely, the system will be completely unusable, because two of the main switches will not function at all. If any of the three GPFS/VSD I/O nodes (f1n2, f1n9, f1n10) fails, Hydra will be inaccessible for a long period of time, because a complete re-installation of GPFS and the I/O storage nodes would be required.

However, the fact that nodes keep failing makes it likely that, even after a full re-installation, the next node failure(s) would bring us back to a similarly unstable state.

In spite of H/W component deterioration, valiant effort by the staff has kept this system stable and in production for very long periods of time with minimal downtime. However, the recent rapid acceleration of complete node failures is an indication that, against everyone's hopes, this system is approaching the end of its sustainable lifetime.

Users are strongly advised to maintain copies of their valuable Hydra files on stable storage outside this system. Please let us know if you have any questions about this issue.

Posted on: 10:08 AM, July 26, 2012