Hydra Cluster Tentative Availability and H/W State

Dear Hydra users, this is an update on the latest Hydra cluster RAID disk system maintenance.

We have recovered the B side of the DDN controller, whose misbehavior caused GPFS to crash on the afternoon of 09/11/2012.

We are opening Interactive and Batch processing on Hydra on a tentative basis. You are strongly advised to consider the operational stability concerns of this system, which are outlined below.

Node f3n9 suffered an irrecoverable H/W failure and shut itself down permanently on 09-10-2012 around 10:40 AM. This is the latest Hydra node to fail permanently due to electronics that have exceeded their design life-span by more than a year. This brings the total of incapacitated nodes to 7. Since ALL 7 of these nodes failed within the last 9 months, we must expect more failures sooner than anyone is hoping for.

Also recall that node hydra2.tamu.edu, which suffered H/W failures on Sun 08-05-2012, has been removed from service. This node was one of the three remaining high-speed GPFS I/O nodes serving /scratch and /work to the compute nodes. After strenuous efforts we managed to re-establish access to these file systems through the remaining two GPFS I/O nodes.

Since January 2012 we have been experiencing an accelerating rate of permanent node failures. At this time only TWO GPFS I/O nodes are operational, so GPFS will be operating without any redundancy for the high-speed file systems /work and /scratch, and with reduced performance. This implies that

  1. user applications could experience a degradation by at most a factor of 2 in file operation performance,
  2. as soon as f1n2 or f1n9 suffers any failure, GPFS, and likely the entire cluster, will immediately become unavailable for an unpredictable amount of time, possibly permanently,
  3. only one login node is now available, namely hydra1.tamu.edu, and
  4. the operation of the "interactive" queue is suspended to relieve pressure on node hydra1.tamu.edu, which is also one of the two remaining GPFS I/O nodes.

In recent days the Hydra cluster has suffered several permanent H/W failures at, unfortunately, an accelerating rate. Hydra is a highly proprietary system that has been operating without H/W and S/W support from the vendor since December 2009. As such, the resources available to carry out necessary corrective and preventive H/W maintenance are very limited. In November 2012, Hydra will have been in continuous operation for six full years. In general, Hydra should not be considered a development and computing platform for new applications.

Currently the following nodes have failed permanently:

f1n1, f1n7, f1n8, f1n10, f2n10, f3n9 and f3n10.

The following nodes have had parts of their main memory deactivated by the system itself after those parts failed:

f2n3, f3n2, f4n5, f4n8, f5n1, f5n8.

We strongly advise you to maintain copies of your valuable files on systems outside Hydra.

Posted on: 1:41 PM, September 13, 2012