Hydra Cluster Tentative Availability and H/W State

Dear Hydra users, this is an update on our attempts to resuscitate the Hydra cluster.

We are reopening Hydra on a tentative basis, but you are strongly advised to weigh carefully the operational stability concerns highlighted below.

Node hydra2.tamu.edu, which suffered H/W failures on Sunday, 08-05-2012, has been removed from service. This node was one of the three remaining high-speed GPFS I/O nodes serving out /scratch and /work to the compute nodes. After strenuous efforts we managed to re-establish access to these file systems through the remaining two GPFS I/O nodes.

We have recently experienced an accelerating rate of permanent node failures. At this time only TWO GPFS I/O nodes are operational, so GPFS is running without any redundancy and at reduced performance for the high-speed file systems /work and /scratch. This implies that

  1. user applications could experience a degradation by at most a factor of 2 in file operation performance,
  2. if either f1n2 or f1n9 suffers any failure, GPFS, and likely the entire cluster, will immediately become unavailable for an unpredictable amount of time, possibly permanently,
  3. only one login node is now available, namely hydra1.tamu.edu, and
  4. the operation of the "interactive" queue is suspended to relieve pressure on node hydra1.tamu.edu, which is also one of the two remaining GPFS I/O nodes.

In recent days the Hydra cluster has suffered several permanent H/W failures at, unfortunately, an accelerating rate. Hydra is a highly proprietary system and has been operating without H/W or S/W support from the vendor since December 2009; as such, the resources available for necessary corrective and preventive H/W maintenance are very limited. As of November 2012, Hydra will have been in continuous operation for six full years. In general, Hydra should not be considered a development or computing platform for new applications.

Currently the following nodes have failed permanently:

f1n1, f1n7, f1n8, f1n10, f2n10, f3n10.

The following nodes have had parts of their main memory deactivated automatically by the system after memory failures:

f2n3, f3n2, f4n5, f4n8, f5n1, f5n8.

We strongly advise you to maintain copies of your valuable files on systems outside Hydra.
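If you are unsure how to do this, below is a minimal sketch of one possible approach, assuming Python, tar, and scp are available on the login node and that you have SSH access to a machine of your own; the source directory, destination host, and user name are placeholders, not real systems:

    #!/usr/bin/env python
    # Minimal backup sketch: archive a directory on Hydra and copy it
    # to an external host over SSH. The source directory and the
    # destination below are placeholders -- substitute your own.
    import os
    import subprocess
    import time

    SRC = os.path.expanduser("~/scratch-results")    # placeholder source directory
    DEST = "you@your.backup.host:hydra-backups/"     # placeholder destination

    # Name the archive after today's date, e.g. hydra-backup-20120809.tar.gz
    archive = "hydra-backup-%s.tar.gz" % time.strftime("%Y%m%d")

    # Create a compressed archive of the source directory.
    subprocess.check_call(["tar", "czf", archive,
                           "-C", os.path.dirname(SRC), os.path.basename(SRC)])

    # Copy the archive off the cluster (prompts for a password unless
    # SSH keys are set up).
    subprocess.check_call(["scp", archive, DEST])
    print("Copied %s to %s" % (archive, DEST))

A plain rsync or scp of the files would serve equally well; the point is simply to keep an independent copy of your data outside the cluster.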

Posted on: 5:38 PM, August 9, 2012