Hydra Off-line for Maintenance on 02/02/2012

The Hydra IBM cluster will be unavailable for maintenance on Thu 02/02/2012, starting as early as 9:00 AM. Interactive and batch services will be unavailable for the entire day. Even this one-day outage is the best-case scenario.

NOTE: The Hydra cluster has been operating without software or hardware support since December 2009. This makes any attempt to repair system components a best-effort task. Due to the nature and complexity of the maintenance, there is some probability that we will have to reconstitute the GPFS file systems, or even the entire Hydra cluster. We strongly advise all users to back up all of their critical data on hydra:/scratch to storage not attached to Hydra before the maintenance begins, and to ensure that their backups finish before 9:00 AM on 02/02/2012.

Given the above, there is a real possibility that the Hydra cluster will remain unavailable for several days or weeks, or even be permanently decommissioned.

The Hydra cluster will continue operating with only 3 GPFS I/O nodes (out of the regular 4) until another node can replace the 4th, failed one in a sustainable fashion. The GPFS file system is currently operating without I/O redundancy: if for any reason node f1n2 fails, GPFS /scratch and /work will become unavailable.

A number of hardware and software maintenance tasks are scheduled for that day, including:

  1. A physical power cycle of the entire Hydra hardware, to determine conclusively whether node f1n1 can be revived.

    If it is determined that f1n1 is permanently incapacitated, we will proceed with preparing an alternative node to take over the role of f1n1, including, among other tasks, replacing it as a GPFS and VSD I/O server.

    The power cycle is also expected to clear the recent HPS misbehavior, in which stale jobs were not properly releasing HPS adapter resources and subsequently stalling LL.

  2. The replacement of f1n1 by another node.

    One node from frame "f1" will be permanently removed from its current duties and installed to take over the role of f1n1 completely.

    This will require moving f1n1's connectivity to the DDN storage over to this new node, and installing special AIX software so that it can act as a host for Fibre-Channel connections.

    After the above step, the AIX I/O software stack will be reconfigured to allow the new node to serve as a VSD and GPFS I/O server. This step will require redefining several components in the s/w hierarchy without dismantling and reconstituting GPFS.

    After these steps, the rest of the cluster may be brought back up, provided we can determine that production mode can be sustained. In parallel, we will re-install the replacement node and test whether it can fully take over the role of f1n1.

    Until the replacement node is confirmed to be able to fully replace f1n1, the Hydra cluster will continue to operate with only 3 GPFS I/O nodes, as it does now.

Posted on: 6:33 PM, January 26, 2012