Texas A&M Supercomputing Facility Texas A&M University

Node hydra2.tamu.edu Unavailable for Interactive MPI

At the moment the alternative interactive node hydra2.tamu.edu (known as f1n10 within the cluster environment) cannot execute interactive MPI ("POE") executables. It remains available for all other normal interactive processing, including test runs of non-MPI executables.

Since 01/01/2012, when hydra.tamu.edu, the main interactive node, crashed "ungracefully" due to hardware failures, we have noticed unstable behavior of the HPS switch and the corresponding HPS adapters ("SNIs") on several nodes. This stalled POE (MPI) on those nodes, and the only way to clear the condition was to power cycle the affected nodes, as we did on 01/13/2012.

Currently hydra2.tamu.edu is showing the same symptoms as those other nodes: stale LoadLeveler (LL) jobs are occupying SNI (HPS adapter) resources, which stalls POE on hydra2.tamu.edu. The only way to clear this is to power cycle hydra2.tamu.edu.

However, hydra2.tamu.edu also serves critical GPFS file I/O and other system functions to the rest of the cluster, so it cannot simply be powered off and on.

We are in the process of transitioning the cluster to use another node for GPFS I/O and to take over the role of the failed hydra.tamu.edu. Until that process reaches the point where we shut down and reboot hydra2.tamu.edu, we cannot fix the interactive POE problem on this node.

In the meantime, please use the newly opened node hydra1.tamu.edu for normal interactive POE processing. Please see the previous announcement for details.

Posted on: 11:52 AM, January 20, 2012