Node Unavailable for Interactive MPI

At the moment, the alternative interactive node (f1n10 in the cluster environment) cannot execute interactive MPI ("POE") executables. It remains available for all other normal interactive processing, including test runs of non-MPI executables.

Since 01/01/2012, when the main interactive node crashed "ungracefully" due to hardware failures, we have noticed unstable behavior of the HPS switch and the corresponding HPS adapters ("SNIs") on several nodes. This stalled POE (MPI) on those nodes, and the only way to clear it was to power cycle the affected nodes, as we did on 01/13/2012.

Currently, this node is presenting the same symptoms as those other nodes: stale LL jobs are occupying SNI (HPS adapter) resources, which then stalls POE on it. The only way to clear this is by power cycling the node.

However, this node is also serving critical GPFS file I/O and other system functions to the rest of the cluster, so it cannot be physically powered off and on.

We are in the process of transitioning the cluster to use another node for GPFS I/O and to take over the role of the failed node. Until that process reaches the point where we have to shut down and reboot, we cannot fix the interactive POE problem on this node.

In the meantime, please use the newly opened node for normal interactive POE processing. Please see the previous announcement for details.

Posted on: 11:52 AM, January 20, 2012