ACES Support for AI and ML

AI/ML Training Programs

To support users transitioning their AI/ML workflows to the ACES computing cluster, we provided the short course "ACES: AI Technology Labs: Utilizing AI Frameworks in Jupyter Notebook". Its four sessions help new users start a machine learning project on the ACES cluster at Texas A&M High Performance Research Computing. Participants were guided through module loading with the Jupyter Lmod extension, data manipulation and visualization with Pandas and Matplotlib, linear regression and classification with Scikit-learn, and building and training a basic image classification model with a deep neural network (DNN) in Keras. For more detailed information and associated materials, please refer to Training:AI Tech Lab.

We also offer Graphcore IPU workshops that teach researchers how to convert their PyTorch and TensorFlow models to run on IPUs. For more detailed information and associated materials, please refer to Training:IPU Workshop.

PyTorch and TensorFlow modules

PyTorch and TensorFlow are two of the most widely used deep learning frameworks. They provide tools and libraries to build, train, and deploy machine learning and deep learning models. On the ACES cluster, we provide several versions of the PyTorch and TensorFlow modules to meet different user requirements. For example, the following PyTorch modules are installed on the ACES cluster:

    PyTorch/1.10.0-CUDA-11.3.1
    PyTorch/1.12.0-CUDA-11.7.0
    PyTorch/1.12.0
    PyTorch/1.12.1-CUDA-11.3.1
    PyTorch/1.12.1-CUDA-11.7.0
    PyTorch-Geometric/2.1.0-PyTorch-1.12.0-CUDA-11.7.0
    PyTorch-Lightning/1.7.7-CUDA-11.7.0
    PyTorch-Lightning/1.8.4

You can learn more about how to find and load these modules on our SW:Modules page.
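
Once one of the PyTorch modules above is loaded, a quick check from Python can confirm which build is active and whether it sees a GPU. The following is a minimal sketch, not an official ACES utility: the helper name `describe_pytorch_env` is hypothetical, and it assumes a PyTorch module has already been loaded in the current session (it reports `installed: False` otherwise).

```python
import importlib


def describe_pytorch_env():
    """Report the PyTorch build visible to this interpreter (hypothetical helper)."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # No PyTorch module is loaded in this session.
        return {"installed": False, "version": None, "cuda_build": None, "gpu_count": 0}

    return {
        "installed": True,
        "version": str(torch.__version__),
        # CUDA version PyTorch was compiled against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
        # Number of GPUs PyTorch can use on this node (0 if none are usable).
        "gpu_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
    }


if __name__ == "__main__":
    for key, value in describe_pytorch_env().items():
        print(f"{key}: {value}")
```

With a CUDA-enabled module loaded on a GPU node, `cuda_build` should correspond to the CUDA version in the module name; with a CPU-only module such as PyTorch/1.12.0, it is expected to be `None`.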

Horovod modules

Horovod is an open-source distributed deep learning framework developed by Uber Technologies. It is designed to facilitate distributed deep learning training, especially when dealing with large datasets and models. Several versions of Horovod are installed on the ACES cluster:

    Horovod/0.22.1-CUDA-11.3.1-TensorFlow-2.6.0
    Horovod/0.28.1-CUDA-11.7.0-TensorFlow-2.11.0
    Horovod/0.28.1-CUDA-11.7.0-PyTorch-1.12.1
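
As a rough illustration of how a training script might start using one of the Horovod modules with PyTorch, the sketch below initializes Horovod and reports each worker's position in the job. `hvd.init()`, `hvd.rank()`, `hvd.local_rank()`, and `hvd.size()` belong to Horovod's Python API; the helper name `horovod_worker_info` is hypothetical, and the sketch assumes a Horovod module built for PyTorch is loaded (it returns `None` otherwise).

```python
def horovod_worker_info():
    """Initialize Horovod and describe this worker (hypothetical helper)."""
    try:
        import horovod.torch as hvd
    except ImportError:
        # Horovod (with PyTorch support) is not available in this session.
        return None

    hvd.init()  # must be called before any other Horovod call
    return {
        "rank": hvd.rank(),              # global index of this worker
        "local_rank": hvd.local_rank(),  # index of this worker on its node
        "size": hvd.size(),              # total number of workers in the job
    }


if __name__ == "__main__":
    info = horovod_worker_info()
    print("Horovod not available" if info is None else info)
```

In a multi-GPU job, each worker commonly pins itself to one GPU using its local rank, for example with `torch.cuda.set_device(hvd.local_rank())`, before building the model.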

NVIDIA CUDA modules

CUDA is a parallel computing platform and application programming interface (API) created by NVIDIA. It allows developers to harness the computational power of NVIDIA GPUs (Graphics Processing Units). Several versions of CUDA are installed on the ACES cluster:

    CUDA/11.1.1
    CUDA/11.3.1
    CUDA/11.4.1
    CUDA/11.7.0
    CUDA/11.8.0
    CUDA/12.0.0
    CUDA/12.1.0
    CUDA/12.1.1
    CUDA/12.2.0
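
Many of the framework modules listed on this page carry a `-CUDA-x.y.z` suffix. The sketch below shows one way to read that suffix and find the framework modules that pair with a given CUDA module. It assumes the suffix names the CUDA toolkit the framework module was built against, which is a common module naming convention but is not stated on this page; the function names are hypothetical and the module lists are a subset copied from the lists above.

```python
import re

# Module names copied from the lists on this page (subset, for illustration).
FRAMEWORK_MODULES = [
    "PyTorch/1.12.0",
    "PyTorch/1.12.1-CUDA-11.3.1",
    "PyTorch/1.12.1-CUDA-11.7.0",
    "Horovod/0.28.1-CUDA-11.7.0-PyTorch-1.12.1",
]


def cuda_version_of(module_name):
    """Return the x.y.z CUDA version embedded in a module name, or None."""
    match = re.search(r"CUDA[-/](\d+\.\d+\.\d+)", module_name)
    return match.group(1) if match else None


def modules_for_cuda(cuda_module, candidates=FRAMEWORK_MODULES):
    """List candidate modules whose CUDA suffix matches the given CUDA module."""
    wanted = cuda_version_of(cuda_module)
    if wanted is None:
        return []
    return [name for name in candidates if cuda_version_of(name) == wanted]


if __name__ == "__main__":
    print(modules_for_cuda("CUDA/11.7.0"))
```

A module without a CUDA suffix, such as PyTorch/1.12.0, never matches, which is consistent with the assumption that it is a CPU-only build.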

Graphcore IPU user guide

The ACES cluster hosts both Graphcore Colossus and Graphcore Bow Intelligence Processing Unit (IPU) Pod16 systems. You can ssh directly into the Graphcore nodes to run your machine learning workloads.

You can learn how to use the Graphcore Colossus IPUs with examples from ACES:Graphcore_Colossus_IPU.

You can learn how to use the Graphcore Bow IPUs with examples from ACES:Graphcore_Bow_IPU.

You can run many models for various tasks on Graphcore IPUs (as shown in the figure below). Graphcore regularly updates its Model Garden.

[Figure: ACES Graphcore Model Garden]

Shared Datasets on Graphcore IPU systems

We provide popular datasets for computer vision and language modeling tasks on the Graphcore IPU systems, including ImageNet, Wikipedia, SQuAD, and others, for users to download.