En:lmgpu-info

The Lamarr GPU Cluster

We are building and expanding a GPU Cluster as part of the Lamarr Institute. At the moment, the cluster consists of seven nodes with the following basic configuration:

AMD Epyc 7713 (64 Cores)
2TB system RAM
8x NVIDIA A100 SXM4 (HGX), each with 80GB GPU memory, connected via NVLink (600GB/s)
10GbE Network

The nodes themselves cannot be accessed directly. We have prepared a login node for this purpose, which can be used as a staging area for your code and data.

All nodes and the login node share a common data storage area (332TB, cephfs), which is mounted as /home/ and holds the standard user home directories. Shared datasets can be placed there and made available to selected groups of users via standard POSIX file permissions and user groups.
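As a minimal sketch of sharing a dataset through POSIX permissions, the helper below gives one group read access to a directory tree. The function name `share_dir`, the dataset path and the group name `myproject` are hypothetical; which groups exist on the cluster is not stated here, so treat them as assumptions.

```shell
#!/bin/sh
# Sketch: share a dataset directory with one POSIX group.
# share_dir DIR GROUP gives GROUP read access to DIR and everything below it.
share_dir() {
    dir=$1
    group=$2
    mkdir -p "$dir"
    # Hand the whole tree to the group.
    chgrp -R "$group" "$dir"
    # Owner: full access. Group: read, plus traverse on directories. Others: nothing.
    chmod -R u=rwX,g=rX,o= "$dir"
    # Set the setgid bit on directories so that new files inherit the group.
    find "$dir" -type d -exec chmod g+s {} +
}

# Hypothetical names, replace with your own dataset path and an existing group:
# share_dir "$HOME/datasets/mydataset" myproject
```

Group members can only reach the dataset if every parent directory, including your home directory, is traversable for them as well.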

The workload manager and scheduler is SLURM. Please refer to its user documentation for further information:

https://slurm.schedmd.com/documentation.html

The SLURM job time limits are set to a maximum of 72h, with a default job time limit of 3h.
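Because the default limit is only 3h, longer jobs have to request their time explicitly. The following sketch writes a small batch script that asks for 24h, which is below the 72h maximum. The file name `job.sh`, the job name and the `--gres=gpu:1` request are assumptions for illustration; the GPU resource name configured on this cluster may differ.

```shell
#!/bin/sh
# Sketch: create a batch script with an explicit time limit.
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out
# Placeholder payload: list the GPUs visible to the job.
nvidia-smi
EOF

# Submit from the login node (left commented out so the sketch is safe to paste):
# sbatch job.sh
```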

Full container support is enabled. We recommend enroot as the container backend for SLURM. If you need a personalized library or OS setup around your software, containers are the way to go.

How can I access the cluster?

At the moment the cluster is available to all Principal Investigators of the Lamarr Institute. If you want to gain access, please contact us by email and request an account for you and/or your colleagues. Please do not forget to specify the nature of your personal association with the Lamarr Institute, so we can verify the validity of your access request.

As soon as you have received the confirmation email from us, you can access the cluster through the login node using the SSH protocol. All additional information needed for the login will be contained in this email.

Example of custom software environments using containers

You will not get administrative privileges on the cluster to set up your personal software environment, but you can import containers to set up any libraries or frameworks you need. We tested the cluster setup using enroot from NVIDIA, which integrates well with SLURM. Any other container environment supported by pyxis should work, too, but has not been tested by us yet.

In enroot, you first import external container images, but those images are immutable. You have to create a container and start it read/write to make persistent changes to it.

First you import an image (e.g. ubuntu from the Docker registry):

$ enroot import -o ~/ubuntu.sqsh docker://ubuntu

Then you create a container from the image (ubuntu.sqsh):

$ enroot create --name mycontainer ~/ubuntu.sqsh

Now you start the container and install everything you need:

$ enroot start -r -w mycontainer

When you are finished, create a new sqsh image from your container:

$ enroot export --output ~/mycontainer.sqsh mycontainer

Now you can run software via SLURM inside the container like this:

$ srun --container-image ~/mycontainer.sqsh /script-in-the-container
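Jobs usually also need data from the shared storage inside the container. The sketch below adds a bind mount with the pyxis option `--container-mounts` and an explicit time limit. The helper `run_job`, the `DRY_RUN` switch, the dataset directory and the `--gres=gpu:1` request are assumptions for illustration, not part of the cluster setup described here.

```shell
#!/bin/sh
# Sketch: run a script inside the exported image with a data directory mounted.
IMAGE="$HOME/mycontainer.sqsh"   # image exported in the previous step
DATA_DIR="$HOME/datasets"        # hypothetical dataset directory on the shared storage

# With DRY_RUN=1 (the default here) the command is only printed.
# Set DRY_RUN=0 on the login node to actually submit it.
run_job() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "srun $*"
    else
        srun "$@"
    fi
}

run_job --gres=gpu:1 --time=01:00:00 \
    --container-image="$IMAGE" \
    --container-mounts="$DATA_DIR:/data" \
    /script-in-the-container
```

Inside the container the dataset directory then appears under /data.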

Of course you can create your custom image (mycontainer.sqsh) on any host you like; it does not have to be lmgpu-login. You can then transfer the .sqsh file to the login node and start your container via srun as shown above.

Exchange with other users

There is a (federated) Matrix chatroom for all users of the Lamarr GPU Cluster. Please feel free to join:

#lmgpu-tech:matrix.informatik.uni-bonn.de
