====== The Lamarr GPU Cluster ======

We are building and expanding a GPU Cluster as part of the [[https://lamarr-institute.org/|Lamarr Institute]]. At the moment, the cluster consists of nine nodes with the following basic configuration:

2 AMD Epyc 7713 Processors (64 Cores * 2 Threads each => 256 logical Cores per Node, 248 available for jobs)\\
2TB System-RAM\\
28TB local SSD storage\\
8x A100 SXM4 (HGX) GPUs with 80GB GPU-Memory each, connected via NVLink (600GB/s)\\
10GbE Network\\

The nodes themselves cannot be accessed directly. We have prepared a login
node for this purpose, which you can also use as a staging area for
your code and data.

All nodes and the login node share a common data storage area
(332TB, ZFS), which is mounted as /home/ and holds the standard user
home directories. Shared datasets can be placed there and made available to
selected groups of users via standard POSIX file permissions and user
groups.

All nodes have 28TB local SSD storage available, mounted under /data. 
Please remove your data from there after you are finished with your 
calculations.
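
The staging and cleanup described above could look like the following sketch. It writes a batch script ''stage_job.sh'' that copies input data to the node-local SSD and deletes it again when the job ends. The per-user directory layout under /data, the dataset path and ''train.py'' are assumptions made for illustration, not cluster conventions.

<code bash>
# Write the batch script; submit it from the login node with: sbatch stage_job.sh
cat > stage_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --time=02:00:00
#SBATCH --gres=gpu:1

# Per-job scratch directory on the node-local SSD (assumed layout).
SCRATCH="/data/${USER}/${SLURM_JOB_ID}"
mkdir -p "$SCRATCH"

# Delete the scratch directory when the script exits, including on errors.
trap 'rm -rf "$SCRATCH"' EXIT

# Copy the input from the shared home to the local SSD and work on the copy.
cp -r "$HOME/datasets/mydata" "$SCRATCH/"
srun python "$HOME/train.py" --data "$SCRATCH/mydata"
EOF
</code>

A job that is killed hard (SIGKILL) skips the trap, so it is still worth checking /data for leftovers afterwards.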

The workload management software/scheduler is SLURM. Please refer to
its user documentation for further information:

[[https://slurm.schedmd.com/documentation.html]]

Full container support is enabled. We recommend using enroot as the
container backend for SLURM. If you need a personalized library or OS
setup around your software, containers are the way to go.

As a general rule, realistic resource allocation for your jobs is **mandatory**. You may only request the resources from the workload manager that you actually need to complete your calculations. Wasting precious machine time and making other users wait in line needlessly will lead to revocation of access to the cluster.
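
A minimal sketch of a batch script that requests only what it uses and keeps the thread count in line with the allocation. The concrete numbers, the script name ''sized_job.sh'' and ''my_program'' are hypothetical; pick values that match your own measurements.

<code bash>
# Write the batch script; submit it from the login node with: sbatch sized_job.sh
cat > sized_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
# CPU cores the program can really keep busy.
#SBATCH --cpus-per-task=8
# Main memory for the job on its node.
#SBATCH --mem=64G
# Number of GPUs.
#SBATCH --gres=gpu:1
# Expected runtime plus a safety margin.
#SBATCH --time=04:00:00

# Use exactly as many threads as SLURM allocated to this task.
export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"
srun --cpus-per-task="$SLURM_CPUS_PER_TASK" ./my_program
EOF
</code>

After a job has finished, ''sacct -j JOBID'' can help to compare requested and used resources, provided job accounting is enabled on the cluster.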

===== SLURM partitions =====

At this time, there is only one active partition, called "batch".
The time limit for jobs in this partition defaults to 3h and can be
raised to a maximum of 72h.
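
Jobs that need more than the 3h default have to request their time limit explicitly. The sketch below writes a batch script that asks for 24 hours in the batch partition; the script name ''long_job.sh'' and the payload ''long_run.py'' are placeholders.

<code bash>
# Write the batch script; submit it from the login node with: sbatch long_job.sh
cat > long_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
# Format: days-hours:minutes:seconds; must stay within the 72h maximum.
#SBATCH --time=1-00:00:00

srun python long_run.py
EOF
</code>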

===== How can I access the cluster? =====

At the moment the cluster is available to all [[https://lamarr-institute.org/about/team/|Principal Investigators]] of the [[https://lamarr-institute.org|Lamarr Institute]]. If you want to gain access, please contact us by [[gsg+lmgpu@informatik.uni-bonn.de|email]] and request an account for yourself and/or your colleagues. Please do not forget to specify the nature of your personal association with the Lamarr Institute, so that we can verify the validity of your access request.

As soon as you have received the confirmation email from us, you can access the cluster through the login node using the SSH protocol. All additional information needed for the login will be contained in this email.

===== Example of custom software environments using containers  =====

You will not get administrative privileges on the cluster to set up your personal software environment, but you can import containers that provide any libraries or frameworks you need. We tested the cluster setup using [[https://github.com/NVIDIA/enroot|enroot]] from NVIDIA, which is integrated into SLURM quite well. Any other container environment supported by [[https://github.com/NVIDIA/pyxis|pyxis]] should work as well, but we have not tested this yet.

In enroot, you first import external container images, but those images are immutable. To make persistent changes, you have to create a container from an image and start it in read/write mode.

First you import an image (e.g. ubuntu from the Docker registry):

<code>
$ enroot import -o ~/ubuntu.sqsh docker://ubuntu
</code>

Then you create a container from the image (ubuntu.sqsh):

<code>
$ enroot create --name mycontainer ~/ubuntu.sqsh
</code>

Now you start the container and install everything you need:

<code>
$ enroot start -r -w mycontainer
</code>

When you are finished, create a new sqsh image from your container:

<code>
$ enroot export --output ~/mycontainer.sqsh mycontainer
</code>

Now you can run software via SLURM inside the container like this:

<code>
$ srun --container-image ~/mycontainer.sqsh /script-in-the-container
</code>

Of course you can create your custom image (mycontainer.sqsh) on any
host you like; it does not have to be the login node. You can transfer
the .sqsh file from anywhere to the login node (e.g. via scp/sftp) and
start your container via srun as shown above.
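
The same container options can also be used in a batch script. The following sketch assumes the image ''mycontainer.sqsh'' from the steps above in your home directory and a hypothetical dataset directory. The mount option is provided by pyxis and maps a host path to a path inside the container.

<code bash>
# Write the batch script; submit it from the login node with: sbatch container_job.sh
cat > container_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

# Run the script inside the container image and make a host directory
# visible inside the container as /workspace/data (SRC:DST).
srun --container-image="$HOME/mycontainer.sqsh" \
     --container-mounts="$HOME/datasets/mydata:/workspace/data" \
     /script-in-the-container
EOF
</code>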

===== Shared Storage  =====

If you want to share larger datasets with other users on the cluster, you can request shared storage. It will be located under /home/Shared/. Please specify the usernames of all users who should be able to access this data (and any other helpful information regarding your request) in an [[gsg+lmgpu@informatik.uni-bonn.de|informal request email]]. We will then create a user group for the specified users, create the directory and set the necessary permissions for you.
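
Files that you copy into the shared directory later still need group permissions before your colleagues can read them. The helper below is a sketch of one way to do this with standard POSIX tools. The function name ''share_dataset'' and the example paths are made up for illustration, and whatever the administrators configured for your directory takes precedence.

<code bash>
# Copy a dataset into a shared directory and make it readable for the group.
# Arguments: source directory, target directory below the shared area.
share_dataset() {
    src=$1
    dst=$2
    mkdir -p "$dst"
    # Copy the contents of the source directory, including hidden files.
    cp -R "$src"/. "$dst"/
    # g+rX: the group may read files and enter directories;
    # o-rwx: users outside the group get no access.
    chmod -R g+rX,o-rwx "$dst"
    # setgid on directories, so that new files inherit the directory's group.
    find "$dst" -type d -exec chmod g+s {} +
}

# Example call (hypothetical paths):
# share_dataset "$HOME/datasets/mydata" /home/Shared/myproject/mydata
</code>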

===== Requesting GPUs =====

If you want to allocate a certain number of GPUs for your job, use the parameter '--gres=gpu:COUNT' in your srun/sbatch calls, where COUNT is the requested number of GPUs for your job.
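
As an illustration, the following sketch writes a short batch script that requests two GPUs and prints which devices the job can see. The script name ''gpu_job.sh'' is arbitrary, and the ''CUDA_VISIBLE_DEVICES'' variable is only set if the cluster's GPU configuration exports it, which is the usual case but an assumption here.

<code bash>
# Write the batch script; submit it from the login node with: sbatch gpu_job.sh
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --gres=gpu:2
#SBATCH --time=00:10:00

# Show the GPU indices assigned to this job, then list the devices.
echo "Visible GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
srun nvidia-smi -L
EOF
</code>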

===== Email Notifications =====

If you want to receive email notifications about the status of your jobs, you can specify a recipient email address at job submission, e.g.:

<code>
$ sbatch --mail-type=ALL --mail-user=someuser@somedomain.com ...
</code>

**Please specify a full (with domain!) and valid email address here.**
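
The same options can be placed in the batch script itself. The sketch below limits notifications to the end or failure of the job; the script name ''notify_job.sh'', the address and ''my_program'' are placeholders.

<code bash>
# Write the batch script; submit it from the login node with: sbatch notify_job.sh
cat > notify_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --time=01:00:00
# Send mail only when the job ends or fails.
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=someuser@somedomain.com

srun ./my_program
EOF
</code>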


===== A Note about Interactive Jobs =====

Please remember that interactive jobs often leave allocated resources idle (especially if they are forgotten or never terminated). Think of your colleagues waiting in the queue behind you and avoid interactive jobs whenever possible.

===== Exchange with other users =====

There is a (federated) Matrix chat room for all users of the Lamarr GPU Cluster. Please feel free to join:

#lmgpu-tech:matrix.informatik.uni-bonn.de