SCC Pilot Details

Hardware

The SCC Pilot currently consists of 3 CPU nodes and 1 GPU node. The raw specs of the nodes are:

CPU Nodes

  • 2x AMD EPYC 7513 32-Core Processor with Hyperthreading enabled
  • 1TB RAM

GPU Node

  • 2x AMD EPYC 7313 16-Core Processor with Hyperthreading enabled
  • 250GB RAM
  • 1x NVIDIA A10

Virtualization

Because this is a test system, we need the flexibility to test out different scenarios, and we also need to run additional services (like databases, controllers, etc.). We therefore decided to fully virtualize the system using Proxmox and KVM. As a result, the system is presented to users as 8 CPU nodes with 60 cores and 400GB RAM each, and 2 GPU nodes with 30 cores and 90GB RAM each.
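If you want to see this layout from within the cluster, sinfo can list the nodes together with their core and memory counts. A minimal sketch (the exact node and partition names on the pilot are not shown here):

  # List every node with its CPU count and memory size (in MB)
  sinfo --Node --format="%N %c %m"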

Storage

The SCC Pilot uses a Ceph cluster with 150TB net capacity for storage. Currently, the only way to access the storage is via your home folder. So, everything that is stored in /home/USERNAME is stored on the Ceph cluster and thus available on all nodes.
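To check how much space the Ceph-backed home filesystem currently offers, a plain df on your home directory is enough (shown here only as an illustration):

  # Show size, used and available space of the filesystem behind your home folder
  df -h /home/USERNAME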

Small files take a long time

The SCC Pilot stores everything on HDDs, which makes the system very slow if you work with lots of small files. In particular, writing many small files is very slow. Unfortunately, there is nothing we can do about this.
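What you can do on your side is reduce the number of files the storage has to handle, for example by packing many small files into a single archive before moving them around. A minimal sketch (the paths are made up):

  # Pack a directory full of small files into one archive ...
  tar czf results.tar.gz results/
  # ... and unpack it again where you need the individual files
  tar xzf results.tar.gz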

The SCC will have a much faster storage system.

No Backup

Please be aware that there is no backup of your data. The storage is redundant, so if one hard drive breaks, your data is still there. However:
  • If you delete or modify a file, there is no way of undoing it.
  • If the filesystem gets corrupted or too many hard drives fail, there can be data loss.
Although these scenarios are unlikely, they can happen at any time. So treat your data on this storage as if it could be gone tomorrow!
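If you have data you cannot afford to lose, copy it to a machine you control. A minimal sketch using rsync (the destination host and paths are placeholders):

  # Copy an important directory from the cluster to another machine you trust
  rsync -av /home/USERNAME/important-results/ user@your-backup-host:/backups/scc-pilot/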

Slurm Partitions and QoS

Slurm has a very sophisticated way of managing priorities. In the case of the SCC Pilot, we currently use so-called "Quality of Service" (QoS) to group different job profiles so that all use cases are covered.

In particular, this is our strategy:

  • By default, provide as many resources as possible.
  • Do not let one user block the whole cluster for more than a day.
  • Provide interactive jobs.

We therefore currently use 3 QoS which differ in:

  • The maximum time a job can run (i.e. walltime)
  • The number of jobs a user can have in the QoS
  • The amount of resources a user can use in the QoS

Currently, we have these QoS:

Priority  Name         Limits                         Purpose
1         interactive  Walltime: 12 hours             For interactive jobs
                       Maximum Jobs per User: 2
2         high_prio    Walltime: 4 hours              For short, high-priority jobs
                       Maximum Cores per User: 130
3         normal       Walltime: 14 days              Default QoS
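You can also query these limits directly from Slurm's accounting database (assuming sacctmgr is readable by regular users on the pilot):

  # Show name, priority, walltime limit, per-user job limit and per-user resource limit of each QoS
  sacctmgr show qos format=Name,Priority,MaxWall,MaxJobsPU,MaxTRESPU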

Jobs in a QoS with a higher priority run earlier than jobs submitted to a QoS with a lower priority. Additionally, jobs in a higher-priority QoS can preempt jobs in a lower-priority QoS. "Preempt" means that the job is stopped and put back into the queue.

By default, all jobs are submitted to the normal QoS, so you can run as many jobs as you like for up to 2 weeks.
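A minimal batch script for the default (normal) QoS could look like the following; the job name, resource numbers and the script being run are just placeholders:

  #!/bin/bash
  #SBATCH --job-name=my_analysis
  #SBATCH --cpus-per-task=8
  #SBATCH --mem=32G
  #SBATCH --time=2-00:00:00     # 2 days, well below the 14 day walltime limit

  srun ./run_analysis.sh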

If you are annoyed by someone else doing just that and want to run short jobs (4 hours or less), you can use the high_prio QoS. However, you are going to be limited to ~50% of the cores.
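To submit to it, request the QoS in the batch script. A sketch with made-up resource numbers:

  #!/bin/bash
  #SBATCH --qos=high_prio
  #SBATCH --cpus-per-task=16
  #SBATCH --time=04:00:00       # must stay within the 4 hour walltime limit

  srun ./short_job.sh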

The interactive QoS is for jobs that you want to run immediately, like the xfce session or the VS Code tunnel.
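For example, an interactive shell on a compute node can be requested like this (the resource numbers are placeholders):

  # Ask for an interactive shell with 4 cores and 8GB RAM for up to 2 hours
  srun --qos=interactive --cpus-per-task=4 --mem=8G --time=02:00:00 --pty bash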

You can use the --qos switch with sbatch or srun to specify the QoS you want to use. Both plus-slurm and plus-slurm-matlab have corresponding parameters.
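On the command line this looks like the following (the script name is a placeholder):

  # Override the QoS on the command line instead of inside the script
  sbatch --qos=high_prio my_job.sh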