Hardware
The SCC Pilot currently consists of 3 CPU nodes and 1 GPU node. The
raw specs of the nodes are:
CPU Nodes
- 2x AMD EPYC 7513 32-Core Processor with Hyperthreading enabled
- 1TB RAM
GPU Node
- 2x AMD EPYC 7313 16-Core Processor with Hyperthreading enabled
- 250GB RAM
- 1x NVIDIA A10
Virtualization
Because this is a test system, we need the flexibility to test out
different scenarios, and we also need to run some additional services
(like databases, controllers, etc.). We therefore decided to fully virtualize the
system using Proxmox and KVM. As a result, the system is presented to users
as 8 CPU nodes with 60 cores and 400GB RAM each, and 2 GPU nodes with 30 cores
and 90GB RAM each.
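To see how these virtual nodes appear from inside the cluster, you can ask Slurm for a node listing. A minimal example (the output columns in the second command are node name, CPU count, and memory in MB):

```bash
# node-oriented listing with full details
sinfo -N -l

# compact custom format: node name, CPUs, memory (MB)
sinfo -N -o "%N %c %m"
```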
Storage
The SCC Pilot uses a Ceph cluster with 150TB net capacity for storage.
Currently, the only way to access the storage is via your home folder:
everything stored in /home/USERNAME lives on the Ceph cluster and is
therefore available on all nodes.
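As a quick illustration of what this means in practice (the file name is just a placeholder):

```bash
# write a file into your home directory on the login node ...
echo "hello from the login node" > ~/ceph_test.txt

# ... and read it back from inside a job running on a compute node
srun cat ~/ceph_test.txt
```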
The SCC Pilot stores everything on HDDs, which makes the system very slow
when you work with lots of small files; writing many small files is
particularly slow. Unfortunately, there is nothing we can do about this.
The SCC will have a much faster storage system.
Please be aware that there is no backup of your data. The storage is redundant,
so if one hard drive breaks, your data is still there. However:
- If you delete or modify a file, there is no way to undo it.
- If the filesystem gets corrupted or too many hard drives fail, data can be lost.
Although these scenarios are unlikely, they can happen at any time.
So treat your data on this storage as if it could be gone tomorrow!
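In practice, this means you should keep a copy of anything important somewhere else, e.g. on your own machine. A sketch using rsync, run from your local computer (the hostname and paths are placeholders, not the real login node address):

```bash
# pull your results from the cluster to a local backup folder
rsync -avz USERNAME@LOGIN_NODE:/home/USERNAME/results/ ./results_backup/
```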
Slurm Partitions and QoS
Slurm has a very sophisticated way of managing priorities. On the
SCC Pilot, we currently use so-called "Quality of Service" (QoS) levels to group
different job profiles so that all use cases are covered.
In particular, our strategy is:
- By default, provide as many resources as possible.
- Do not let one user block the whole cluster for more than a day.
- Provide interactive jobs.
We therefore currently use three QoS, which differ in:
- The maximum time a job can run (i.e., the walltime)
- The number of jobs a user can have in the QoS
- The amount of resources a user can use in the QoS
Currently, we have these QoS:
| Priority | Name | Limits | Purpose |
| --- | --- | --- | --- |
| 1 | interactive | Walltime: 12 hours; maximum jobs per user: 2 | For interactive jobs |
| 2 | high_prio | Walltime: 4 hours; maximum cores per user: 130 | For short, high-priority jobs |
| 3 | normal | Walltime: 14 days | Default QoS |
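These limits are stored in Slurm's accounting database. If you want to double-check them yourself, a query along these lines should work (the exact field names can vary slightly between Slurm versions):

```bash
# list the configured QoS with their priorities and per-user limits
sacctmgr show qos format=Name,Priority,MaxWall,MaxJobsPerUser,MaxTRESPerUser
```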
Jobs in a QoS with a higher priority run earlier than those submitted to a lower-priority QoS.
Additionally, jobs in a higher-priority QoS can preempt jobs in a lower-priority QoS.
"Preempt" means that the job is stopped and put back in the queue.
By default, all jobs are submitted to the normal QoS, so you can run as many jobs
as you like for up to 2 weeks.
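A minimal batch script for the default normal QoS could look like this (job name, resources, and the program to run are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=long_run       # placeholder job name
#SBATCH --time=7-00:00:00         # up to 14 days are allowed in the normal QoS
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
# no --qos option needed: jobs go to the normal QoS by default

srun ./my_program                 # placeholder for your actual workload
```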
If you are annoyed by someone else doing just that and want to run short jobs (4 hours or less),
you can use the high_prio QoS. However, you are then limited to roughly 50% of the cores.
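To submit such a short job, add the QoS on the command line (the script name is a placeholder); the requested walltime has to stay within the 4-hour limit:

```bash
# submit a short job with higher priority
sbatch --qos=high_prio --time=04:00:00 my_job.sh
```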
The interactive QoS is for jobs that you want to start immediately, like an
xfce session or a VS Code tunnel.
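For example, an interactive shell on a compute node can be requested roughly like this (the resource numbers are placeholders):

```bash
# open an interactive shell on a compute node via the interactive QoS
srun --qos=interactive --time=02:00:00 --cpus-per-task=4 --mem=16G --pty bash
```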
You can use the --qos switch with sbatch or srun to specify the QoS you want
to use. Both plus-slurm and plus-slurm-matlab have corresponding parameters.