Long jobs

This page describes how suitable applications requiring unusually long execution times can be run on CSD3.

Definition of long jobs

We define long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days. Continuous execution times of these lengths are normally disallowed by both non-paying and paying qualities of service (QoS), in order to achieve reasonable overall throughput.

In general it is advisable for any application running for extended periods to be able to save its progress (i.e. to checkpoint), as insurance against unexpected failures that could otherwise waste significant resources. Applications which can checkpoint are largely immune from per-job runtime limits, since they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which checkpointing is not feasible may find the scheduling features described below useful.
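For checkpointable applications, a common way to accumulate long run times within the normal wall-time limits is to chain jobs: each job resubmits its successor and then resumes from the most recent checkpoint. The following is a minimal sketch of such a script; run_my_app.sh, its --restart option and the my_app.chk checkpoint file are hypothetical placeholders for whatever your own application provides, and the partition, project and time limit should be adjusted to suit.

#!/bin/bash
#SBATCH -A YOUR_PROJECT
#SBATCH -p skylake
#SBATCH -t 12:00:00
#SBATCH -J chained-run

# Queue the next link in the chain now; afterany means it becomes
# eligible to start once this job has finished for any reason
# (including hitting its wall-time limit).
sbatch --dependency=afterany:${SLURM_JOB_ID} "$0"

# Resume from the most recent checkpoint if one exists,
# otherwise start the run from scratch.
if [ -f my_app.chk ]; then
    ./run_my_app.sh --restart my_app.chk
else
    ./run_my_app.sh
fi

Remember to cancel the pending chained job with scancel once the run has finished, otherwise the chain will continue indefinitely.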

Note on checkpointing

Note that CSD3 nodes have Berkeley Lab Checkpoint/Restart (BLCR) enabled by default. This may make it possible to checkpoint through SLURM for some applications which do not have their own support for this; however, not all jobs will work successfully with BLCR (in particular, we do not recommend trying SLURM/BLCR checkpointing with MPI jobs). Nevertheless, some non-parallel jobs may be able to use BLCR to accumulate extended run times without needing to request one of the special QoS described below.
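As a rough illustration only: on a SLURM build with the BLCR checkpoint plugin, periodic checkpoints for a serial job can typically be requested at submission time, and a checkpoint can also be triggered by hand with scontrol. The option and subcommand names below are assumptions about the installed SLURM/BLCR support rather than something documented on this page; check the sbatch and scontrol man pages on the system before relying on them.

# Ask SLURM/BLCR to write a checkpoint every 60 minutes into the given
# directory (option names assume the checkpoint/blcr plugin).
sbatch --checkpoint=60 --checkpoint-dir=$HOME/checkpoints -t 12:00:00 -p skylake -A YOUR_PROJECT my_serial_job.sh

# Trigger a checkpoint of a running job manually:
scontrol checkpoint create JOBID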

The QOSL QoS

Paying users with suitable applications may be granted access to the QOSL quality of service, which permits jobs to run for up to 7 days. Jobs submitted under this special QoS are confined to the -long variants of the usual partitions (skylake-long, knl-long, pascal-long). Users wishing to run for a long time should expect that others will be doing the same, and so be prepared to wait longer for a start time.

QOSL is implemented by three SLURM QoS definitions, one per cluster. Peta4-Skylake has cpul, which is restricted to 640 cpus per user. Peta4-KNL has knll, which is restricted to 64 nodes per user. Wilkes2-GPU has gpul, which is restricted to 32 GPUs per user.

In order to apply for access to QOSL, please email support@hpc.cam.ac.uk, detailing why this mode of usage is necessary and explaining why checkpointing is not a practical option for your application.

Submitting long jobs

Use of QOSL is tied to the -long partitions, so once access has been granted it is only necessary to specify the appropriate partition, e.g.

sbatch -t 7-0:0:0 -p skylake-long -A YOUR_PROJECT ...
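Equivalently, these options can be embedded in a job script. The following sketch assumes a single-node run on Peta4-Skylake using all 32 cores of a node; the job name, resource requests and the application command are placeholders to be adapted to your own work.

#!/bin/bash
#SBATCH -J long-run
#SBATCH -A YOUR_PROJECT
#SBATCH -p skylake-long
#SBATCH -t 7-0:0:0
#SBATCH --nodes=1
#SBATCH --ntasks=32

# Load any environment modules your application needs here.

# Placeholder application command; replace with your own executable.
srun ./my_long_application

Note that the requested resources must stay within the per-user limits of the relevant QoS (for example, 640 CPUs per user under cpul on Peta4-Skylake).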