Scaling out¶
A single Purdue AF session is limited to 128 CPU cores and 128 GB RAM. When your analysis outgrows these resources, several options are available. This page gives an overview; detailed instructions are linked from each section.
Which method should I use?¶
| Method | Best for | Available to | Typical scale |
|---|---|---|---|
| Dask (local cluster) | Parallelizing Python code within a session | All users | up to 128 cores |
| Dask Gateway, Kubernetes backend | Distributed Python / Coffea analyses | All users | up to 200 workers (200 cores, 1.2 TB RAM) |
| Dask Gateway, Slurm backend | Distributed Python / Coffea analyses | Purdue users | hundreds of workers (Hammer / Gautschi) |
| Slurm batch jobs | Independent batch workloads, GPU jobs | Purdue users | Hammer cluster (cms account) or other Purdue Community Clusters |
| CRAB | CMSSW (cmsRun) jobs, MC generation, skimming |
All CMS users | the entire WLCG |
Dask¶
Dask is an open-source library for parallel computing in Python. It can be used to quickly parallelize any Python code, or implicitly as a backend in frameworks such as Coffea and RDataFrame.
- A local Dask cluster parallelizes your code over the cores of your own session — no extra setup required.
- Dask Gateway scales beyond the session, submitting workers either as Kubernetes pods on the Geddes cluster (all users), or as Slurm jobs on the Hammer or Gautschi clusters (Purdue users only). Note that each user can have at most one active Dask Gateway cluster per gateway at a time.
Slurm (Purdue users only)¶
Slurm is a job scheduler and
workload manager that enables batch submission on Purdue computing clusters.
At Purdue AF, users with local Purdue accounts can submit jobs from the AF
terminal to the Hammer cluster, using the cms Slurm account. Users can also submit Slurm jobs at other Community Clusters after logging into them via ssh.
- Instructions for submitting Slurm jobs
- Code and data used by Slurm jobs must be stored on Depot (
/depot/cms/) —/home/and/work/are not mounted in Slurm jobs (see Storage volumes). - To request a GPU for a Slurm job, add
--gpus-per-node=1to thesbatchcommand — see GPU access.
CRAB¶
CRAB (CMS Remote Analysis Builder) is a utility to submit CMSSW jobs to distributed computing resources. CRAB allows you to:
- access Data and Monte Carlo datasets stored at any CMS computing site worldwide;
- exploit the CPU and storage resources of CMS computing sites via the Worldwide LHC Computing Grid (WLCG).
CRAB is suitable for running most CMSSW framework jobs (i.e. jobs launched via the
cmsRun command). It is recommended for computationally intensive workloads such
as Monte Carlo generation or "skimming" AOD / MiniAOD datasets.
- Instructions for submitting CRAB jobs
- CRAB outputs are written to your Grid directory at Purdue EOS:
/eos/purdue/store/user/<cern-username>— see Storage volumes.
Monitoring your jobs¶
Slurm and Dask metrics are available in the corresponding sections of the Purdue AF monitoring dashboard. Each Dask Gateway cluster additionally gets its own Dask dashboard — see Dask Gateway monitoring.