Skip to content

Scaling out

A single Purdue AF session is limited to 128 CPU cores and 128 GB RAM. When your analysis outgrows these resources, several options are available. This page gives an overview; detailed instructions are linked from each section.

Which method should I use?

Method Best for Available to Typical scale
Dask (local cluster) Parallelizing Python code within a session All users up to 128 cores
Dask Gateway, Kubernetes backend Distributed Python / Coffea analyses All users up to 200 workers (200 cores, 1.2 TB RAM)
Dask Gateway, Slurm backend Distributed Python / Coffea analyses Purdue users hundreds of workers (Hammer / Gautschi)
Slurm batch jobs Independent batch workloads, GPU jobs Purdue users Hammer cluster (cms account) or other Purdue Community Clusters
CRAB CMSSW (cmsRun) jobs, MC generation, skimming All CMS users the entire WLCG

Dask

Dask is an open-source library for parallel computing in Python. It can be used to quickly parallelize any Python code, or implicitly as a backend in frameworks such as Coffea and RDataFrame.

  • A local Dask cluster parallelizes your code over the cores of your own session — no extra setup required.
  • Dask Gateway scales beyond the session, submitting workers either as Kubernetes pods on the Geddes cluster (all users), or as Slurm jobs on the Hammer or Gautschi clusters (Purdue users only). Note that each user can have at most one active Dask Gateway cluster per gateway at a time.

Slurm (Purdue users only)

Slurm is a job scheduler and workload manager that enables batch submission on Purdue computing clusters. At Purdue AF, users with local Purdue accounts can submit jobs from the AF terminal to the Hammer cluster, using the cms Slurm account. Users can also submit Slurm jobs at other Community Clusters after logging into them via ssh.

CRAB

CRAB (CMS Remote Analysis Builder) is a utility to submit CMSSW jobs to distributed computing resources. CRAB allows you to:

  • access Data and Monte Carlo datasets stored at any CMS computing site worldwide;
  • exploit the CPU and storage resources of CMS computing sites via the Worldwide LHC Computing Grid (WLCG).

CRAB is suitable for running most CMSSW framework jobs (i.e. jobs launched via the cmsRun command). It is recommended for computationally intensive workloads such as Monte Carlo generation or "skimming" AOD / MiniAOD datasets.

Monitoring your jobs

Slurm and Dask metrics are available in the corresponding sections of the Purdue AF monitoring dashboard. Each Dask Gateway cluster additionally gets its own Dask dashboard — see Dask Gateway monitoring.