Skip to main content

JobSet

JobSet workloads on AIchor are managed by JobSet, a Kubernetes-native CRD maintained that lets you run a group of Jobs as a single distributed workload. It is the best fit when an experiment is made of several different job groups (for example a master plus a few worker pools, each with its own resources) instead of a single fixed master/worker layout.

How to use

Select JobSet by setting spec.operator: jobset in your manifest. Under spec.types declare one entry per job group; the name of each entry is up to you.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
operator: "jobset"
image: "image"
command: "python3 -u main.py"

types:
master:
count: 1
resources:
cpus: 1
ramRatio: 2


worker:
count: 5
resources:
cpus: 4
ramRatio: 3
accelerators:
gpu:
count: 1
type: "gpu"
product: "NVIDIA-A100-SXM4-80GB"

worker-2:
count: 1
resources:
cpus: 4
ramRatio: 3

For the full field-by-field specification, see the Manifest Reference. For more complete examples, see the Manifest Examples.

Available types

Under spec.types any name can be set for the job groups, as long as it matches ^[a-z0-9-]+$ — lowercase letters, digits and hyphens only, and the name cannot start with a digit or hyphen. At least one job group must be defined.

Three fields control the shape of each job group:

  • count — number of jobs in this group. In the example above, worker has count: 5, meaning 5 independent Kubernetes Jobs are created for the worker group.
  • completions — number of pods inside each job. With worker.count: 5 and worker.completions: 4, the worker group runs 5 × 4 = 20 pods in total. For worker.resources.cpu: 4 and worker.resources.ramRation: 3, this setup will use 5*4=20 cpus and 5*4*3=60GB of RAM.
  • parallelisms — how many of those pods run at the same time inside each replica. With worker-2.parallelisms: 1, pods run one after the other: when one finishes, the next starts, until completions is reached.

Injected environment variables

The JobSet CRD natively injects:

JOB_COMPLETION_INDEX

AIchor also adds the following variables to every pod:

The example values below assume the master + worker + worker-2 manifest above (count of 1, 5, and 1):

VariableDescriptionExample
JOB_COMPLETION_INDEXThe pod position inside its job replica2 (pods worker-0-2 and worker-3-2)
JOB_INDEXThe replica index of the job this pod belongs to1 (pods worker-1-2 and worker-1-3)
GLOBAL_REPLICASTotal number of job replicas across all types.7 (1 master + 5 worker + 1 worker-2)
JOB_GLOBAL_INDEXIndex of this pod's job replica among all job replicas across all types, from 0 to GLOBAL_REPLICAS - 1.06
REPLICATED_JOB_NAMEThe type name this pod belongs to.master, worker, or worker-2
REPLICATED_JOB_REPLICASTotal number of job replicas for this pod's type.1 (master), 5 (worker), 1 (worker-2)

Examples

One group, one job

spec:
operator: "jobset"
command: "python3 main.py"
types:
worker:
count: 1

One Kubernetes Job is created with one pod. The pod is named worker-0-0.

time      ───────────────────────────────►

worker 0 [──────────────────────────────] worker-0-0

One group, multiple jobs

spec:
operator: "jobset"
command: "python3 main.py"
types:
worker:
count: 4

Four Kubernetes Jobs are created, each with one pod. All four pods run concurrently.

PodJob indexPod index
worker-0-000
worker-1-010
worker-2-020
worker-3-030
time      ───────────────────────────────►

worker 0 [──────────────────────────────] worker-0-0
worker 1 [──────────────────────────────] worker-1-0
worker 2 [──────────────────────────────] worker-2-0
worker 3 [──────────────────────────────] worker-3-0

Two groups

spec:
operator: "jobset"
command: "python3 main.py"
types:
master:
count: 1
worker:
count: 4

Five pods run concurrently: one master (master-0-0) and four workers (worker-0-0 through worker-3-0).

time      ───────────────────────────────►

master [──────────────────────────────] master-0-0
worker 0 [──────────────────────────────] worker-0-0
worker 1 [──────────────────────────────] worker-1-0
worker 2 [──────────────────────────────] worker-2-0
worker 3 [──────────────────────────────] worker-3-0

Two groups, multiple pods per job (2 pods per job concurrent)

spec:
operator: "jobset"
command: "python3 main.py"
types:
master:
count: 1
worker:
count: 4
completions: 2
parallelisms: 2

Nine pods run concurrently. parallelisms: 2 is required here — without it, parallelisms defaults to 1 and each job would run its pods one at a time.

GroupJobsPods per jobTotal pods
master111
worker428

Worker pod names: worker-0-0, worker-0-1, worker-1-0, worker-1-1, worker-2-0, worker-2-1, worker-3-0, worker-3-1.

time      ───────────────────────►

master [──────────────────────] master-0-0
worker 0 [──────────────────────] worker-0-0
[──────────────────────] worker-0-1
worker 1 [──────────────────────] worker-1-0
[──────────────────────] worker-1-1
worker 2 [──────────────────────] worker-2-0
[──────────────────────] worker-2-1
worker 3 [──────────────────────] worker-3-0
[──────────────────────] worker-3-1

Two groups, sequential pods per job

spec:
operator: "jobset"
command: "python3 main.py"
types:
master:
count: 1
worker:
count: 4
completions: 4
parallelisms: 1

Each worker job runs its pods one at a time: worker-N-0 completes, then worker-N-1 starts, and so on. At any point during the experiment, 5 pods are active: the master and one pod per worker job. 16 worker pods execute in total across the lifetime of the experiment (4 jobs × 4 sequential pods each).

time      ──────────────────────────────────────────────►

master [──────────────────────────────────────────]
worker 0 [──────] [──────] [──────] [──────]
worker 1 [──────] [──────] [──────] [──────]
worker 2 [──────] [──────] [──────] [──────]
worker 3 [──────] [──────] [──────] [──────]
pod 0 pod 1 pod 2 pod 3

Embarrassingly parallel (hyperparameter sweep)

No coordinator needed. Each job is fully independent and uses JOB_INDEX to select its hyperparameters.

spec:
operator: "jobset"
command: "python3 train.py"
types:
trial:
count: 16
resources:
cpus: 4
ramRatio: 2
accelerators:
gpu:
count: 1
type: "gpu"
product: "NVIDIA-A100-SXM4-80GB"

16 independent training jobs run concurrently. Each pod reads JOB_INDEX (0–15) to pick its hyperparameter configuration.

time      ───────────────────────────────►

trial 0 [──────────────────────────────]
trial 1 [──────────────────────────────]
trial 2 [──────────────────────────────]
...
trial 15 [──────────────────────────────]

Heterogeneous pools

A coordinator manages a GPU worker pool and a CPU worker pool with different resources. JobSet treats them as a single experiment.

spec:
operator: "jobset"
command: "python3 main.py"
types:
coordinator:
count: 1
resources:
cpus: 4
ramRatio: 2
gpu-worker:
count: 4
resources:
cpus: 8
ramRatio: 4
accelerators:
gpu:
count: 1
type: "gpu"
product: "NVIDIA-A100-SXM4-80GB"
cpu-worker:
count: 8
resources:
cpus: 16
ramRatio: 4

13 pods run concurrently. JOB_GLOBAL_INDEX and REPLICATED_JOB_NAME let each pod know its role and position.

time            ───────────────────────────►

coordinator 0 [──────────────────────────]
gpu-worker 0 [──────────────────────────]
gpu-worker 1 [──────────────────────────]
gpu-worker 2 [──────────────────────────]
gpu-worker 3 [──────────────────────────]
cpu-worker 0 [──────────────────────────]
cpu-worker 1 [──────────────────────────]
...
cpu-worker 7 [──────────────────────────]

Sequential batch processing

A single job processes 100 batches with up to 4 running at a time. No distributed coordination — pure throughput.

spec:
operator: "jobset"
command: "python3 process.py"
types:
processor:
count: 1
completions: 100
parallelisms: 4
resources:
cpus: 4
ramRatio: 2

One Kubernetes Job runs 100 pods in waves of 4. Each pod reads JOB_COMPLETION_INDEX (0–99) to know which batch to process.

time       ────────────────────────────────────────────────────────────────────────►

processor [p-0][p-1][p-2][p-3] [p-4][p-5][p-6][p-7] ... [p-96][p-97][p-98][p-99]

Mixed concurrency

One group runs fully parallel pods (fast pre-processing), another runs sequential pods (slower evaluation). Both groups run concurrently with each other.

spec:
operator: "jobset"
command: "python3 main.py"
types:
preprocessor:
count: 4
completions: 4
parallelisms: 4
resources:
cpus: 4
ramRatio: 2
evaluator:
count: 4
completions: 4
parallelisms: 1
resources:
cpus: 8
ramRatio: 4
accelerators:
gpu:
count: 1
type: "gpu"
product: "NVIDIA-A100-SXM4-80GB"

Preprocessor jobs run all 4 pods at once. Evaluator jobs run one pod at a time, reusing the same GPU slot for each sequential evaluation.

time            ─────────────────────────────────────►

preprocessor 0 [────][────][────][────]
preprocessor 1 [────][────][────][────]
preprocessor 2 [────][────][────][────]
preprocessor 3 [────][────][────][────]
evaluator 0 [──────] [──────] [──────] [──────]
evaluator 1 [──────] [──────] [──────] [──────]
evaluator 2 [──────] [──────] [──────] [──────]
evaluator 3 [──────] [──────] [──────] [──────]
pod 0 pod 1 pod 2 pod 3

Scheduling behaviour

Kubernetes schedules pods independently. If only one GPU is available when the experiment is submitted, one pod starts immediately; when it finishes, the next pending pod is scheduled, and so on. For embarrassingly parallel or sequential workloads this is the expected behaviour.

For synchronous workloads, this causes a problem: pods that start before their peers attempt to connect to the others and find them unavailable, consuming resources while idle or retrying.

JobSet has no native mechanism to wait for all pods to be ready before starting. See the Kubernetes Scheduling page for how to handle this, including Python examples for both a decentralized pattern (every pod waits for all others) and a coordinator pattern (workers wait for the coordinator, coordinator waits for all workers).

The 2D job/pod model

JobSet pods can be seen as a 2D array. A pod's name has the shape:

xxxxx-name-a-b-xxxxxx

Where:

  • name is the type name you chose (e.g. master, worker, worker-2).
  • a is the job index, from 0 to count - 1.
  • b is the pod index inside that job, from 0 to completions - 1.

alt text

The diagram above visualises this layout: each row corresponds to one job (a axis), each column to one pod position inside the replica (b axis).

Documentation and demo project