Manifest Examples

Complete manifest.yaml examples. For the full field-by-field specification, see the Manifest Reference.

Per execution runtimes

A minimal complete manifest for each supported framework.

Jax

See the Jax framework page for more info.

Minimal Jax training example with 2 worker pods, each with 1 GPU.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  timeout: "6h"

  types:
    worker:
      count: 2
      resources:
        cpus: 16
        memory: 64
        shmSizeGB: 32
        accelerators:
          gpu:
            count: 1
            product: "Tesla-V100-SXM3-32GB"
            type: "gpu"

JobSet

See the JobSet framework page for more info.

JobSet with a single-GPU master pod plus two CPU-only pools (worker-cpu and evaluator).

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."

spec:
  operator: "jobset"
  image: "my-image"
  command: "python examples/train.py"

  timeout: "1d"

  types:
    master:
      count: 1
      completions: 1
      parallelisms: 1
      resources:
        cpus: 16
        ramRatio: 4
        shmSizeGB: 48
        accelerators:
          gpu:
            count: 1
            product: "A100-SXM4-80GB"
            type: "gpu"
    worker-cpu:
      count: 2
      completions: 1
      parallelisms: 1
      resources:
        cpus: 4
        ramRatio: 2
    evaluator:
      count: 1
      completions: 1
      parallelisms: 1
      resources:
        cpus: 2
        memory: 8

KubeRay

See the KubeRay framework page for more info.

KubeRay cluster with a head pod and one GPU worker pool.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."

spec:
  operator: "kuberay"
  image: "my-image"
  command: "python train.py"
  rayVersion: "2.23.0" # optional

  tensorboard:
    enabled: true

  timeout: "12h"

  types:
    Head:
      resources:
        cpus: 8
        ramRatio: 2
        shmSizeGB: 16
      rayStartParams:
        object-store-memory: "1000000000"  # cap the Ray object store at ~1 GB

    Workers:
      - name: "gpu-workers"
        count: 2
        resources:
          cpus: 16
          memory: 64
          shmSizeGB: 32
          accelerators:
            gpu:
              count: 2
              product: "Tesla-V100-SXM3-32GB"
              type: "gpu"

The image referenced by this manifest must include the ray executable (from the base image or installed as a Python package) and wget. The installed ray version must match spec.rayVersion. For example, in the Dockerfile:

RUN apt-get update && apt-get install -y wget
RUN pip install ray==2.23.0  # must match spec.rayVersion

spec.rayVersion is optional; when omitted, it defaults to 2.23.0. See the KubeRay framework page for the full image requirements.
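The version requirement above can be checked at the start of the training script rather than discovered through a failing cluster. The following is a minimal sketch, not part of AIchor: the helper names installed_version and check_ray_version are hypothetical, and the expected version is assumed to be passed in by the caller, since the manifest value is not known to be exposed to the container.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_ray_version(expected: str, installed: Optional[str]) -> None:
    """Raise if the image's ray version differs from the manifest's value.

    `expected` is whatever spec.rayVersion is set to (hypothetical wiring:
    the caller supplies it, for example as a constant in the script).
    """
    if installed is None:
        raise RuntimeError("ray is not installed in this image")
    if installed != expected:
        raise RuntimeError(
            f"ray {installed} is installed but the manifest expects {expected}"
        )


# Example use at the top of train.py:
#   check_ray_version("2.23.0", installed_version("ray"))
```

Failing fast with an explicit message makes a version mismatch easier to diagnose than a cluster that never becomes ready.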

PyTorch

See the PyTorch framework page for more info.

PyTorch distributed training with 4 worker pods, each with 1 GPU.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."

spec:
  operator: "pytorch"
  image: "my-image"
  command: "python -m torch.distributed.run --nproc_per_node=1 train.py"

  tensorboard:
    enabled: true

  timeout: "8h"

  types:
    worker:
      count: 4
      resources:
        cpus: 12
        memory: 48
        shmSizeGB: 24
        accelerators:
          gpu:
            count: 1
            product: "A100-SXM4-80GB"
            type: "gpu"

XGBoost

See the XGBoost framework page for more info.

XGBoost CPU-only distributed training with 3 worker pods.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."

spec:
  operator: "xgboost"
  image: "my-image"
  command: "python train_xgboost.py"

  timeout: "2h"

  types:
    worker:
      count: 3
      resources:
        cpus: 8
        memory: 32
        shmSizeGB: 8

Scenarios

Focused snippets demonstrating individual features. The builder section is omitted for brevity unless the feature configures it. Add it as shown in the per-operator examples above.

TensorBoard

Run a TensorBoard sidecar alongside the experiment.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  tensorboard:
    enabled: true

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

To persist TensorBoard logs across runs, set tensorboard.usePVC.enabled: true and reference a PVC declared under spec.storage.attachExistingPVCs:

spec:
  tensorboard:
    enabled: true
    usePVC:
      enabled: true
      name: "my-tb-pvc"
  storage:
    attachExistingPVCs:
      - name: "my-tb-pvc"
        mountPoint: "/mnt/tensorboard"

Shared volume for sharing data between pods

Provision a shared volume mounted on every pod of the experiment.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  storage:
    sharedVolume:
      mountPoint: "/mnt/shared"
      sizeGB: 32
      storageClass: "standard-rwx"
      accessMode: "ReadWriteMany"

  types:
    worker:
      count: 2
      resources:
        cpus: 2
        ramRatio: 2

A shared volume lets multiple pods of the same experiment read and write the same data without going through a bucket. The underlying PVC is ephemeral: it is created when the experiment starts and deleted when the experiment ends, so it is suited to scratch data and intermediate results rather than artifacts that need to outlive the run. To keep data after the experiment finishes, write it to a bucket or use an attached PVC instead.

Attached PVC

Mount one or more pre-existing PVCs into the experiment pods.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  storage:
    attachExistingPVCs:
      - name: "my-dataset-pvc"
        mountPoint: "/mnt/dataset"
      - name: "my-checkpoints-pvc"
        mountPoint: "/mnt/checkpoints"

  types:
    worker:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2

Experiment-level environment variables

Inject plain, experiment-scoped environment variables (e.g. hyperparameters) without touching the shared project secret.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "pytorch"
  image: "my-image"
  command: "python train.py"

  env:
    LEARNING_RATE: "0.001"
    BATCH_SIZE: "32"
    NUM_EPOCHS: "100"

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

Values under spec.env are injected as literal environment variables, not references to the project secret. When a key is also defined in the project namespace secret, the spec.env value wins. Protected variables (AICHOR_* and other reserved names) cannot be set this way and are rejected at submission with a validation error.
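Because these values arrive as strings, the training script has to parse them. The following is a minimal sketch of how train.py might read the three variables from the manifest above; the Hyperparams class, the load_hyperparams helper, and the fallback defaults are assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyperparams:
    learning_rate: float
    batch_size: int
    num_epochs: int


def load_hyperparams(environ=os.environ) -> Hyperparams:
    """Parse hyperparameters from environment variables.

    Environment values are always strings, so each one is converted
    explicitly. The fallback defaults are illustrative assumptions.
    """
    return Hyperparams(
        learning_rate=float(environ.get("LEARNING_RATE", "0.001")),
        batch_size=int(environ.get("BATCH_SIZE", "32")),
        num_epochs=int(environ.get("NUM_EPOCHS", "100")),
    )


# Example use inside train.py:
#   hp = load_hyperparams()
#   print(hp.learning_rate, hp.batch_size, hp.num_epochs)
```

Accepting the mapping as a parameter keeps the parser testable without modifying the real process environment.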

Experiment-level configurations

Inject experiment-level configurations into all experiment pods.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "pytorch"
  image: "my-image"
  command: "python train.py"

  experimentConfig: |
    config1: this-is-a-config
    config2: this-is-another-config
    parent-config:
      nested-config: this-is-nested

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

Skip the image build

Reuse the Docker image from a previous experiment instead of rebuilding it. This is useful for hyperparameter sweeps where the code is unchanged.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."
  skipBuild:
    enabled: true
    experimentID: "3f2a1b4c-9d8e-4f7a-b6c5-1a2b3c4d5e6f"  # a previous experiment's ID
    failIfNotFound: false

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

When skipBuild.enabled is true, the image tagged with experimentID is reused and no rebuild runs. experimentID is required whenever the section is enabled. If the referenced image cannot be found, failIfNotFound controls the outcome: false (the default) falls back to a normal build, while true fails the experiment. Omitting the skipBuild section keeps the default behaviour of building the image on every submission.

Reused image with new environment variables

Combine skipBuild and spec.env to rerun a fixed image while varying hyperparameters per experiment, with no rebuild required.

kind: "AIchorManifest"
apiVersion: "0.2.3"

builder:
  image: "my-image"
  dockerfile: "./Dockerfile"
  context: "."
  skipBuild:
    enabled: true
    experimentID: "3f2a1b4c-9d8e-4f7a-b6c5-1a2b3c4d5e6f"  # a previous experiment's ID
    failIfNotFound: true

spec:
  operator: "pytorch"
  image: "my-image"
  command: "python train.py"

  env:
    LEARNING_RATE: "0.0005"
    BATCH_SIZE: "64"
    NUM_EPOCHS: "200"

  types:
    worker:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2

The image built by experiment 3f2a1b4c-... is reused as-is, so each submission only changes the hyperparameters passed through spec.env. Setting failIfNotFound: true guards the sweep against silently falling back to a fresh build if that image is ever missing.
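A sweep of this kind needs one spec.env mapping per submission. The following is a minimal sketch of generating those mappings from a grid of candidate values; env_grid is a hypothetical helper, and how each mapping is written into a manifest and submitted is left out because it depends on the local workflow.

```python
from itertools import product
from typing import Dict, List, Sequence


def env_grid(grid: Dict[str, Sequence]) -> List[Dict[str, str]]:
    """Expand a grid of candidate values into one env mapping per combination.

    Values are converted to strings, matching how they appear under spec.env.
    """
    keys = list(grid)
    return [
        {key: str(value) for key, value in zip(keys, combo)}
        for combo in product(*(grid[key] for key in keys))
    ]


# Example: 2 learning rates x 2 batch sizes -> 4 env mappings.
sweep = env_grid({
    "LEARNING_RATE": [0.001, 0.0005],
    "BATCH_SIZE": [32, 64],
})
```

Each element of sweep can replace the env block of the manifest above while the builder and skipBuild sections stay unchanged.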

Extra tolerations

Schedule a worker onto nodes carrying a custom taint (Kubernetes engines).

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2
        extraTolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "ml-team"
            effect: "NoSchedule"

Multiple GPUs

Request multiple GPUs on a single worker pod.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  types:
    worker:
      count: 1
      resources:
        cpus: 16
        ramRatio: 3
        shmSizeGB: 48
        accelerators:
          gpu:
            count: 4
            product: "Tesla-V100-SXM3-32GB"
            type: "gpu"

Heterogeneous GPU pools

Run two KubeRay worker pools side by side, each pinned to a different GPU product.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "kuberay"
  image: "my-image"
  command: "python train.py"

  types:
    Head:
      resources:
        cpus: 4
        ramRatio: 2
        shmSizeGB: 8

    Workers:
      - name: "v100-pool"
        count: 2
        resources:
          cpus: 16
          ramRatio: 3
          shmSizeGB: 32
          accelerators:
            gpu:
              count: 1
              product: "Tesla-V100-SXM3-32GB"
              type: "gpu"

      - name: "a100-pool"
        count: 1
        resources:
          cpus: 24
          ramRatio: 4
          shmSizeGB: 48
          accelerators:
            gpu:
              count: 2
              product: "A100-SXM4-80GB"
              type: "gpu"

TPU accelerators

Single-host TPU slice with Jax.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  types:
    worker:
      count: 1
      resources:
        cpus: 90
        ramRatio: 2
        accelerators:
          tpu:
            type: "tpu-v5-lite-podslice"
            topology: "2x2"
            tpuChipsCount: 4

A 2x2 v5e slice contains 2 × 2 = 4 chips. With 4 chips per VM, that resolves to 4 / 4 = 1 VM, so count: 1 requests one single-host slice. For multi-host slices use the kuberay operator. See the TPU page for more info.
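The VM arithmetic above can be written as a small function. This is a sketch of the calculation only; vm_count is a hypothetical helper and does not check whether a given topology is offered for a particular TPU type.

```python
from math import prod


def vm_count(topology: str, chips_per_vm: int) -> int:
    """Number of VMs in a slice: total chips in the topology / chips per VM."""
    total_chips = prod(int(dim) for dim in topology.split("x"))
    if total_chips % chips_per_vm != 0:
        raise ValueError(
            f"{total_chips} chips cannot be split evenly into VMs of {chips_per_vm}"
        )
    return total_chips // chips_per_vm


# The example from the text: a 2x2 topology with 4 chips per VM is one VM.
single_host = vm_count("2x2", 4)
```

A result of 1 corresponds to a single-host slice; any larger result indicates a multi-host slice.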

Spot instances (AWS / Azure Kubernetes Engines)

Schedule workers onto spot (preemptible) capacity on EKS or AKS, and checkpoint on eviction so the experiment can resume after a restart.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jobset"
  image: "my-image"
  command: "python examples/train.py"

  restartPolicy:
    backoffLimit: 5

  gracefulTermination:
    shutdownCommand: ["python", "scripts/save_checkpoint.py"]
    terminationGracePeriodSeconds: 90

  types:
    worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 2
        extraSelectors:
          karpenter.sh/capacity-type: "spot"

By default, experiments request karpenter.sh/capacity-type: on-demand.

When a spot node is reclaimed, the cloud provider sends a short interruption notice (around 120 seconds on AWS, for example) before the pod is killed. gracefulTermination.shutdownCommand runs during that window to persist a checkpoint, so terminationGracePeriodSeconds must be set below the provider's notice period to ensure the command finishes before the node is reclaimed. Pairing this with restartPolicy.backoffLimit lets the experiment restart on fresh capacity and resume from the saved checkpoint. The checkpoint should be written to a persistent location, such as an attached PVC or a shared volume, so it survives the pod restart. See the Spot instances page for GCP GKE and recovery details.
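The save-and-resume cycle described above can be sketched in a few lines. This is an illustration under stated assumptions rather than AIchor's own mechanism: the latest.json file name and the JSON state are hypothetical stand-ins for real model state, and the checkpoint directory is assumed to be a persistent mount such as /mnt/checkpoints from the Attached PVC example.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state: dict, ckpt_dir: str) -> Path:
    """Write the checkpoint atomically so an eviction never leaves a partial file."""
    directory = Path(ckpt_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / "latest.json"  # hypothetical file name
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())  # force the data to disk before renaming
    os.replace(tmp_name, target)  # atomic rename within the same directory
    return target


def load_checkpoint(ckpt_dir: str, default: dict) -> dict:
    """Return the last saved state, or a copy of `default` on a first run."""
    target = Path(ckpt_dir) / "latest.json"
    if not target.exists():
        return dict(default)
    with target.open() as f:
        return json.load(f)


# Example use in train.py after a restart:
#   state = load_checkpoint("/mnt/checkpoints", {"step": 0})
```

Writing to a temporary file and renaming it means a pod killed mid-write leaves the previous checkpoint intact.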

Graceful termination

Run a shutdown command before pods are killed (e.g. save a checkpoint).

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  gracefulTermination:
    shutdownCommand: ["python", "scripts/save_checkpoint.py"]
    terminationGracePeriodSeconds: 120

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

Specific engine

Pin the experiment to a non-default engine by name (overrides the project's default engine).

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  engineName: "my-secondary-cluster"

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

Security context for profiling and debugging

Grant the Linux capabilities required by profilers and debuggers (perfmon, ptrace).

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  securityContext:
    perfmon: true
    ptrace: true

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

perfmon grants CAP_PERFMON (needed by profilers such as perf) and ptrace grants CAP_SYS_PTRACE (needed by debuggers and profilers such as gdb and nsys). ptrace is gated by the SYS_PTRACE organisation-level flag, which is disabled by default. When the organisation does not have it enabled the setting is silently ignored, so no capability is granted and no error is raised. Contact the AIchor team to enable it for an organisation.
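Since a disabled organisation flag produces no error, it can help to confirm from inside the container which capabilities were actually granted. The following is a minimal sketch that reads the effective capability mask of the current process from /proc/self/status on Linux; the helper names are hypothetical, and the bit numbers come from the Linux capability list (CAP_SYS_PTRACE is 19, CAP_PERFMON is 38). It inspects the effective set only, which can be empty for a non-root process even when the capability is available to the container.

```python
CAP_SYS_PTRACE = 19  # bit index in the Linux capability mask
CAP_PERFMON = 38     # bit index in the Linux capability mask


def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_capability(mask: int, bit: int) -> bool:
    """True if the given capability bit is set in the mask."""
    return bool((mask >> bit) & 1)


def granted_capabilities(status_path: str = "/proc/self/status") -> dict:
    """Report whether ptrace and perfmon are effective for this process."""
    with open(status_path) as f:
        mask = parse_cap_eff(f.read())
    return {
        "ptrace": has_capability(mask, CAP_SYS_PTRACE),
        "perfmon": has_capability(mask, CAP_PERFMON),
    }


# Example use at experiment start (Linux only):
#   print(granted_capabilities())
```

A result of False for ptrace while the manifest sets ptrace: true is consistent with the organisation-level flag being disabled.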

Security context for a non-root user with a writable shared volume

Run the container under a fixed UID/GID instead of root, and set fsGroup so the non-root process can write to a mounted volume.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  securityContext:
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000

  storage:
    sharedVolume:
      mountPoint: "/mnt/shared"
      sizeGB: 32
      storageClass: "standard-rwx"
      accessMode: "ReadWriteMany"

  types:
    worker:
      count: 2
      resources:
        cpus: 2
        ramRatio: 2

The container runs as UID/GID 1000 rather than root. fsGroup sets the supplemental group that owns the mounted volume, so the non-root process can write to /mnt/shared; without a matching fsGroup, a volume provisioned as root-owned would be read-only to the user. The image must already contain a user with this UID/GID (e.g. created with useradd in the Dockerfile).
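Whether the combination of runAsUser, runAsGroup, and fsGroup took effect can be confirmed from inside the container. The following is a minimal sketch for a Linux container; identity and can_write are hypothetical helpers, and /mnt/shared is the mount point from the manifest above.

```python
import os
import tempfile


def identity() -> dict:
    """Report the user, group, and supplemental groups of this process."""
    return {"uid": os.getuid(), "gid": os.getgid(), "groups": os.getgroups()}


def can_write(mount_point: str) -> bool:
    """True if this process can create (and remove) a file under mount_point."""
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=mount_point):
            return True
    except OSError:
        return False


# Example use at experiment start:
#   print(identity())              # expect uid 1000 and gid 1000
#   print(can_write("/mnt/shared"))
```

Creating a real file is a more reliable probe than inspecting permission bits, because it exercises the same path the training code will use.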

Restart policy

Allow the experiment to retry up to 3 times before being marked failed. For jax, retries are triggered by software failures only; for kuberay and jobset, they are also triggered by pod evictions.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  restartPolicy:
    backoffLimit: 3

  types:
    worker:
      count: 1
      resources:
        cpus: 2
        ramRatio: 2

Interactive debugging (VS Code / Cursor)

Start a VS Code tunnel into the running experiment for interactive debugging from a local IDE. A tunnel can also be set up for Cursor.

kind: "AIchorManifest"
apiVersion: "0.2.3"

spec:
  operator: "jax"
  image: "my-image"
  command: "python examples/train.py"

  debug:
    vscode:
      enabled: true       # required
      path: "/aichor/code" # optional, defaults to "code"; absolute path to the binary if not on $PATH
      provider: "github"   # optional, tunnel provider: "github" or "microsoft" (default: "github")

  types:
    worker:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2
        accelerators:
          gpu:
            count: 1
            product: "A100-SXM4-80GB"
            type: "gpu"

The IDE binary must be present in the experiment's image, and enabling this mode overrides the manifest's command. See Debug Tools for the full setup.
