Manifest Reference

Complete field-by-field reference for manifest.yaml.

See the Manifest Examples page for complete, per-operator manifest.yaml files.

kind (string, required): Always AIchorManifest

Always AIchorManifest.

apiVersion (string, required): Schema version (current: 0.2.3)

Schema version. Set to 0.2.3 for new manifests.

name (string, optional): Human-readable manifest name

Name for this manifest.
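
As a minimal sketch, the three top-level fields above could open a manifest.yaml as follows. The name value is a hypothetical placeholder, and the required builder and spec blocks are left out here.

```yaml
kind: AIchorManifest
apiVersion: 0.2.3
name: my-experiment  # hypothetical name; this field is optional
```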

builder (object, required)

buildArgs (map, optional): Docker build arguments

Key/value map of Docker build arguments (passed as --build-arg). Values may be strings or booleans.

buildArgs:
  PYTHON_VERSION: "3.11"
  USE_CUDA: true

context (string, optional): Docker build context path

Docker build context path.

Default: "."

dockerfile (string, optional): Path to the Dockerfile

Path to the Dockerfile, relative to the repository root (e.g. ./Dockerfile).

image (string, optional): Docker image name

Name of the Docker image to build and use. Must match spec.image.

registry (string, optional): Docker registry prefix

Registry prefix used to push and pull the image (e.g. europe-west4-docker.pkg.dev/my-project/my-repo).

target (string, optional): Dockerfile target stage

Target stage for multi-stage Dockerfiles (passed as --target).
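
Put together, a builder block using the fields above might look like the sketch below. The image name and the target stage name are hypothetical placeholders. The registry and Dockerfile path reuse the examples given in this reference.

```yaml
builder:
  image: my-image          # hypothetical; must match spec.image
  context: "."             # the default build context
  dockerfile: ./Dockerfile
  registry: europe-west4-docker.pkg.dev/my-project/my-repo
  target: runtime          # hypothetical stage name, passed as --target
  buildArgs:
    PYTHON_VERSION: "3.11"
```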

clone (object, optional)

depth (integer, optional): Shallow clone depth

Number of commits to fetch. 1 produces a shallow clone. Increase when git log history is needed. Minimum: 1.

Default: 1

includeDotGit (boolean, optional): Keep .git directory in clone

When true, the .git directory is kept in the cloned repository. Required for submodule initialisation.

Default: false

submodules (string, optional): Submodule init mode

Controls submodule initialisation. Only processed when includeDotGit is true.

| Value | Behaviour |
| --- | --- |
| "false" | No submodules cloned |
| "true" | Non-recursive clone |
| "recursive" | Recursive clone |

Default: "false"
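
Because submodules is only processed when includeDotGit is true, the two are normally set together. The sketch below shows a clone block that initialises submodules recursively. Writing clone as a top-level block is an assumption inferred from the field order in this reference.

```yaml
clone:
  depth: 1                 # shallow clone (the default)
  includeDotGit: true      # needed for submodules to be processed
  submodules: "recursive"  # a string value, not a boolean
```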

spec (object, required)

activeDeadlineSeconds (integer, optional): Max experiment duration in seconds

Maximum running duration for the experiment in seconds. Superseded by timeout when both are set.
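
To illustrate the precedence rule, the sketch below sets both fields with illustrative values. According to the rule above, the 12-hour timeout applies and the 24-hour activeDeadlineSeconds is superseded.

```yaml
spec:
  activeDeadlineSeconds: 86400  # 24 hours; superseded because timeout is also set
  timeout: "12h"                # takes precedence
```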

command (string, required): Command to run in the container

The command or script to execute inside the container.

spec:
  command: "python3 -u main.py"

debug (object)

jupyter (object, optional): Jupyter server (kuberay only)

Starts a Jupyter server on the Ray head node. Only available when operator is kuberay.

enabled (boolean, optional): Enable the Jupyter server

Whether to start the Jupyter server.

Default: false

path (string, optional): Path to Jupyter binary

Path to the Jupyter binary. Defaults to $PATH resolution.

vscode (object, optional): VSCode tunnel (all operators)

Attaches a VSCode tunnel to the experiment. Available for all operators.

enabled (boolean, optional): Enable the VSCode tunnel

Whether to start the VSCode tunnel.

Default: false

path (string, optional): Path to VSCode CLI binary

Path to the VSCode CLI binary. Defaults to $PATH resolution.

provider (string, optional): Tunnel auth provider

VSCode tunnel authentication provider. Values: "github", "microsoft".

Default: "github"
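
A debug block enabling both tools could be sketched as follows. It assumes that debug is nested under spec, as the field order in this reference suggests. The operator is set to kuberay because the Jupyter server is only available there.

```yaml
spec:
  operator: kuberay
  debug:
    jupyter:
      enabled: true          # kuberay only
    vscode:
      enabled: true          # available for all operators
      provider: "microsoft"  # default is "github"
```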

engineName (string, optional): Target engine name

Name of the engine to run the experiment on. When omitted, the project's default engine is used. Useful in multi-engine projects.

env (map, optional): Experiment-level environment variables

Key/value map of environment variables scoped to this experiment. Each entry is injected into the containers as a plain literal value, which makes env suited to per-run values such as hyperparameters without touching the shared project configuration.

env:
  LEARNING_RATE: "0.001"
  BATCH_SIZE: "32"
  NUM_EPOCHS: "100"

When the same key is also defined as a project-level variable, the spec.env value takes precedence. Predefined variables — the AICHOR_* variables and other protected names — cannot be set here; attempting to do so fails validation when the experiment is submitted. See Environment Variables for the full list.

extraLabels (map, optional): Extra Kubernetes pod labels

Additional Kubernetes labels applied to the experiment pods.

extraLabels:
  team: research
  project: my-project

gracefulTermination (object)

Runs a command when a pod is being terminated, so the experiment can shut down cleanly, most commonly to save a checkpoint before a spot instance is reclaimed.

The command is installed as a Kubernetes preStop hook. On termination (a spot-instance reclaim, a cancel, or any pod deletion) it runs first; once it returns, the container receives SIGTERM. The command therefore does not need to trap signals itself.

When this block is present, both shutdownCommand and terminationGracePeriodSeconds are required, and the grace period must be greater than 0. When the block is omitted entirely, no shutdown command runs and Kubernetes applies its own default grace period of 30 seconds.

gracefulTermination:
  shutdownCommand: ["python", "save_checkpoint.py"]
  terminationGracePeriodSeconds: 60

Spot instances

Cloud providers give only a short interruption notice for spot/preemptible instances; AWS, for example, gives around 120 seconds. On a spot reclaim the node is drained within that window, so a terminationGracePeriodSeconds larger than the provider's notice is not fully honoured. Keep the shutdown command well within that notice (about 120 seconds on AWS) so that checkpoints complete reliably. On a normal cancel or deletion the full grace period is honoured. For details, see your cloud provider's documentation.
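
As an illustration of this advice, the sketch below keeps the grace period under the roughly 120-second AWS notice. The value 90 is an illustrative choice rather than a recommended setting, and the block is assumed to sit under spec.

```yaml
spec:
  gracefulTermination:
    shutdownCommand: ["python", "save_checkpoint.py"]
    terminationGracePeriodSeconds: 90  # illustrative; below the ~120 s spot notice
```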

shutdownCommand (array[string], required): Command run before the pod is killed

Command to execute inside the container before termination (e.g. to save a checkpoint). Provided as an argument vector — the executable followed by its arguments.

shutdownCommand: ["python", "save_checkpoint.py"]
terminationGracePeriodSeconds (integer, required): Grace period before force-kill

How long, in seconds, Kubernetes waits for the shutdown sequence (the shutdownCommand followed by SIGTERM handling) to complete before force-killing the pod with SIGKILL. Must be greater than 0.

The countdown starts when termination begins, so this value must cover the full time the shutdownCommand needs to run.

image (string, required): Docker image to run

The Docker image to run. Must match builder.image.
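
The matching requirement can be sketched as follows, with a hypothetical image name used in both places:

```yaml
builder:
  image: my-image  # hypothetical name
spec:
  image: my-image  # must be identical to builder.image
```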

operator (string, required): Workload operator

The operator that manages the workload.

| Value | Description |
| --- | --- |
| kuberay | Ray cluster managed by KubeRay |
| jax | JAX distributed training |
| pytorch | PyTorch distributed training |
| xgboost | XGBoost distributed training |
| jobset | Generic Kubernetes JobSet |

restartPolicy (object)

backoffLimit (integer, optional): Max restarts before failure

Number of allowed restarts before the experiment is marked as failed. Minimum: 0.

On jobset and kuberay, this covers both eviction (spot preemption) and software failures. On other operators it covers only software failures.

Default: 0
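
For example, a jobset experiment on spot instances might allow a few restarts so that a preemption does not immediately fail the run. The value 3 below is illustrative only.

```yaml
spec:
  operator: jobset
  restartPolicy:
    backoffLimit: 3  # illustrative; on jobset this covers evictions and software failures
```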

securityContext (object)

spec:
  securityContext:
    runAsUser: 1000 # run as a non-root user...
    runAsGroup: 1000
    fsGroup: 1000 # ...and give that group ownership of mounted volumes
  storage:
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 50
      accessMode: ReadWriteMany
      storageClass: standard-rwo

fsGroup (integer, optional): GID applied to mounted volumes

Sets a supplementary group ID (GID) that owns all volumes mounted into the pod. Kubernetes changes the group ownership of mounted volumes to this GID and adds it to every process in the pod, so files created on those volumes are group-owned and group-writable by this GID.

This is mainly useful when the container does not run as root. By default a mounted volume (such as a PVC or a shared volume) is owned by root, so a non-root process cannot write to it. Setting fsGroup makes the volume writable by the chosen group, and lets several containers in the same pod share the same files.

In the example at the top of this section, the process runs as UID 1000 and can read and write /mnt/shared because the volume is group-owned by GID 1000. Without fsGroup, that same non-root process would be denied write access to the volume.

perfmon (boolean, optional): Grant CAP_PERFMON

Grants CAP_PERFMON to allow performance-monitoring syscalls. Required by profilers such as perf.

Default: false

ptrace (boolean, optional): Grant CAP_SYS_PTRACE

Grants CAP_SYS_PTRACE to allow process tracing. Required by debuggers and profilers such as gdb and nsys.

This capability is gated by an organisation-level flag, which is disabled by default. When the organisation does not have it enabled, the setting is silently ignored (no CAP_SYS_PTRACE is granted and no error is raised). Contact the AIchor team to have it enabled for an organisation.

Default: false
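
A profiling or debugging run might request both capabilities, as in this sketch:

```yaml
spec:
  securityContext:
    perfmon: true  # CAP_PERFMON, e.g. for perf
    ptrace: true   # CAP_SYS_PTRACE; silently ignored unless the organisation-level flag is enabled
```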

runAsGroup (integer, optional): Primary GID for all processes

Sets the primary group ID (GID) that every process in the container runs as, overriding the group defined in the image. This controls the group ownership of files the processes create and which group-restricted files they can read. It is the group counterpart to runAsUser and is commonly paired with it (and with fsGroup) to run as a specific non-root identity.

runAsUser (integer, optional): UID for all container processes

Sets the user ID (UID) that every process in the container runs as, overriding the user defined in the image's Dockerfile. Use it to run as a specific non-root user.

A value of 0 means root. When omitted, the user baked into the image is used. To write to mounted volumes as a non-root user, combine runAsUser with fsGroup so the volumes are group-writable.

storage (object)

Attaches volumes to every pod in the experiment. Use attachExistingPVCs to mount Persistent Volume Claims that already exist in the project, and sharedVolume to have AIchor provision a temporary volume shared by all pods for the duration of the run.

spec:
  storage:
    attachExistingPVCs:
      - name: datasets-pvc
        mountPoint: /mnt/datasets
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 100
      accessMode: ReadWriteMany
      storageClass: standard-rwo

attachExistingPVCs (array[object], optional): Attach existing PVCs to all pods

List of existing Persistent Volume Claims to attach to all pods.

[].mountPoint (string, required): Mount path inside the container

Mount path inside the container.

[].name (string, required): PVC name

Name of the PVC to attach.

sharedVolume (object, optional): Ephemeral shared volume across all pods

AIchor creates a temporary shared volume, mounts it on all pods, and deletes it when the experiment ends.

accessMode (string, required): Kubernetes PV access mode

Kubernetes PV access mode, e.g. ReadWriteOnce or ReadWriteMany.

mountPoint (string, required): Mount path inside the container

Mount path inside the container (e.g. /mnt/shared).

sizeGB (integer, required): Volume size in GiB

Volume size in GiB.

Storage classes enforce a minimum size

Each cloud provider applies a minimum volume size per storage class. A request below that floor is rejected or silently rounded up by the CSI driver, and billing is based on the provisioned size, not the requested size. The value of sizeGB must respect the minimum of the chosen storageClass.

Common minimums:

| Storage class | Provider | Minimum sizeGB |
| --- | --- | --- |
| gp2, gp3 (EBS) | AWS | 1 |
| io1, io2 (EBS) | AWS | 4 |
| pd-standard, pd-balanced, pd-ssd (Persistent Disk) | GCP | 10 |
| standard / premium (Filestore, NFS) | GCP | 1024 / 2560 |

For example, requesting sizeGB: 1 against a pd-balanced storage class on GCP still provisions a 10 GiB disk, and a Filestore-backed class is billed for at least 1 TiB regardless of the requested value.
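
To avoid a mismatch between the requested and the provisioned size, the request can be set at or above the floor. The sketch below assumes a storage class backed by pd-balanced on GCP. The class name is a hypothetical placeholder, since class names are cluster-specific.

```yaml
spec:
  storage:
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 10                          # meets the 10 GiB Persistent Disk minimum
      accessMode: ReadWriteOnce
      storageClass: my-pd-balanced-class  # hypothetical class name
```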

storageClass (string, required): Kubernetes storage class

Kubernetes storage class name (e.g. gp2, standard-rwo, ceph-rbd, cephfs, longhorn).

tensorboard (object)

enabled (boolean, required): Enable TensorBoard

Whether to run a TensorBoard sidecar for the experiment. When enabled, the TensorBoard dashboard is accessible from the experiment's page.

usePVC (object, optional): Read logs from a PVC

When set, TensorBoard reads logs from a PVC instead of object storage. Requires spec.storage.attachExistingPVCs to be configured.

enabled (boolean, required): Use PVC for TensorBoard logs

Whether to mount a PVC for TensorBoard log storage.

name (string, optional): PVC name from attachExistingPVCs

Name of a PVC from spec.storage.attachExistingPVCs. When omitted, the first PVC in the list is used.
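
The relationship between the two blocks can be sketched as follows. The PVC name and mount path are hypothetical, and tensorboard is assumed to sit under spec alongside storage.

```yaml
spec:
  storage:
    attachExistingPVCs:
      - name: logs-pvc          # hypothetical PVC that already exists in the project
        mountPoint: /mnt/logs
  tensorboard:
    enabled: true
    usePVC:
      enabled: true
      name: logs-pvc            # must refer to an entry in attachExistingPVCs
```

If name were omitted here, the first PVC in attachExistingPVCs would be used, which in this sketch is the same logs-pvc entry.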

timeout (string, optional): Max duration, e.g. "1d 6h"

Maximum duration as a human-readable string. Takes precedence over activeDeadlineSeconds.

Accepts combinations of w (weeks), d (days), h (hours), m (minutes), s (seconds).

timeout: "1w 2d 5h 6m 20s"
types (object, required)

Defines the worker pools. spec.types is required for every operator, but its field names and shape depend on the operator:

  • jax, pytorch, xgboost: a single worker pool.
  • jobset: named replicated jobs (free-form job keys).
  • kuberay: one Head pod plus one or more Workers pools.

For all Kubernetes-based operators (jax, pytorch, xgboost, jobset, kuberay), each pool's resources block accepts the same fields.