Manifest Reference

Complete field-by-field reference for manifest.yaml.

See the Manifest Examples page for complete, per-operator manifest.yaml files.

kind (string, required): Always AIchorManifest

Always AIchorManifest.

apiVersion (string, required): Schema version (current: 0.2.3)

Schema version. Set to 0.2.3 for new manifests.

name (string, optional): Human-readable manifest name

Name for this manifest.

builder (object, required)

buildArgs (map, optional): Docker build arguments

Key/value map of Docker build arguments (passed as --build-arg). Values may be strings or booleans.

buildArgs:
  PYTHON_VERSION: "3.11"
  USE_CUDA: true

skipBuild (object, optional): Reuse a previous experiment's image instead of building

Controls whether the build step reuses a previous experiment's image instead of building a new one.

enabled (boolean, optional): Enable skipping the build

Whether to skip the build and reuse a previous experiment's image. When false, the rest of this block is ignored and the build runs as usual.

Default: false

experimentID (string, optional): UUID of the previous experiment to reuse

UUID of a previous experiment whose image should be reused.

failIfNotFound (boolean, optional): Fail instead of falling back to a normal build

When true, the build fails if the previous experiment's image cannot be found (e.g. a mistyped UUID, or the image was purged by the registry's cleanup policy). When false, the missing image is ignored and a new image is built.

Default: false
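
As a rough illustration, the fragment below reuses the image of an earlier run and fails fast if that image is gone. The UUID is a made-up placeholder, and only the skipBuild block itself is shown, in the same way as the buildArgs example above.

```yaml
skipBuild:
  enabled: true
  # Placeholder UUID (hypothetical): replace with the ID of a real previous experiment.
  experimentID: "123e4567-e89b-12d3-a456-426614174000"
  # Fail instead of silently rebuilding when the image cannot be found.
  failIfNotFound: true
```

With failIfNotFound left at its default of false, the same block would fall back to a normal build when the image is missing.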

context (string, optional): Docker build context path

Docker build context path.

Default: "."

dockerfile (string, optional): Path to the Dockerfile

Path to the Dockerfile, relative to the repository root (e.g. ./Dockerfile).

image (string, optional): Docker image name

Name of the Docker image to build and use. Must match spec.image.

registry (string, optional): Docker registry prefix

Registry prefix used to push and pull the image (e.g. europe-west4-docker.pkg.dev/my-project/my-repo).

target (string, optional): Dockerfile target stage

Target stage for multi-stage Dockerfiles (passed as --target).
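
A minimal sketch of a builder block that combines the fields above. The image name and the stage name are hypothetical, and the fragment assumes that context, dockerfile, image and target all sit directly under builder.

```yaml
builder:
  image: my-image            # hypothetical name; must match spec.image
  dockerfile: ./Dockerfile   # relative to the repository root
  context: .                 # the default build context
  target: runtime            # hypothetical stage name in a multi-stage Dockerfile
```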

clone (object, optional)

depth (integer, optional): Shallow clone depth

Number of commits to fetch. A depth of 1 fetches only the most recent commit. Increase it when git log history is needed. Minimum: 1.

Default: 1

includeDotGit (boolean, optional): Keep .git directory in clone

When true, the .git directory is kept in the cloned repository. Required for submodule initialisation.

Default: false

submodules (string, optional): Submodule init mode

Controls submodule initialisation. Only processed when includeDotGit is true.

Value	Behaviour
`"false"`	No submodules cloned
`"true"`	Non-recursive clone
`"recursive"`	Recursive clone

Default: "false"
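
For instance, a repository with nested submodules might use a clone block along these lines. This is only a sketch: it shows the clone block on its own and makes no claim about where the block sits in the manifest.

```yaml
clone:
  depth: 1                 # fetch only the most recent commit
  includeDotGit: true      # keep .git; submodules are only processed when this is true
  submodules: "recursive"  # string value, not a boolean
```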

spec (object, required)

activeDeadlineSeconds (integer, optional): Max experiment duration in seconds

Maximum running duration for the experiment in seconds. Superseded by timeout when both are set.

command (string, required): Command to run in the container

The command or script to execute inside the container.

spec:
  command: "python3 -u main.py"

debug (object)

jupyter (object, optional): Jupyter server, kuberay only

Starts a Jupyter server on the Ray head node. Only available when operator is kuberay.

enabled (boolean, optional): Enable the Jupyter server

Whether to start the Jupyter server.

Default: false

path (string, optional): Path to Jupyter binary

Path to the Jupyter binary. When omitted, the binary is resolved from $PATH.

vscode (object, optional): VSCode tunnel, all operators

Attaches a VSCode tunnel to the experiment. Available for all operators.

enabled (boolean, optional): Enable the VSCode tunnel

Whether to start the VSCode tunnel.

Default: false

path (string, optional): Path to VSCode CLI binary

Path to the VSCode CLI binary. When omitted, the binary is resolved from $PATH.

provider (string, optional): Tunnel auth provider

VSCode tunnel authentication provider. Values: "github", "microsoft".

Default: "github"
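
A possible debug block that turns on the VSCode tunnel, assuming vscode is nested directly under debug as the field order above suggests:

```yaml
debug:
  vscode:
    enabled: true
    provider: "github"   # the default; "microsoft" is the other documented value
```

The jupyter block would take the same enabled and path fields, but only applies when operator is kuberay.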

engineName (string, optional): Target engine name

Name of the engine to run the experiment on. When omitted, the project's default engine is used. Useful in multi-engine projects.

env (map, optional): Experiment-level environment variables

Key/value map of environment variables scoped to this experiment. Each entry is injected into the containers as a plain literal value, which makes env suited to per-run values such as hyperparameters without touching the shared project configuration.

env:
  LEARNING_RATE: "0.001"
  BATCH_SIZE: "32"
  NUM_EPOCHS: "100"

When the same key is also defined as a project-level variable, the spec.env value takes precedence. Predefined variables — the AICHOR_* variables and other protected names — cannot be set here; attempting to do so fails validation when the experiment is submitted. See Environment Variables for the full list.

experimentConfig (string, optional): Experiment-level configuration

Configuration passed to all experiment pods. The field accepts any string, so no particular structure is required. One possible layout is:

experimentConfig: |
  config1: this-is-a-config
  config2: this-is-another-config
  parent-config:
    nested-config: this-is-nested

extraLabels (map, optional): Extra Kubernetes pod labels

Additional Kubernetes labels applied to the experiment pods.

extraLabels:
  team: research
  project: my-project

gracefulTermination (object)

Runs a command when a pod is being terminated, so the experiment can shut down cleanly, most commonly to save a checkpoint before a spot instance is reclaimed.

The command is installed as a Kubernetes preStop hook. On termination (a spot-instance reclaim, a cancel, or any pod deletion) it runs first; once it returns, the container receives SIGTERM. The command therefore does not need to trap signals itself.

When this block is present, both shutdownCommand and terminationGracePeriodSeconds are required, and the grace period must be greater than 0. When the block is omitted entirely, no shutdown command runs and Kubernetes applies its own default grace period of 30 seconds.

gracefulTermination:
  shutdownCommand: ["python", "save_checkpoint.py"]
  terminationGracePeriodSeconds: 60

Spot instances

Cloud providers give only a short interruption notice for spot/preemptible instances; AWS, for example, gives around 120 seconds. On a spot reclaim the node is drained within that window, so a terminationGracePeriodSeconds larger than the provider's notice will not be fully honoured. Keep the shutdown command well within the provider's notice (~120 seconds on AWS) so that checkpoints complete reliably. On a normal cancel or deletion the full grace period is honoured. For details, check your cloud provider's documentation.

shutdownCommand (array[string], required): Command run before the pod is killed

Command to execute inside the container before termination (e.g. to save a checkpoint). Provided as an argument vector — the executable followed by its arguments.

shutdownCommand: ["python", "save_checkpoint.py"]

terminationGracePeriodSeconds (integer, required): Grace period before force-kill

How long, in seconds, Kubernetes waits for the shutdown sequence (the shutdownCommand followed by SIGTERM handling) to complete before force-killing the pod with SIGKILL. Must be greater than 0.

The countdown starts when termination begins, so this value must cover the full time the shutdownCommand needs to run.

image (string, required): Docker image to run

The Docker image to run. Must match builder.image.

operator (string, required): Workload operator

The operator that manages the workload.

Value	Description
`kuberay`	Ray cluster managed by KubeRay
`jax`	JAX distributed training
`pytorch`	PyTorch distributed training
`xgboost`	XGBoost distributed training
`jobset`	Generic Kubernetes JobSet

restartPolicy (object)

backoffLimit (integer, optional): Max restarts before failure

Number of allowed restarts before the experiment is marked as failed. Minimum: 0.

On jobset and kuberay, this covers both eviction (spot preemption) and software failures. On other operators it covers only software failures.

Default: 0
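
As a small illustration, the fragment below allows a few restarts, which can be useful on spot capacity. The value 3 is arbitrary.

```yaml
restartPolicy:
  backoffLimit: 3   # up to 3 restarts before the experiment is marked as failed
```

On jobset and kuberay those restarts would also absorb spot preemptions; on the other operators they only cover software failures.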

securityContext (object)

spec:
  securityContext:
    runAsUser: 1000     # run as a non-root user...
    runAsGroup: 1000
    fsGroup: 1000       # ...and give that group ownership of mounted volumes
  storage:
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 50
      accessMode: ReadWriteMany
      storageClass: standard-rwo

fsGroup (integer, optional): GID applied to mounted volumes

Sets a supplementary group ID (GID) that owns all volumes mounted into the pod. Kubernetes changes the group ownership of mounted volumes to this GID and adds it to every process in the pod, so files created on those volumes are group-owned and group-writable by this GID.

This is mainly useful when the container does not run as root. By default a mounted volume (such as a PVC or a shared volume) is owned by root, so a non-root process cannot write to it. Setting fsGroup makes the volume writable by the chosen group, and lets several containers in the same pod share the same files.

In the example at the top of this section, the process runs as UID 1000 and can read and write /mnt/shared because the volume is group-owned by GID 1000. Without fsGroup, that same non-root process would be denied write access to the volume.

perfmon (boolean, optional): Grant CAP_PERFMON

Grants CAP_PERFMON to allow performance-monitoring syscalls. Required by profilers such as perf.

Default: false

ptrace (boolean, optional): Grant CAP_SYS_PTRACE

Grants CAP_SYS_PTRACE to allow process tracing. Required by debuggers and profilers such as gdb and nsys.

This capability is gated by an organisation-level flag, which is disabled by default. When the organisation does not have it enabled, the setting is silently ignored (no CAP_SYS_PTRACE is granted and no error is raised). Contact the AIchor team to have it enabled for an organisation.

Default: false

runAsGroup (integer, optional): Primary GID for all processes

Sets the primary group ID (GID) that every process in the container runs as, overriding the group defined in the image. This controls the group ownership of files the processes create and which group-restricted files they can read. It is the group counterpart to runAsUser and is commonly paired with it (and with fsGroup) to run as a specific non-root identity.

runAsUser (integer, optional): UID for all container processes

Sets the user ID (UID) that every process in the container runs as, overriding the user defined in the image's Dockerfile. Use it to run as a specific non-root user.

A value of 0 means root. When omitted, the user baked into the image is used. To write to mounted volumes as a non-root user, combine runAsUser with fsGroup so the volumes are group-writable.

storage (object)

Attaches volumes to every pod in the experiment. Use attachExistingPVCs to mount Persistent Volume Claims that already exist in the project, and sharedVolume to have AIchor provision a temporary volume shared by all pods for the duration of the run.

spec:
  storage:
    attachExistingPVCs:
      - name: datasets-pvc
        mountPoint: /mnt/datasets
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 100
      accessMode: ReadWriteMany
      storageClass: standard-rwo

attachExistingPVCs (array[object], optional): Attach existing PVCs to all pods

List of existing Persistent Volume Claims to attach to all pods.

[].mountPoint (string, required): Mount path inside the container

Mount path inside the container.

[].name (string, required): PVC name

Name of the PVC to attach.

sharedVolume (object, optional): Ephemeral shared volume across all pods

AIchor creates a temporary shared volume, mounts it on all pods, and deletes it when the experiment ends.

accessMode (string, required): Kubernetes PV access mode

Kubernetes PV access mode, e.g. ReadWriteOnce or ReadWriteMany.

mountPoint (string, required): Mount path inside the container

Mount path inside the container (e.g. /mnt/shared).

sizeGB (integer, required): Volume size in GiB

Volume size in GiB.

Storage classes enforce a minimum size

Each cloud provider applies a minimum volume size per storage class. A request below that floor is rejected or silently rounded up by the CSI driver, and billing is based on the provisioned size, not the requested size. The value of sizeGB must respect the minimum of the chosen storageClass.

Common minimums:

Storage class	Provider	Minimum `sizeGB`
`gp2`, `gp3` (EBS)	AWS	1
`io1`, `io2` (EBS)	AWS	4
`pd-standard`, `pd-balanced`, `pd-ssd` (Persistent Disk)	GCP	10
`standard` / `premium` (Filestore, NFS)	GCP	1024 / 2560

For example, requesting sizeGB: 1 against a pd-balanced storage class on GCP still provisions a 10 GiB disk, and a Filestore-backed class is billed for at least 1 TiB regardless of the requested value.

storageClass (string, required): Kubernetes storage class

Kubernetes storage class name (e.g. gp2, standard-rwo, ceph-rbd, cephfs, longhorn).

tensorboard (object)

enabled (boolean, required): Enable TensorBoard

Whether to run a TensorBoard sidecar for the experiment. When enabled, the TensorBoard dashboard is accessible from the experiment's page.

usePVC (object, optional): Read logs from a PVC

When set, TensorBoard reads logs from a PVC instead of object storage. Requires spec.storage.attachExistingPVCs to be configured.

enabled (boolean, required): Use PVC for TensorBoard logs

Whether to mount a PVC for TensorBoard log storage.

name (string, optional): PVC name from attachExistingPVCs

Name of a PVC from spec.storage.attachExistingPVCs. When omitted, the first PVC in the list is used.
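
The relationship between usePVC and attachExistingPVCs can be sketched as follows. The fragment reuses the datasets-pvc name from the storage example and assumes that tensorboard sits next to storage under spec.

```yaml
spec:
  storage:
    attachExistingPVCs:
      - name: datasets-pvc
        mountPoint: /mnt/datasets
  tensorboard:
    enabled: true
    usePVC:
      enabled: true
      name: datasets-pvc   # must name a PVC listed above; omit to use the first one
```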

timeout (string, optional): Max duration, e.g. "1d 6h"

Maximum duration as a human-readable string. Takes precedence over activeDeadlineSeconds.

Accepts combinations of w (weeks), d (days), h (hours), m (minutes), s (seconds).

timeout: "1w 2d 5h 6m 20s"
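
To make the unit arithmetic concrete, here is a shorter value worked out in seconds. The commented-out activeDeadlineSeconds line is shown only for comparison.

```yaml
# 1 day + 6 hours = 86400 s + 21600 s = 108000 s
timeout: "1d 6h"
# The same limit expressed in seconds. If both fields were set, timeout would take precedence.
# activeDeadlineSeconds: 108000
```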

types (object, required)

Defines the worker pools. spec.types is required for every operator, and the field names it expects depend on the operator:

jax, pytorch, xgboost: a single worker pool.
jobset: named replicated jobs (free-form job keys).
kuberay: one Head pod plus one or more Workers pools.

For all Kubernetes-based operators (jax, pytorch, xgboost, jobset, kuberay), each pool's resources block accepts the same fields. The per-operator sections below give the full schema.

See the JAX framework page for more info.

Example spec.types section:

types:
  worker:
    count: 2
    resources:
      cpus: 16
      memory: 64
      accelerators:
        gpu:
          count: 1
          product: Tesla-V100-SXM3-32GB

worker (object, required): Single homogeneous worker pool

count (integer, optional): Number of pods

Number of pods. Minimum: 0.

Default: 1

resources (object, required)

accelerators (object, optional)

gpu.count (integer, optional): Number of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.product (string, required when count > 0): GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.type (string, optional): GPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"

tpu.topology (string, required): TPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCount (integer, optional): TPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions.

Default: 0

tpu.type (string, required): TPU accelerator type, GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
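
The VM formula can be illustrated with a small multi-host slice. This is a sketch only: the nesting under accelerators.tpu is assumed by analogy with the gpu example, other resource fields are left out, and whether this type and topology are available depends on the engine.

```yaml
resources:
  accelerators:
    tpu:
      type: tpu-v5-lite-podslice
      topology: "2x4"      # 2 × 4 = 8 chips in the slice
      tpuChipsCount: 4     # 4 chips per pod, so VMs = 8 / 4 = 2
```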

cpuLimitRatio (integer, optional): CPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2

cpus (integer, optional): CPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1
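
As a worked example of the request and limit arithmetic, with arbitrary values:

```yaml
resources:
  cpus: 8            # CPU request: 8 cores
  cpuLimitRatio: 2   # CPU limit: 8 × 2 = 16 cores
```

With cpuLimitRatio: 1 the limit would equal the request (8 cores), so the container could not burst.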

extraSelectors (map, optional): Extra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerations (array[object], optional): Pod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled onto nodes protected by matching taints.

[].effect (string, optional): Taint effect (NoSchedule, PreferNoSchedule, or NoExecute)

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].key (string, required): Taint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operator (string, optional): Operator (Equal or Exists)

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSeconds (integer, optional): Seconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].value (string, optional): Taint value to match

Taint value to match. Must be empty when operator is "Exists".
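
The two operator modes might be combined as in the sketch below. The taint keys and the value are hypothetical and would have to match taints that actually exist on the cluster's nodes.

```yaml
resources:
  extraTolerations:
    - key: "dedicated"        # hypothetical taint key
      operator: "Equal"
      value: "research"       # hypothetical taint value
      effect: "NoSchedule"
    - key: "maintenance"      # hypothetical taint key
      operator: "Exists"      # no value is given when operator is Exists
      effect: "NoExecute"
      tolerationSeconds: 300  # tolerate the taint for 5 minutes before eviction
```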

machineName (string, optional): Target a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memory (integer, optional): Memory in GiB, mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

ramRatio (integer, optional): RAM = cpus × ramRatio GiB, mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

rdma (object, optional)

devices (array[string], required): RDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.

shmSizeGB (integer, optional): Shared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit
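
Putting the memory, shared-memory and RDMA fields together, a resources block could look like the following sketch. The numbers are arbitrary, and RDMA devices are only usable on engines whose nodes expose them.

```yaml
resources:
  cpus: 8
  ramRatio: 4          # memory = 8 × 4 = 32 GiB (memory must not be set as well)
  shmSizeGB: 8         # 8 GiB at /dev/shm; the memory limit is raised by 8 GiB
  rdma:
    devices:           # the two device names listed for both GPU and CPU nodes
      - sriov_a
      - sriov_b
```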

See the JobSet framework page for more info.

Example: job pools use free-form names that match ^[a-z0-9-]+$ (e.g. master, worker-cpu):

types:
  master:
    count: 1
    completions: 1
    parallelisms: 1
    resources:
      cpus: 16
      ramRatio: 4
  worker-cpu:
    count: 2
    completions: 1
    parallelisms: 1
    resources:
      cpus: 4
      ramRatio: 2

<job-name> (map, required): Replicated job specs keyed by job name

completions (integer, optional): Pods that must succeed per replica

Number of pods that must complete successfully per replica. Minimum: 0.

Default: 1

count (integer, optional): Number of job replicas

Number of replicas of this job. Minimum: 0.

Default: 1

parallelisms (integer, optional): Pods running in parallel per replica

Number of pods that run in parallel per replica. Minimum: 0.

Default: 1

resources (object, required)

accelerators (object, optional)

gpu.count (integer, optional): Number of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.product (string, required when count > 0): GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.type (string, optional): GPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"

tpu.topology (string, required): TPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCount (integer, optional): TPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions.

Default: 0

tpu.type (string, required): TPU accelerator type, GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.

cpuLimitRatio (integer, optional): CPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2

cpus (integer, optional): CPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1

extraSelectors (map, optional): Extra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerations (array[object], optional): Pod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled onto nodes protected by matching taints.

[].effect (string, optional): Taint effect (NoSchedule, PreferNoSchedule, or NoExecute)

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].key (string, required): Taint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operator (string, optional): Operator (Equal or Exists)

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSeconds (integer, optional): Seconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].value (string, optional): Taint value to match

Taint value to match. Must be empty when operator is "Exists".

machineName (string, optional): Target a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memory (integer, optional): Memory in GiB, mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

ramRatio (integer, optional): RAM = cpus × ramRatio GiB, mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

rdma (object, optional)

devices (array[string], required): RDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.

shmSizeGB (integer, optional): Shared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit

See the KubeRay framework page for more info.

Example: one Head pod plus one or more Workers pools:

types:
  Head:
    resources:
      cpus: 8
      ramRatio: 2
    rayStartParams:
      object-store-memory: "1000000000"  # cap the Ray object store at ~1 GB
  Workers:
    - name: gpu-workers
      count: 2
      resources:
        cpus: 16
        memory: 64
        accelerators:
          gpu:
            count: 2
            product: Tesla-V100-SXM3-32GB

The kuberay operator also accepts spec.rayVersion, described after the Head fields.

Head (object, required): Single Ray head node, no count field

rayStartParams (map, optional): Extra ray start CLI parameters

Additional ray start CLI parameters. See the Ray CLI reference.

resources (object, required)

accelerators (object, optional)

gpu.count (integer, optional): Number of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.product (string, required when count > 0): GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.type (string, optional): GPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"

tpu.topology (string, required): TPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCount (integer, optional): TPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions.

Default: 0

tpu.type (string, required): TPU accelerator type, GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.

cpuLimitRatio (integer, optional): CPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2

cpus (integer, optional): CPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1

extraSelectors (map, optional): Extra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerations (array[object], optional): Pod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled onto nodes protected by matching taints.

[].effect (string, optional): Taint effect (NoSchedule, PreferNoSchedule, or NoExecute)

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].key (string, required): Taint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operator (string, optional): Operator (Equal or Exists)

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSeconds (integer, optional): Seconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].value (string, optional): Taint value to match

Taint value to match. Must be empty when operator is "Exists".

machineName (string, optional): Target a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memory (integer, optional): Memory in GiB, mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

ramRatio (integer, optional): RAM = cpus × ramRatio GiB, mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

rdma (object, optional)

devices (array[string], required): RDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.

shmSizeGB (integer, optional): Shared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit

rayVersion (string, optional): Ray version to deploy

Ray version to deploy. Must follow semantic versioning (MAJOR.MINOR.PATCH).

Default: "2.23.0"

Workers[] (array[object], required): Worker pool specs, at least one

[].count (integer, optional): Number of workers in the pool

Number of workers in the pool. Minimum: 0.

Default: 0

[].name (string, required): Worker pool name

Name of the worker pool.

[].rayStartParams (map, optional): Extra ray start CLI parameters

Additional ray start CLI parameters for this pool. See the Ray CLI reference.

[].resources (object, required)

accelerators (object, optional)

gpu.count (integer, optional): Number of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.product (string, required when count > 0): GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.type (string, optional): GPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"

tpu.topology (string, required): TPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCount (integer, optional): TPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions.

Default: 0

tpu.type (string, required): TPU accelerator type, GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.

cpuLimitRatio (integer, optional): CPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2

cpus (integer, optional): CPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1

extraSelectors (map, optional): Extra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerations (array[object], optional): Pod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled onto nodes protected by matching taints.

[].effect (string, optional): Taint effect (NoSchedule, PreferNoSchedule, or NoExecute)

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].key (string, required): Taint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operator (string, optional): Operator (Equal or Exists)

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSeconds (integer, optional): Seconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].value (string, optional): Taint value to match

Taint value to match. Must be empty when operator is "Exists".

machineName (string, optional): Target a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memory (integer, optional): Memory in GiB, mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

ramRatio (integer, optional): RAM = cpus × ramRatio GiB, mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one of the two must be set. Certain GPU products enforce min/max RAM constraints.

rdma (object, optional)

devices (array[string], required): RDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.

shmSizeGB (integer, optional): Shared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit

See the PyTorch framework page for more info.

Example spec.types section:

types:
  worker:
    count: 4
    resources:
      cpus: 12
      memory: 32
      accelerators:
        gpu:
          count: 1
          product: Tesla-V100-SXM3-32GB

workerobjectrequiredSingle homogeneous worker pool

countintegeroptionalNumber of pods

Number of pods. Minimum: 0.

Default: 1

resourcesobjectrequired

acceleratorsobjectoptional

gpu.countintegeroptionalNumber of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.productstringrequired when count > 0GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.typestringoptionalGPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"

tpu.topologystringrequiredTPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCountintegeroptionalTPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = (product of the topology dimensions) / tpuChipsCount.

Default: 0

tpu.typestringrequiredTPU accelerator type — GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
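The three tpu fields can be sketched together as follows. The combination shown is an assumption chosen to make the VM arithmetic easy to follow; whether it is available depends on your GKE cluster, and the nesting under accelerators mirrors the gpu block in the example above.

```yaml
resources:
  accelerators:
    tpu:
      type: tpu-v5-lite-podslice
      topology: "4x4"     # 4 × 4 = 16 chips in the slice
      tpuChipsCount: 4    # chips per pod (per VM), so VMs = 16 / 4 = 4
```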

cpuLimitRatiointegeroptionalCPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2
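A short worked example of the limit calculation, with illustrative numbers:

```yaml
resources:
  cpus: 4
  cpuLimitRatio: 1   # CPU limit = 4 × 1 = 4 cores, no bursting
  memory: 16
```

Leaving cpuLimitRatio at its default of 2 would instead give a limit of 4 × 2 = 8 cores.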

cpusintegeroptionalCPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1

extraSelectorsmapoptionalExtra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerationsarray[object]optionalPod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled on nodes protected by matching taints.

[].effectstringoptionalTaint effect: NoSchedule, PreferNoSchedule, NoExecute

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].keystringrequiredTaint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operatorstringoptionalOperator: Equal or Exists

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSecondsintegeroptionalSeconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].valuestringoptionalTaint value to match

Taint value to match. Must be empty when operator is "Exists".

machineNamestringoptionalTarget a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memoryintegeroptionalMemory in GiB — mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio — exactly one must be set. Certain GPU products enforce min/max RAM constraints.

ramRatiointegeroptionalRAM = cpus × ramRatio GiB — mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory — exactly one must be set. Certain GPU products enforce min/max RAM constraints.

rdmaobjectoptional

devicesarray[string]requiredRDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
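A minimal sketch of an rdma block requesting the two devices listed for both GPU and CPU nodes. Its placement directly under resources is assumed from the field order on this page.

```yaml
resources:
  rdma:
    devices:
      - sriov_a
      - sriov_b
```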

shmSizeGBintegeroptionalShared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit

See the XGBoost framework page for more info.

Example spec.types section:

types:
  worker:
    count: 3
    resources:
      cpus: 8
      memory: 16

workerobjectrequiredSingle homogeneous worker pool

countintegeroptionalNumber of pods

Number of pods. Minimum: 0.

Default: 1

resourcesobjectrequired

acceleratorsobjectoptional

gpu.countintegeroptionalNumber of GPUs

Number of GPUs to request. Certain products enforce per-product min/max counts.

Default: 0

gpu.productstringrequired when count > 0GPU product name

Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.

gpu.typestringoptionalGPU slice type

GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.

Default: "gpu"
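To illustrate how count, product, and type fit together, the sketch below requests a single MIG slice. It assumes the chosen product allows MIG slices on your cluster, which this page does not confirm for any specific product.

```yaml
resources:
  accelerators:
    gpu:
      count: 1
      product: NVIDIA-H100-80GB-HBM3
      type: mig-3g.40gb   # assumption: this product permits MIG slice types
```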

tpu.topologystringrequiredTPU topology string, e.g. 2x2

TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.

tpu.tpuChipsCountintegeroptionalTPU chips per Kubernetes pod

Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = (product of the topology dimensions) / tpuChipsCount.

Default: 0

tpu.typestringrequiredTPU accelerator type — GCP GKE only

TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.

cpuLimitRatiointegeroptionalCPU limit multiplier (1 or 2)

CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.

Setting it to 1 prevents CPU bursting.

A cpuLimitRatio of 2 sets the CPU limit to 200% of the request, so the container may burst up to 2× the requested CPU.

Default: 2

cpusintegeroptionalCPU cores requested

Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.

Default: 1

extraSelectorsmapoptionalExtra node selector labels

Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.

extraSelectors:
  karpenter.sh/capacity-type: spot   # AWS EKS / Azure AKS

extraTolerationsarray[object]optionalPod tolerations

Pod tolerations applied to experiment pods. They allow the pods to be scheduled on nodes protected by matching taints.

[].effectstringoptionalTaint effect: NoSchedule, PreferNoSchedule, NoExecute

Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".

[].keystringrequiredTaint key to tolerate

Taint key. An empty string matches all taint keys (requires operator: Exists).

[].operatorstringoptionalOperator: Equal or Exists

Relationship operator. "Exists" matches any value; "Equal" matches a specific value.

Default: "Equal"

[].tolerationSecondsintegeroptionalSeconds to tolerate NoExecute taint

For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.

[].valuestringoptionalTaint value to match

Taint value to match. Must be empty when operator is "Exists".
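The wildcard rules described above (empty key with operator Exists, empty effect) can be sketched as a single catch-all toleration. This is deliberately broad and shown only to illustrate the matching rules; placement under resources is assumed from the field order on this page.

```yaml
resources:
  extraTolerations:
    # Empty key + Exists matches every taint key; effect and value are
    # omitted, so all effects are matched and no value comparison is made.
    - key: ""
      operator: Exists
```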

machineNamestringoptionalTarget a specific node hostname

Sets the kubernetes.io/hostname node selector to target a specific machine.

memoryintegeroptionalMemory in GiB — mutually exclusive with ramRatio

Memory in GiB, specified directly. Mutually exclusive with ramRatio — exactly one must be set. Certain GPU products enforce min/max RAM constraints.

ramRatiointegeroptionalRAM = cpus × ramRatio GiB — mutually exclusive with memory

RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory — exactly one must be set. Certain GPU products enforce min/max RAM constraints.

rdmaobjectoptional

devicesarray[string]requiredRDMA device names to request

List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.

shmSizeGBintegeroptionalShared memory size at /dev/shm

Shared memory size in GiB, mounted at /dev/shm.

Value	Effect
`> 0`	shm volume created of that size; memory limit increased by `shmSizeGB`
`0`	No shm volume created
omitted	shm volume auto-sized to 10% of memory limit