Manifest Reference
Complete field-by-field reference for manifest.yaml.
See the Manifest Examples page for complete, per-operator manifest.yaml files.
kind (string, required)
Always AIchorManifest.
apiVersion (string, required)
Schema version. Set to 0.2.3 for new manifests.
name (string, optional)
Human-readable name for this manifest.
builder (object, required)
buildArgs (map, optional)
Key/value map of Docker build arguments (passed as --build-arg). Values may be strings or booleans.
buildArgs:
  PYTHON_VERSION: "3.11"
  USE_CUDA: true
context (string, optional)
Docker build context path.
Default: "."
dockerfile (string, optional)
Path to the Dockerfile, relative to the repository root (e.g. ./Dockerfile).
image (string, optional)
Name of the Docker image to build and use. Must match spec.image.
registry (string, optional)
Registry prefix used to push and pull the image (e.g. europe-west4-docker.pkg.dev/my-project/my-repo).
target (string, optional)
Target stage for multi-stage Dockerfiles (passed as --target).
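As an illustration of how the builder fields fit together, the fragment below sets each of them once. It is a sketch only: the image name, registry path, and stage name are hypothetical placeholders, and only the field names come from this reference.

```yaml
# Hypothetical values throughout; only the field names are taken from this page.
builder:
  image: my-experiment          # must match spec.image
  registry: europe-west4-docker.pkg.dev/my-project/my-repo
  dockerfile: ./Dockerfile      # relative to the repository root
  context: "."                  # the default build context
  target: runtime               # hypothetical stage name in a multi-stage Dockerfile
  buildArgs:
    PYTHON_VERSION: "3.11"
    USE_CUDA: true
spec:
  image: my-experiment          # same value as builder.image
```

The spec block is deliberately incomplete here; it only shows the image value that builder.image has to match.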
clone (object, optional)
depth (integer, optional)
Number of commits to fetch. A depth of 1 produces a shallow clone; increase it when git log history is needed. Minimum: 1.
Default: 1
includeDotGit (boolean, optional)
When true, the .git directory is kept in the cloned repository. Required for submodule initialisation.
Default: false
submodules (string, optional)
Controls submodule initialisation. Only processed when includeDotGit is true.
| Value | Behaviour |
|---|---|
| "false" | No submodules cloned |
| "true" | Non-recursive clone |
| "recursive" | Recursive clone |
Default: "false"
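The interaction between includeDotGit and submodules can be sketched as follows. The fragment assumes that clone is a top-level key of the manifest (inferred from the field order on this page, not stated explicitly), and the depth value is an arbitrary example.

```yaml
# Assumption: clone sits at the top level of manifest.yaml, next to builder and spec.
clone:
  depth: 50                 # fetch the last 50 commits instead of the default 1
  includeDotGit: true       # keep .git; submodules are only processed when this is true
  submodules: "recursive"   # one of "false", "true", "recursive"
```

With includeDotGit left at its default of false, the submodules setting would not be processed.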
spec (object, required)
activeDeadlineSeconds (integer, optional)
Maximum running duration for the experiment in seconds. Superseded by timeout when both are set.
command (string, required)
The command or script to execute inside the container.
spec:
  command: "python3 -u main.py"
debug (object)
jupyter (object, optional)
Starts a Jupyter server on the Ray head node. Only available when operator is kuberay.
jupyter.enabled (boolean, optional)
Whether to start the Jupyter server.
Default: false
jupyter.path (string, optional)
Path to the Jupyter binary. When omitted, the binary is resolved from $PATH.
vscode (object, optional)
Attaches a VSCode tunnel to the experiment. Available for all operators.
vscode.enabled (boolean, optional)
Whether to start the VSCode tunnel.
Default: false
vscode.path (string, optional)
Path to the VSCode CLI binary. When omitted, the binary is resolved from $PATH.
vscode.provider (string, optional)
VSCode tunnel authentication provider. Values: "github", "microsoft".
Default: "github"
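A minimal sketch of a debug block that enables both tools is shown below. It assumes that debug is nested under spec, which is inferred from the field order on this page rather than stated outright.

```yaml
# Assumption: debug is a child of spec.
spec:
  operator: kuberay          # jupyter is only available with kuberay
  debug:
    jupyter:
      enabled: true          # path omitted, so the binary is resolved from $PATH
    vscode:
      enabled: true
      provider: "microsoft"  # the default is "github"
```

On any operator other than kuberay, only the vscode part of this block applies.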
engineName (string, optional)
Name of the engine to run the experiment on. When omitted, the project's default engine is used. Useful in multi-engine projects.
env (map, optional)
Key/value map of environment variables scoped to this experiment. Each entry is injected into the containers as a plain literal value, which makes env suited to per-run values such as hyperparameters without touching the shared project configuration.
env:
  LEARNING_RATE: "0.001"
  BATCH_SIZE: "32"
  NUM_EPOCHS: "100"
When the same key is also defined as a project-level variable, the spec.env value takes precedence. Predefined variables (the AICHOR_* variables and other protected names) cannot be set here; attempting to do so fails validation when the experiment is submitted. See Environment Variables for the full list.
extraLabels (map, optional)
Additional Kubernetes labels applied to the experiment pods.
extraLabels:
  team: research
  project: my-project
gracefulTermination (object)
Runs a command when a pod is being terminated so that the experiment can shut down cleanly, most commonly to save a checkpoint before a spot instance is reclaimed.
The command is installed as a Kubernetes preStop hook. On termination (a spot-instance reclaim, a cancel, or any pod deletion) it runs first; once it returns, the container receives SIGTERM. The command therefore does not need to trap signals itself.
When this block is present, both shutdownCommand and terminationGracePeriodSeconds are required, and the grace period must be greater than 0. When the block is omitted entirely, no shutdown command runs and Kubernetes applies its own default grace period of 30 seconds.
gracefulTermination:
  shutdownCommand: ["python", "save_checkpoint.py"]
  terminationGracePeriodSeconds: 60
Cloud providers give only a short interruption notice for spot/preemptible instances; AWS, for example, gives about 120 seconds. On a spot reclaim the node is drained within that window, so a terminationGracePeriodSeconds larger than the provider's notice will not be fully honoured. Keep the shutdown command well within the notice window (about 120 seconds on AWS) so that checkpoints complete reliably. On a normal cancel or deletion the full grace period is honoured. See your cloud provider's documentation for the exact notice period.
shutdownCommand (array[string], required)
Command to execute inside the container before termination (e.g. to save a checkpoint). Provided as an argument vector: the executable followed by its arguments.
shutdownCommand: ["python", "save_checkpoint.py"]
terminationGracePeriodSeconds (integer, required)
How long, in seconds, Kubernetes waits for the shutdown sequence (the shutdownCommand followed by SIGTERM handling) to complete before force-killing the pod with SIGKILL. Must be greater than 0.
The countdown starts when termination begins, so this value must cover the full time the shutdownCommand needs to run.
image (string, required)
The Docker image to run. Must match builder.image.
operator (string, required)
The operator that manages the workload.
| Value | Description |
|---|---|
| kuberay | Ray cluster managed by KubeRay |
| jax | JAX distributed training |
| pytorch | PyTorch distributed training |
| xgboost | XGBoost distributed training |
| jobset | Generic Kubernetes JobSet |
restartPolicy (object)
backoffLimit (integer, optional)
Number of restarts allowed before the experiment is marked as failed. Minimum: 0.
On jobset and kuberay, this covers both eviction (spot preemption) and software failures. On other operators it covers only software failures.
Default: 0
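For instance, a jobset experiment that should survive a few spot preemptions might set the limit as sketched below. The value 3 is an arbitrary illustration, not a recommendation from this reference.

```yaml
spec:
  operator: jobset
  restartPolicy:
    backoffLimit: 3   # on jobset, evictions and software failures both count towards this limit
```

With the default of 0, the first eviction or failure marks the experiment as failed.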
securityContext (object)
spec:
  securityContext:
    runAsUser: 1000   # run as a non-root user...
    runAsGroup: 1000
    fsGroup: 1000     # ...and give that group ownership of mounted volumes
  storage:
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 50
      accessMode: ReadWriteMany
      storageClass: standard-rwo
fsGroup (integer, optional)
Sets a supplementary group ID (GID) that owns all volumes mounted into the pod. Kubernetes changes the group ownership of mounted volumes to this GID and adds it to every process in the pod, so files created on those volumes are group-owned and group-writable by this GID.
This is mainly useful when the container does not run as root. By default a mounted volume (such as a PVC or a shared volume) is owned by root, so a non-root process cannot write to it. Setting fsGroup makes the volume writable by the chosen group, and lets several containers in the same pod share the same files.
In the example at the top of this section, the process runs as UID 1000 and can read and write /mnt/shared because the volume is group-owned by GID 1000. Without fsGroup, that same non-root process would be denied write access to the volume.
perfmon (boolean, optional)
Grants CAP_PERFMON to allow performance-monitoring syscalls. Required by profilers such as perf.
Default: false
ptrace (boolean, optional)
Grants CAP_SYS_PTRACE to allow process tracing. Required by debuggers and profilers such as gdb and nsys.
This capability is gated by an organisation-level flag, which is disabled by default. When the organisation does not have it enabled, the setting is silently ignored (no CAP_SYS_PTRACE is granted and no error is raised). Contact the AIchor team to have it enabled for an organisation.
Default: false
runAsGroup (integer, optional)
Sets the primary group ID (GID) that every process in the container runs as, overriding the group defined in the image. This controls the group ownership of files the processes create and which group-restricted files they can read. It is the group counterpart to runAsUser and is commonly paired with it (and with fsGroup) to run as a specific non-root identity.
runAsUser (integer, optional)
Sets the user ID (UID) that every process in the container runs as, overriding the user defined in the image's Dockerfile. Use it to run as a specific non-root user.
A value of 0 means root. When omitted, the user baked into the image is used. To write to mounted volumes as a non-root user, combine runAsUser with fsGroup so the volumes are group-writable.
storage (object)
Attaches volumes to every pod in the experiment. Use attachExistingPVCs to mount Persistent Volume Claims that already exist in the project, and sharedVolume to have AIchor provision a temporary volume shared by all pods for the duration of the run.
spec:
  storage:
    attachExistingPVCs:
      - name: datasets-pvc
        mountPoint: /mnt/datasets
    sharedVolume:
      mountPoint: /mnt/shared
      sizeGB: 100
      accessMode: ReadWriteMany
      storageClass: standard-rwo
attachExistingPVCs (array[object], optional)
List of existing Persistent Volume Claims to attach to all pods.
[].mountPoint (string, required)
Mount path inside the container.
[].name (string, required)
Name of the PVC to attach.
sharedVolume (object, optional)
AIchor creates a temporary shared volume, mounts it on all pods, and deletes it when the experiment ends.
accessMode (string, required)
Kubernetes PV access mode, e.g. ReadWriteOnce or ReadWriteMany.
mountPoint (string, required)
Mount path inside the container (e.g. /mnt/shared).
sizeGB (integer, required)
Volume size in GiB.
Each cloud provider applies a minimum volume size per storage class. A request below that floor is rejected or silently rounded up by the CSI driver, and billing is based on the provisioned size, not the requested size. The value of sizeGB must respect the minimum of the chosen storageClass.
Common minimums:
| Storage class | Provider | Minimum sizeGB |
|---|---|---|
| gp2, gp3 (EBS) | AWS | 1 |
| io1, io2 (EBS) | AWS | 4 |
| pd-standard, pd-balanced, pd-ssd (Persistent Disk) | GCP | 10 |
| standard / premium (Filestore, NFS) | GCP | 1024 / 2560 |
For example, requesting sizeGB: 1 against a pd-balanced storage class on GCP still provisions a 10 GiB disk, and a Filestore-backed class is billed for at least 1 TiB regardless of the requested value.
storageClass (string, required)
Kubernetes storage class name (e.g. gp2, standard-rwo, ceph-rbd, cephfs, longhorn).
tensorboard (object)
enabled (boolean, required)
Whether to run a TensorBoard sidecar for the experiment. When enabled, the TensorBoard dashboard is accessible from the experiment's page.
usePVC (object, optional)
When set, TensorBoard reads logs from a PVC instead of object storage. Requires spec.storage.attachExistingPVCs to be configured.
usePVC.enabled (boolean, required)
Whether to mount a PVC for TensorBoard log storage.
usePVC.name (string, optional)
Name of a PVC from spec.storage.attachExistingPVCs. When omitted, the first PVC in the list is used.
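The link between usePVC and attachExistingPVCs can be sketched as below. The PVC named logs-pvc and its mount path are hypothetical; datasets-pvc is reused from the storage example on this page.

```yaml
spec:
  storage:
    attachExistingPVCs:
      - name: datasets-pvc
        mountPoint: /mnt/datasets
      - name: logs-pvc            # hypothetical PVC holding the TensorBoard logs
        mountPoint: /mnt/logs
  tensorboard:
    enabled: true
    usePVC:
      enabled: true
      name: logs-pvc              # without name, the first entry (datasets-pvc) would be used
```

Naming the PVC explicitly avoids depending on the order of the attachExistingPVCs list.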
timeout (string, optional)
Maximum duration as a human-readable string, e.g. "1d 6h". Takes precedence over activeDeadlineSeconds.
Accepts combinations of w (weeks), d (days), h (hours), m (minutes), s (seconds).
timeout: "1w 2d 5h 6m 20s"
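As a worked example of the unit arithmetic, assuming the usual conversions (1w = 7d, 1d = 24h, 1h = 60m, 1m = 60s), the string above amounts to 604800 + 172800 + 18000 + 360 + 20 = 795980 seconds. The fragment below shows a shorter value next to its equivalent in seconds; setting both fields is only for illustration.

```yaml
spec:
  # 1d 6h = 86400 s + 21600 s = 108000 s
  timeout: "1d 6h"
  # The same limit expressed in seconds; superseded here because timeout is also set.
  activeDeadlineSeconds: 108000
```

In practice one of the two fields is enough, since timeout wins whenever both are present.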
types (object, required)
Defines the worker pools. spec.types is required for every operator, and the field names it expects depend on the operator:
- jax, pytorch, xgboost: a single worker pool.
- jobset: named replicated jobs (free-form job keys).
- kuberay: one Head pod plus one or more Workers pools.
For all Kubernetes-based operators (jax, pytorch, xgboost, jobset, kuberay), each pool's resources block accepts the same fields. The per-operator sections that follow list the full schema inline.
For the jax operator, see the Jax framework page for more information.
Example spec.types section:
types:
  worker:
    count: 2
    resources:
      cpus: 16
      memory: 64
      accelerators:
        gpu:
          count: 1
          product: Tesla-V100-SXM3-32GB
worker (object, required)
Single homogeneous worker pool.
count (integer, optional)
Number of pods. Minimum: 0.
Default: 1
resources (object, required)
accelerators (object, optional)
gpu.count (integer, optional)
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.product (string, required when count > 0)
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.type (string, optional)
GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topology (string, required)
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCount (integer, optional)
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions (16 for "4x4").
Default: 0
tpu.type (string, required)
TPU accelerator type (GCP GKE only). Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
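The VM formula can be illustrated with a small TPU request. This is a sketch under two assumptions: that the tpu fields nest under accelerators in the same way as gpu, and that the chosen type, topology, and chip count form a combination the target engine actually offers.

```yaml
# Assumption: tpu is a child of accelerators, mirroring the gpu block.
types:
  worker:
    resources:
      accelerators:
        tpu:
          type: tpu-v5-lite-podslice
          topology: "4x4"      # 4 x 4 = 16 chips in the slice
          tpuChipsCount: 4     # chips per pod, so VMs = 16 / 4 = 4
```

The fragment omits the other resources fields (such as cpus and memory or ramRatio) to keep the arithmetic in focus.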
cpuLimitRatio (integer, optional)
CPU limit multiplier: CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
A value of 1 sets the limit equal to the request, which prevents CPU bursting. A value of 2 sets the limit to 200% of the request, so the container may burst up to 2× the requested CPU.
Default: 2
cpus (integer, optional)
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectors (map, optional)
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
  karpenter.sh/capacity-type: spot  # AWS EKS / Azure AKS
extraTolerations (array[object], optional)
Pod tolerations applied to experiment pods. They allow the pods to be scheduled on tainted nodes.
[].effect (string, optional)
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].key (string, required)
Taint key to tolerate. An empty string matches all taint keys (requires operator: Exists).
[].operator (string, optional)
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSeconds (integer, optional)
For NoExecute taints only: how long, in seconds, to tolerate the taint before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].value (string, optional)
Taint value to match. Must be empty when operator is "Exists".
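The two operator modes can be sketched side by side. The taint keys dedicated and maintenance are hypothetical; real keys depend on how the cluster's nodes are tainted.

```yaml
resources:
  extraTolerations:
    # Tolerate a hypothetical taint dedicated=gpu:NoSchedule.
    - key: dedicated
      operator: "Equal"
      value: gpu
      effect: "NoSchedule"
    # Tolerate any value of a hypothetical maintenance NoExecute taint for 5 minutes.
    - key: maintenance
      operator: "Exists"       # value must be left empty with Exists
      effect: "NoExecute"
      tolerationSeconds: 300
```

The second entry leaves value unset because Exists matches on the key alone.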
machineName (string, optional)
Sets the kubernetes.io/hostname node selector to target a specific machine.
memory (integer, optional)
Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatio (integer, optional)
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
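The two ways of sizing memory can be compared directly. The two YAML documents below are alternatives that request the same amount; the CPU count and ratios are arbitrary example values.

```yaml
# Option A: derive memory from the CPU count.
resources:
  cpus: 16
  ramRatio: 4        # 16 x 4 = 64 GiB
  cpuLimitRatio: 2   # CPU limit = 16 x 2 = 32 cores
---
# Option B: state the same amount directly.
resources:
  cpus: 16
  memory: 64         # GiB
  cpuLimitRatio: 1   # CPU limit = 16 cores, no bursting
```

Setting both memory and ramRatio in the same resources block is not allowed, since exactly one must be set.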
rdma (object, optional)
devices (array[string], required)
List of RDMA device names to request. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
shmSizeGB (integer, optional)
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
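A resources fragment that combines an explicit shared-memory size with RDMA devices might look like the sketch below. The sizes are arbitrary, and the resulting limit in the comment is simply the table's first row applied to these numbers.

```yaml
resources:
  cpus: 16
  memory: 64
  shmSizeGB: 8       # 8 GiB at /dev/shm; memory limit grows by 8 GiB (64 + 8 = 72)
  rdma:
    devices:         # device names listed for CPU nodes on this page
      - sriov_a
      - sriov_b
```

Setting shmSizeGB to 0 instead would skip the shm volume altogether.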
For the jobset operator, see the JobSet framework page for more information.
Example: job pools use free-form names that match ^[a-z0-9-]+$ (e.g. master, worker-cpu):
types:
  master:
    count: 1
    completions: 1
    parallelisms: 1
    resources:
      cpus: 16
      ramRatio: 4
  worker-cpu:
    count: 2
    completions: 1
    parallelisms: 1
    resources:
      cpus: 4
      ramRatio: 2
<job-name> (map, required)
Replicated job specs keyed by job name.
completions (integer, optional)
Number of pods that must complete successfully per replica. Minimum: 0.
Default: 1
count (integer, optional)
Number of replicas of this job. Minimum: 0.
Default: 1
parallelisms (integer, optional)
Number of pods that run in parallel per replica. Minimum: 0.
Default: 1
resources (object, required)
accelerators (object, optional)
gpu.count (integer, optional)
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.product (string, required when count > 0)
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.type (string, optional)
GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topology (string, required)
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCount (integer, optional)
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions (16 for "4x4").
Default: 0
tpu.type (string, required)
TPU accelerator type (GCP GKE only). Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
cpuLimitRatio (integer, optional)
CPU limit multiplier: CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
A value of 1 sets the limit equal to the request, which prevents CPU bursting. A value of 2 sets the limit to 200% of the request, so the container may burst up to 2× the requested CPU.
Default: 2
cpus (integer, optional)
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectors (map, optional)
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
  karpenter.sh/capacity-type: spot  # AWS EKS / Azure AKS
extraTolerations (array[object], optional)
Pod tolerations applied to experiment pods. They allow the pods to be scheduled on tainted nodes.
[].effect (string, optional)
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].key (string, required)
Taint key to tolerate. An empty string matches all taint keys (requires operator: Exists).
[].operator (string, optional)
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSeconds (integer, optional)
For NoExecute taints only: how long, in seconds, to tolerate the taint before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].value (string, optional)
Taint value to match. Must be empty when operator is "Exists".
machineName (string, optional)
Sets the kubernetes.io/hostname node selector to target a specific machine.
memory (integer, optional)
Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatio (integer, optional)
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
rdma (object, optional)
devices (array[string], required)
List of RDMA device names to request. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
shmSizeGB (integer, optional)
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
For the kuberay operator, see the KubeRay framework page for more information.
Example: one Head pod plus one or more Workers pools:
types:
  Head:
    resources:
      cpus: 8
      ramRatio: 2
    rayStartParams:
      object-store-memory: "1000000000"  # cap the Ray object store at ~1 GB
  Workers:
    - name: gpu-workers
      count: 2
      resources:
        cpus: 16
        memory: 64
        accelerators:
          gpu:
            count: 2
            product: Tesla-V100-SXM3-32GB
In addition to Head and Workers, kuberay accepts spec.rayVersion.
Head (object, required)
Single Ray head node. It has no count field.
rayStartParams (map, optional)
Additional ray start CLI parameters. See the Ray CLI reference.
resources (object, required)
accelerators (object, optional)
gpu.count (integer, optional)
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.product (string, required when count > 0)
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.type (string, optional)
GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topology (string, required)
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCount (integer, optional)
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions (16 for "4x4").
Default: 0
tpu.type (string, required)
TPU accelerator type (GCP GKE only). Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
cpuLimitRatio (integer, optional)
CPU limit multiplier: CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
A value of 1 sets the limit equal to the request, which prevents CPU bursting. A value of 2 sets the limit to 200% of the request, so the container may burst up to 2× the requested CPU.
Default: 2
cpus (integer, optional)
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectors (map, optional)
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
  karpenter.sh/capacity-type: spot  # AWS EKS / Azure AKS
extraTolerations (array[object], optional)
Pod tolerations applied to experiment pods. They allow the pods to be scheduled on tainted nodes.
[].effect (string, optional)
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].key (string, required)
Taint key to tolerate. An empty string matches all taint keys (requires operator: Exists).
[].operator (string, optional)
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSeconds (integer, optional)
For NoExecute taints only: how long, in seconds, to tolerate the taint before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].value (string, optional)
Taint value to match. Must be empty when operator is "Exists".
machineName (string, optional)
Sets the kubernetes.io/hostname node selector to target a specific machine.
memory (integer, optional)
Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatio (integer, optional)
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
rdma (object, optional)
devices (array[string], required)
List of RDMA device names to request. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
shmSizeGB (integer, optional)
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
rayVersion (string, optional)
Ray version to deploy. Must follow semantic versioning (MAJOR.MINOR.PATCH).
Default: "2.23.0"
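For illustration, pinning the version explicitly could look like the fragment below. It places rayVersion directly under spec, as the spec.rayVersion path on this page suggests, and simply repeats the documented default.

```yaml
spec:
  operator: kuberay
  rayVersion: "2.23.0"   # quoted so that YAML keeps it as a string
```

Quoting matters less for a three-part version than for a two-part one, but it keeps the value unambiguous.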
Workers[] (array[object], required)
Worker pool specs. At least one pool is required.
[].count (integer, optional)
Number of workers in the pool. Minimum: 0.
Default: 0
[].name (string, required)
Name of the worker pool.
[].rayStartParams (map, optional)
Additional ray start CLI parameters for this pool. See the Ray CLI reference.
[].resources (object, required)
accelerators (object, optional)
gpu.count (integer, optional)
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.product (string, required when count > 0)
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.type (string, optional)
GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topology (string, required)
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCount (integer, optional)
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs for multi-host slices: VMs = topology_product / tpuChipsCount, where topology_product is the product of the topology dimensions (16 for "4x4").
Default: 0
tpu.type (string, required)
TPU accelerator type (GCP GKE only). Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
cpuLimitRatio (integer, optional)
CPU limit multiplier: CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
A value of 1 sets the limit equal to the request, which prevents CPU bursting. A value of 2 sets the limit to 200% of the request, so the container may burst up to 2× the requested CPU.
Default: 2
cpus (integer, optional)
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectors (map, optional)
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
  karpenter.sh/capacity-type: spot  # AWS EKS / Azure AKS
extraTolerations (array[object], optional)
Pod tolerations applied to experiment pods. They allow the pods to be scheduled on tainted nodes.
[].effect (string, optional)
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].key (string, required)
Taint key to tolerate. An empty string matches all taint keys (requires operator: Exists).
[].operator (string, optional)
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSeconds (integer, optional)
For NoExecute taints only: how long, in seconds, to tolerate the taint before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].value (string, optional)
Taint value to match. Must be empty when operator is "Exists".
machineName (string, optional)
Sets the kubernetes.io/hostname node selector to target a specific machine.
memory (integer, optional)
Memory in GiB, specified directly. Mutually exclusive with ramRatio: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatio (integer, optional)
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory: exactly one must be set. Certain GPU products enforce min/max RAM constraints.
rdma (object, optional)
devices (array[string], required)
List of RDMA device names to request. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
shmSizeGB (integer, optional)
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
For the pytorch operator, see the PyTorch framework page for more information.
Example spec.types section:
types:
  worker:
    count: 4
    resources:
      cpus: 12
      memory: 32
      accelerators:
        gpu:
          count: 1
          product: Tesla-V100-SXM3-32GB
workerobjectrequiredSingle homogeneous worker pool
countintegeroptionalNumber of pods
Number of pods. Minimum: 0.
Default: 1
resourcesobjectrequired
acceleratorsobjectoptional
gpu.countintegeroptionalNumber of GPUs
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.productstringrequired when count > 0GPU product name
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.typestringoptionalGPU slice type
GPU slice type. Values: "gpu" "mig-1g.10gb" "mig-3g.20gb" "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topologystringrequiredTPU topology string, e.g. 2x2
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCountintegeroptionalTPU chips per Kubernetes pod
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute VMs for multi-host slices: VMs = topology_product / tpuChipsCount.
Default: 0
tpu.typestringrequiredTPU accelerator type — GCP GKE only
TPU accelerator type. Values: tpu-v4-podslice tpu-v5-lite-podslice tpu-v5p-slice tpu-v6e-slice
cpuLimitRatiointegeroptionalCPU limit multiplier (1 or 2)
CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
Setting to 1 prevents CPU bursting.
cpuLimitRatio of 2 means the CPU limit is 200% — the container is allowed to burst up to 2× the requested CPU.
Default: 2
cpusintegeroptionalCPU cores requested
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectorsmapoptionalExtra node selector labels
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
karpenter.sh/capacity-type: spot # AWS EKS / Azure AKS
extraTolerationsarray[object]optionalPod tolerations
Pod tolerations applied to experiment pods. Tolerations allow experiment pods to be scheduled on tainted nodes.
[].effectstringoptionalTaint effect: NoSchedule, PreferNoSchedule, NoExecute
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].keystringrequiredTaint key to tolerate
Taint key. An empty string matches all taint keys (requires operator: Exists).
[].operatorstringoptionalOperator: Equal or Exists
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSecondsintegeroptionalSeconds to tolerate NoExecute taint
For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].valuestringoptionalTaint value to match
Taint value to match. Must be empty when operator is "Exists".
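A sketch of a toleration that matches one specific taint. The taint key and value (dedicated=experiments) are hypothetical; replace them with the taint actually set on your nodes.

```yaml
resources:
  cpus: 4
  memory: 16
  extraTolerations:
    - key: dedicated          # hypothetical taint key
      operator: Equal         # the default; matches only the value below
      value: experiments      # hypothetical taint value
      effect: NoSchedule
```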
machineNamestringoptionalTarget a specific node hostname
Sets the kubernetes.io/hostname node selector to target a specific machine.
memoryintegeroptionalMemory in GiB — mutually exclusive with ramRatio
Memory in GiB, specified directly. Mutually exclusive with ramRatio — exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatiointegeroptionalRAM = cpus × ramRatio GiB — mutually exclusive with memory
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory — exactly one must be set. Certain GPU products enforce min/max RAM constraints.
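As a worked example of the ramRatio formula, assuming illustrative values:

```yaml
resources:
  cpus: 8
  ramRatio: 4   # total memory = 8 × 4 = 32 GiB
  # memory is omitted: memory and ramRatio are mutually exclusive.
```

Setting memory: 32 instead (and dropping ramRatio) would request the same amount directly.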
rdmaobjectoptional
devicesarray[string]requiredRDMA device names to request
List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
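A possible rdma block, assuming it sits directly under resources like the other fields in this list. The device names come from the description above; the cpus and memory values are illustrative.

```yaml
resources:
  cpus: 16
  memory: 64
  rdma:
    devices:
      # A CPU node exposes only these two devices; GPU nodes also offer sriov_c and sriov_d.
      - sriov_a
      - sriov_b
```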
shmSizeGBintegeroptionalShared memory size at /dev/shm
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
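A sketch of the "> 0" case from the table. The resulting limit in the comment assumes the base memory limit equals the memory value, which this page does not state explicitly.

```yaml
resources:
  cpus: 8
  memory: 32
  shmSizeGB: 8   # 8 GiB volume at /dev/shm; memory limit becomes 32 + 8 = 40 GiB
```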
See the XGBoost framework page for more info.
Example spec.types section:
types:
  worker:
    count: 3
    resources:
      cpus: 8
      memory: 16
workerobjectrequiredSingle homogeneous worker pool
countintegeroptionalNumber of pods
Number of pods. Minimum: 0.
Default: 1
resourcesobjectrequired
acceleratorsobjectoptional
gpu.countintegeroptionalNumber of GPUs
Number of GPUs to request. Certain products enforce per-product min/max counts.
Default: 0
gpu.productstringrequired when count > 0GPU product name
Sets the nvidia.com/gpu.product node selector (e.g. Tesla-V100-SXM3-32GB, NVIDIA-H100-80GB-HBM3). The available products depend on the engine and cloud provider.
gpu.typestringoptionalGPU slice type
GPU slice type. Values: "gpu", "mig-1g.10gb", "mig-3g.20gb", "mig-3g.40gb". MIG types are only supported on products that explicitly allow them.
Default: "gpu"
tpu.topologystringrequiredTPU topology string, e.g. 2x2
TPU topology string (e.g. "2x2", "4x4"). See the GCP TPU docs.
tpu.tpuChipsCountintegeroptionalTPU chips per Kubernetes pod
Number of TPU chips per Kubernetes pod (chips per VM). Used to compute the number of VMs in a multi-host slice: VMs = (product of the topology dimensions) / tpuChipsCount.
Default: 0
tpu.typestringrequiredTPU accelerator type — GCP GKE only
TPU accelerator type. Values: tpu-v4-podslice, tpu-v5-lite-podslice, tpu-v5p-slice, tpu-v6e-slice.
cpuLimitRatiointegeroptionalCPU limit multiplier (1 or 2)
CPU limit = cpus × cpuLimitRatio. Allowed values: 1, 2.
A value of 1 sets the limit equal to the request, which prevents CPU bursting.
A value of 2 sets the limit to 200% of the request, so the container may burst up to 2× the requested CPU.
Default: 2
cpusintegeroptionalCPU cores requested
Number of CPU cores requested. Minimum: 1. Certain GPU products enforce per-product min/max CPU counts.
Default: 1
extraSelectorsmapoptionalExtra node selector labels
Additional Kubernetes node selector labels applied to experiment pods. Commonly used to request spot or on-demand instances.
extraSelectors:
  karpenter.sh/capacity-type: spot # AWS EKS / Azure AKS
extraTolerationsarray[object]optionalPod tolerations
Pod tolerations applied to experiment pods. Tolerations allow experiment pods to be scheduled on tainted nodes.
[].effectstringoptionalTaint effect: NoSchedule, PreferNoSchedule, NoExecute
Taint effect to match. Empty matches all effects. Values: "NoSchedule", "PreferNoSchedule", "NoExecute".
[].keystringrequiredTaint key to tolerate
Taint key. An empty string matches all taint keys (requires operator: Exists).
[].operatorstringoptionalOperator: Equal or Exists
Relationship operator. "Exists" matches any value; "Equal" matches a specific value.
Default: "Equal"
[].tolerationSecondsintegeroptionalSeconds to tolerate NoExecute taint
For NoExecute taints only: how long to tolerate before eviction. 0 evicts immediately. Omit to tolerate indefinitely.
[].valuestringoptionalTaint value to match
Taint value to match. Must be empty when operator is "Exists".
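A sketch of an Exists toleration with a bounded NoExecute grace period. The taint key shown is the standard Kubernetes unreachable-node taint, chosen here only as an illustration; the 300-second window is an arbitrary value.

```yaml
resources:
  cpus: 8
  memory: 16
  extraTolerations:
    - key: node.kubernetes.io/unreachable
      operator: Exists          # matches any taint value, so value is left out
      effect: NoExecute
      tolerationSeconds: 300    # pod is evicted 300 s after the taint appears
```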
machineNamestringoptionalTarget a specific node hostname
Sets the kubernetes.io/hostname node selector to target a specific machine.
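For example, pinning the worker pods to one node might look like the fragment below. The hostname is hypothetical; use the value of the kubernetes.io/hostname label on the target node.

```yaml
resources:
  cpus: 8
  memory: 16
  machineName: gpu-node-01   # hypothetical hostname
```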
memoryintegeroptionalMemory in GiB — mutually exclusive with ramRatio
Memory in GiB, specified directly. Mutually exclusive with ramRatio — exactly one must be set. Certain GPU products enforce min/max RAM constraints.
ramRatiointegeroptionalRAM = cpus × ramRatio GiB — mutually exclusive with memory
RAM multiplier: total memory = cpus × ramRatio GiB. Mutually exclusive with memory — exactly one must be set. Certain GPU products enforce min/max RAM constraints.
rdmaobjectoptional
devicesarray[string]requiredRDMA device names to request
List of RDMA device names. GPU nodes: sriov_a, sriov_b, sriov_c, sriov_d. CPU nodes: sriov_a, sriov_b.
shmSizeGBintegeroptionalShared memory size at /dev/shm
Shared memory size in GiB, mounted at /dev/shm.
| Value | Effect |
|---|---|
| > 0 | shm volume created of that size; memory limit increased by shmSizeGB |
| 0 | No shm volume created |
| omitted | shm volume auto-sized to 10% of memory limit |
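A short sketch of disabling the shared memory volume, using the same illustrative resources as the example at the top of this section:

```yaml
resources:
  cpus: 8
  memory: 16
  shmSizeGB: 0   # no /dev/shm volume; omit the field to get the auto-sized 10% volume instead
```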