Spot instances and recovering from shutdown
Spot (or preemptible) instances reduce compute cost while giving access to powerful hardware, but the cloud provider can reclaim the instance at any time and disrupt the workload. The selector configuration varies by cloud provider.
When the instance is reclaimed, the pods are evicted. See Recovering from eviction for how the operators restart them automatically.
Requesting a spot instance
AWS EKS and Azure AKS
Spot capacity is requested from the manifest:
spec:
  types:
    worker:
      resources:
        extraSelectors:
          karpenter.sh/capacity-type: spot
By default, experiments request karpenter.sh/capacity-type: on-demand.
GCP GKE
On GKE, both a node selector and a toleration are required:
spec:
  types:
    worker:
      resources:
        extraSelectors:
          cloud.google.com/gke-spot: "true"
          cloud.google.com/gke-provisioning: spot
        extraTolerations:
          - key: "cloud.google.com/gke-spot"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"
Checkpointing
When using spot instances, the most reliable way to preserve progress is to save checkpoints periodically to an external storage backend (such as an AIchor S3 bucket). On startup, the training code should detect the latest available checkpoint and resume from it, so that a reclaim costs at most the work done since the last save.
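The save-and-resume pattern described above can be sketched as follows. This is a minimal illustration, not AIchor's implementation: a local directory stands in for the external storage backend, and the file naming scheme (`ckpt_<step>.json`), the helper names, and the toy `train` loop are all hypothetical.

```python
import json
import os
import re
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical naming scheme: ckpt_00000042.json holds the state after step 42.
CKPT_PATTERN = re.compile(r"^ckpt_(\d+)\.json$")


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write to a temporary file, then rename, so a reclaim mid-write
    never leaves a truncated file under a valid checkpoint name."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"ckpt_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=str(ckpt_dir), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)  # atomic within one filesystem
    return final_path


def latest_checkpoint(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint with the highest step number, or None."""
    if not ckpt_dir.is_dir():
        return None
    found = []
    for path in ckpt_dir.iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[0])[1]


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> dict:
    """Toy training loop: resume from the latest checkpoint if one exists."""
    state = {"weights": 0.0}
    start_step = 0
    ckpt = latest_checkpoint(ckpt_dir)
    if ckpt is not None:
        with open(ckpt) as f:
            payload = json.load(f)
        state, start_step = payload["state"], payload["step"]
    for step in range(start_step + 1, total_steps + 1):
        state["weights"] += 1.0  # stand-in for one real training step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return state
```

With a real object store, the two file operations would be replaced by the corresponding upload and listing calls. A smaller `save_every` shrinks the amount of work lost on a reclaim, at the price of more time spent writing checkpoints.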
Graceful termination
A spot reclaim leaves only a short window before the pod is force-killed. The gracefulTermination block in the AIchor manifest uses that window to run a command, most commonly a final checkpoint save, so progress made since the last periodic checkpoint is preserved.
The command is installed as a Kubernetes preStop hook: on termination (a spot reclaim, a cancel, or any pod deletion) it runs first, and the container receives SIGTERM only once it returns. Trapping signals inside the workload is therefore not required.
spec:
  gracefulTermination:
    shutdownCommand: ["python", "save_checkpoint.py"]
    terminationGracePeriodSeconds: 60
When the block is present, both shutdownCommand and terminationGracePeriodSeconds are required, and the grace period must be greater than 0. When it is omitted, no shutdown command runs and Kubernetes applies its default grace period of 30 seconds.
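The rules in the previous paragraph can be expressed as a small pre-submission check. This is only a sketch of those rules as stated here, not AIchor's actual validation logic; the function name and the dictionary shape (a manifest already parsed into Python) are assumptions.

```python
from typing import Any, Mapping, Optional

# Kubernetes default that applies when the block is omitted.
DEFAULT_GRACE_PERIOD_SECONDS = 30

REQUIRED_FIELDS = ("shutdownCommand", "terminationGracePeriodSeconds")


def check_graceful_termination(spec: Mapping[str, Any]) -> Optional[dict]:
    """Hypothetical helper: return the block if it is valid, None if it is
    absent, and raise ValueError if it breaks one of the stated rules."""
    block = spec.get("gracefulTermination")
    if block is None:
        # No shutdown command runs; the default grace period applies.
        return None
    missing = [name for name in REQUIRED_FIELDS if name not in block]
    if missing:
        raise ValueError(f"gracefulTermination is missing: {', '.join(missing)}")
    if block["terminationGracePeriodSeconds"] <= 0:
        raise ValueError("terminationGracePeriodSeconds must be greater than 0")
    return dict(block)
```

Running a check like this locally surfaces a missing field or a zero grace period before the experiment is submitted.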
Cloud providers give only a short interruption notice for spot instances (AWS, for example, offers around 120 seconds). On a reclaim the node is drained within that window, so a terminationGracePeriodSeconds larger than the provider's notice is not fully honoured. Keeping the shutdown command well within the provider's notice (~120 seconds on AWS) lets checkpoints complete reliably. On a normal cancel or deletion the full grace period is honoured. For more information, consult the documentation of the relevant cloud provider.
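The arithmetic behind this advice is simple: on a reclaim, the time actually available is the smaller of the configured grace period and the provider's notice. The sketch below makes that explicit. The function names and the 50% safety margin are illustrative assumptions, and the 120-second constant is only the approximate AWS figure quoted above.

```python
from typing import Optional

# Approximate AWS spot interruption notice quoted above; other providers differ.
AWS_SPOT_NOTICE_SECONDS = 120.0


def effective_window(grace_period_s: float,
                     provider_notice_s: Optional[float] = None) -> float:
    """Seconds the shutdown command can count on.

    provider_notice_s=None models a normal cancel or deletion, where the
    full grace period is honoured. On a spot reclaim, pass the provider's
    notice: the window is capped by whichever limit is shorter.
    """
    if grace_period_s <= 0:
        raise ValueError("grace period must be greater than 0")
    if provider_notice_s is None:
        return grace_period_s
    return min(grace_period_s, provider_notice_s)


def fits_in_window(command_duration_s: float,
                   grace_period_s: float,
                   provider_notice_s: Optional[float] = None,
                   safety_factor: float = 0.5) -> bool:
    """True if the command's measured duration uses at most safety_factor
    of the effective window (0.5 is an arbitrary illustrative margin)."""
    window = effective_window(grace_period_s, provider_notice_s)
    return command_duration_s <= safety_factor * window
```

For example, with a grace period of 300 seconds the window on an AWS reclaim is still only about 120 seconds, so a final checkpoint that takes 90 seconds to write would be at risk even though it fits comfortably in the configured grace period.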
See gracefulTermination in the manifest reference for the full field definitions.