KubeRay
KubeRay runs a Ray cluster on Kubernetes with a head node hosting the Global Control Store (GCS) and Ray dashboard, plus one or more worker pools that join the cluster on startup.
Select KubeRay on AIchor by setting spec.operator: kuberay in your manifest.yaml. AIchor schedules the head and worker pods together, starts Ray in each container with the right addresses and resources, and then runs your spec.command from a dedicated submitter pod.
Requirements
The experiment Docker image must include:
- The ray executable (either from the base image or installed as a Python package).
- wget (apt install wget).
Using a Python virtual environment inside the image is not recommended. A virtual environment only exposes its binaries on PATH once its activate script has been executed. Doing this from the Dockerfile may seem to work, but Ray workers are started with bash -l, which can overwrite PATH. Install Python packages globally instead (or with --system if you are using uv).
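The requirements above can be checked from inside the image with a small script. This is a minimal sketch, not an AIchor tool: the function names are hypothetical, and it only inspects the PATH of the process that runs it, which may differ from the PATH a bash -l login shell ends up with.

```python
import shutil
import sys

# The two executables listed in the requirements above.
REQUIRED_EXECUTABLES = ("ray", "wget")


def missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


def in_virtualenv():
    """Return True when the interpreter runs inside a venv or virtualenv."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("missing executables:", missing_executables() or "none")
print("inside a virtualenv:", in_virtualenv())
```

Running the script through a login shell during the image build gives a closer approximation of what Ray workers will see.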
For example, a Dockerfile that satisfies these requirements:
Ray from base image: when the base image already ships the ray executable, only wget has to be added.
FROM rayproject/ray:2.23.0
# wget is required by AIchor's KubeRay setup
RUN sudo apt-get update \
    && sudo apt-get install -y --no-install-recommends wget \
    && sudo rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . /app
Ray as a Python package: when starting from a plain image, install ray globally (not in a virtualenv) so the binaries stay on PATH when Ray workers start with bash -l.
FROM python:3.11-slim
# wget is required by AIchor's KubeRay setup
RUN apt-get update \
    && apt-get install -y --no-install-recommends wget \
    && rm -rf /var/lib/apt/lists/*
# Install Ray globally, not in a virtualenv
RUN pip install --no-cache-dir "ray[default]==2.23.0"
# Or with uv, install into the system environment instead of a virtualenv:
# RUN uv pip install --system "ray[default]==2.23.0"
WORKDIR /app
COPY . /app
The Ray version installed in the image should match spec.rayVersion in the manifest (which defaults to 2.23.0). AIchor writes spec.rayVersion into the RayJob/RayCluster resource that the KubeRay operator reconciles, but the Ray binaries themselves come from the experiment image. The version field does not pull a Ray image, so a mismatch leaves the operator and the running Ray processes disagreeing on version-specific behaviour.
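A version mismatch can be caught early with a check like the one below. This is a sketch under assumptions: EXPECTED_RAY_VERSION is a hypothetical constant that you keep in sync with spec.rayVersion by hand, and the helper names are illustrative.

```python
from importlib import metadata

# Assumption: kept in sync manually with spec.rayVersion in manifest.yaml.
EXPECTED_RAY_VERSION = "2.23.0"


def installed_ray_version():
    """Return the installed Ray version, or None if Ray is not installed."""
    try:
        return metadata.version("ray")
    except metadata.PackageNotFoundError:
        return None


def check_ray_version(installed, expected=EXPECTED_RAY_VERSION):
    """Return a description of the problem, or None when the versions agree."""
    if installed is None:
        return "ray is not installed in this environment"
    if installed != expected:
        return f"ray {installed} is installed but the manifest expects {expected}"
    return None


problem = check_ray_version(installed_ray_version())
print(problem or "ray version matches the manifest")
```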
How to use
A minimal manifest selecting KubeRay:
kind: "AIchorManifest"
apiVersion: "0.2.3"
spec:
  operator: "kuberay"
  command: "python3 -u main.py"
  types:
    Head:
      resources:
        cpus: 1
        ramRatio: 2
        accelerators:
          gpu:
            count: 1
            product: "NVIDIA-A100-SXM4-80GB"
            type: "gpu"
    Workers:
      - name: "cpu-small"
        count: 8
        resources:
          cpus: 6
          ramRatio: 2
      - name: "cpu-big"
        count: 8
        resources:
          cpus: 12
          ramRatio: 2
For more complete examples (heterogeneous GPU pools, TPU multi-host, …) see Manifest Examples and the full schema in the Manifest Reference.
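To get a feel for what the minimal manifest above requests, the pod and CPU totals can be worked out by hand or with a few lines of Python. The sketch below hard-codes the relevant fields as a dictionary and assumes that each pod requests exactly the cpus value of its group; the helper names are hypothetical.

```python
# The CPU-related fields of the minimal manifest, copied by hand.
manifest_types = {
    "Head": {"resources": {"cpus": 1}},
    "Workers": [
        {"name": "cpu-small", "count": 8, "resources": {"cpus": 6}},
        {"name": "cpu-big", "count": 8, "resources": {"cpus": 12}},
    ],
}


def pod_count(types):
    """Head pod plus every worker pod (the submitter pod is not counted)."""
    return 1 + sum(group["count"] for group in types["Workers"])


def total_cpus(types):
    """CPUs requested by the head plus all worker groups."""
    head = types["Head"]["resources"]["cpus"]
    workers = sum(g["count"] * g["resources"]["cpus"] for g in types["Workers"])
    return head + workers


print(pod_count(manifest_types))   # 1 + 8 + 8 = 17
print(total_cpus(manifest_types))  # 1 + 8*6 + 8*12 = 145
```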
Ray cluster components
Head
The head runs as one pod per experiment, and every other component connects to it. It owns the Global Control Store (GCS), the Ray scheduler, and the dashboard, so nothing else in the cluster can do useful work until GCS is ready.
Workers
Each worker group in the manifest (Workers array) produces a pool of identical pods.
Worker pods are scheduled concurrently with the head but depend on it: they must connect to it before doing anything useful. An init container on each worker waits for the head to become ready before the worker's main process starts. The KubeRay operator then takes care of starting the Ray processes on the workers and connecting them to the head.
Submitter
Once the head and workers are up and connected, KubeRay creates a Kubernetes Job called the submitter. The resulting submitter pod runs:
ray job submit --address ... --submission-id ... -- <spec.command>
KubeRay injects two environment variables into the submitter pod automatically:
| Variable | Value |
|---|---|
| RAY_DASHBOARD_ADDRESS | Head service address and dashboard port |
| RAY_JOB_SUBMISSION_ID | The RayJob's submission ID |
The submitter pod is not deleted when the job finishes. It is kept because it holds the experiment logs, and is only removed when the parent RayJob object is deleted. For this reason, AIchor surfaces logs from the submitter container rather than the head or worker pods, which produce little output after startup.
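As an illustration of how the two injected variables relate to the submit command shown earlier, the sketch below assembles an argument list from them. It assumes the variables map directly onto --address and --submission-id; the exact value format KubeRay passes is not shown in this page, and the example values are hypothetical.

```python
import shlex


def build_submit_argv(env, command):
    """Assemble a `ray job submit` argument list from the injected variables."""
    return [
        "ray", "job", "submit",
        "--address", env["RAY_DASHBOARD_ADDRESS"],
        "--submission-id", env["RAY_JOB_SUBMISSION_ID"],
        "--", *shlex.split(command),
    ]


# Hypothetical values, for illustration only.
example_env = {
    "RAY_DASHBOARD_ADDRESS": "head-svc.example:8265",
    "RAY_JOB_SUBMISSION_ID": "example-submission-id",
}
argv = build_submit_argv(example_env, "python3 -u main.py")
print(shlex.join(argv))
```

You never need to run this yourself; it only shows where spec.command ends up.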
Setting up the distribution
Unlike most distributed frameworks on AIchor, KubeRay does not rely on environment variables for setup. The distribution is initialized before your script starts, through the ray start commands executed on every container of the experiment — so your code does not need to interpret any environment variable to bring the workers together.
The init container on each worker produces logs similar to:
8 seconds elapsed: Waiting for GCS to be ready.
15 seconds elapsed: Waiting for GCS to be ready.
25 seconds elapsed: Waiting for GCS to be ready.
116 seconds elapsed: Waiting for GCS to be ready.
Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 3272, in ray._raylet.check_health
  File "python/ray/_raylet.pyx", line 583, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.43.84.237:6379: Failed to connect to remote host: Connection refused
129 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
GCS is ready. Any error messages above can be safely ignored.
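The log above comes from a poll-until-ready loop. The sketch below shows the general pattern only; it is not KubeRay's init container, which performs a Ray-level health check (the traceback mentions ray._raylet.check_health), whereas tcp_probe here merely tests that the port accepts connections. All names are hypothetical.

```python
import socket
import time
from typing import Callable


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds; return False once timeout_s has elapsed."""
    start = time.monotonic()
    while not probe():
        elapsed = time.monotonic() - start
        if elapsed >= timeout_s:
            return False
        print(f"{int(elapsed)} seconds elapsed: still waiting.")
        sleep(interval_s)
    return True
```

With the example address from the log, a call would look like wait_until_ready(lambda: tcp_probe("10.43.84.237", 6379)).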
Injected environment variables
The head container gets:
| Variable | Description | Example |
|---|---|---|
| RAY_CLUSTER_NAME | Name of the RayCluster this pod belongs to. | experiment-6f4a850d-2281-raycluster-q5wqj |
The worker containers get:
| Variable | Description | Example |
|---|---|---|
| RAY_CLUSTER_NAME | Name of the RayCluster this pod belongs to. | experiment-6f4a850d-2281-raycluster-q5wqj |
| RAY_NODE_TYPE_NAME | Name of the worker group this pod belongs to. | cpu-workers |
| KUBERAY_GEN_RAY_START_CMD | The ray start command used to start Ray on this worker. | ray start ... |
| RAY_ADDRESS | Head address. Setting RAY_ADDRESS lets ray.init() connect when called from within the cluster. | 10.43.84.237:6379 |
| RAY_PORT | Port the head's GCS listens on. | 6379 |
In your code:
import os
import ray


def main():
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
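The variables in the tables above can also be read for diagnostics, for example to log which worker group a task landed on. The following is a minimal sketch using only the standard library; NodeInfo and read_node_info are hypothetical names, and it assumes the process reading the variables inherits the container's environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class NodeInfo:
    cluster_name: Optional[str]
    worker_group: Optional[str]  # None when RAY_NODE_TYPE_NAME is not set
    head_host: Optional[str]
    head_port: Optional[int]


def read_node_info(env: Mapping[str, str] = os.environ) -> NodeInfo:
    """Collect the KubeRay-injected variables into one structure."""
    address = env.get("RAY_ADDRESS", "")
    host, sep, port_text = address.rpartition(":")
    head_host = host if sep else (address or None)
    head_port = int(port_text) if sep and port_text.isdigit() else None
    # Fall back to RAY_PORT when the address carries no port.
    if head_port is None and env.get("RAY_PORT", "").isdigit():
        head_port = int(env["RAY_PORT"])
    return NodeInfo(
        cluster_name=env.get("RAY_CLUSTER_NAME"),
        worker_group=env.get("RAY_NODE_TYPE_NAME"),
        head_host=head_host,
        head_port=head_port,
    )


print(read_node_info())
```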
Ray dashboard
Once at least the head of your experiment is up, you can open the Ray dashboard by clicking the dashboard link in the top right of the experiment page in AIchor's UI.
Explicitly spawn an actor on a particular component
KubeRay head and workers are started with logical resources matching their name. Given this manifest:
spec:
  types:
    Head:
      resources:
        cpus: 4
        ramRatio: 2
    Workers:
      - name: "cpu-small"
        count: 8
        resources:
          cpus: 6
          ramRatio: 2
      - name: "cpu-big"
        count: 4
        resources:
          cpus: 12
          ramRatio: 2
Each component is assigned a logical resource equal to its name:
- the head is started with --resources='{"head":100000}'
- worker group cpu-small is started with --resources='{"cpu-small":100000}'
- worker group cpu-big is started with --resources='{"cpu-big":100000}'
100000 is intentionally set by AIchor to a large number so the custom resource never becomes the binding constraint (see default ray start params). Actual scheduling is still limited by real CPU, memory, and GPU capacity.
Target these from a @ray.remote decorator to pin actors to a specific pool:
import os
import ray

# initialize_weights, update, model_grad, aggregate, num_steps, and batches are
# placeholders for your own training code; they are not defined in this snippet.


# Requesting the "head" logical resource pins this actor to the head node.
@ray.remote(num_cpus=1, resources={"head": 1})
class ParameterServer:
    def __init__(self):
        self.weights = initialize_weights()

    def push(self, gradients):
        self.weights = update(self.weights, gradients)

    def pull(self):
        return self.weights


# Requesting "cpu-small" restricts these actors to the cpu-small worker group.
@ray.remote(num_cpus=6, resources={"cpu-small": 1})
class DataWorker:
    def compute_gradients(self, weights, batch):
        return model_grad(weights, batch)


def main():
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
    ps = ParameterServer.remote()
    workers = [DataWorker.remote() for _ in range(8)]
    weights = ray.get(ps.pull.remote())
    for _ in range(num_steps):
        gradients = ray.get([w.compute_gradients.remote(weights, b) for w, b in zip(workers, batches)])
        ps.push.remote(aggregate(gradients))
        weights = ray.get(ps.pull.remote())
ParameterServer requires 1 head logical resource, so it is pinned to the head node. Each of the 8 DataWorker actors requires 1 cpu-small logical resource and 6 CPUs. Because a cpu-small pod offers exactly 6 CPUs, each actor occupies a whole pod and the 8 actors are spread across the 8 cpu-small worker pods.
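The placement arithmetic can be sketched as follows. It assumes CPU is the only binding constraint (the logical resource is set to 100000, so it never is) and ignores memory and any other actors already running; the function names are hypothetical.

```python
def actors_per_pod(pod_cpus: int, actor_cpus: int) -> int:
    """How many actors of a given CPU size fit on one pod."""
    return pod_cpus // actor_cpus


def pool_capacity(pod_count: int, pod_cpus: int, actor_cpus: int) -> int:
    """How many such actors fit across a whole worker group."""
    return pod_count * actors_per_pod(pod_cpus, actor_cpus)


print(pool_capacity(8, 6, 6))   # cpu-small: 8 pods x 1 actor each  = 8
print(pool_capacity(4, 12, 6))  # cpu-big:   4 pods x 2 actors each = 8
```

Asking for a ninth DataWorker on cpu-small would leave that actor pending, since no pod has 6 free CPUs left.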
Jupyter debug mode
Enable Jupyter debug mode in the manifest:
spec:
  ...
  debug:
    jupyter:
      enabled: true
      path: "jupyter" # optional, defaults to jupyter
You also need the Jupyter server binaries in your experiment image. For example, add the following to requirements.txt:
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
Wait for the experiment to be running. The Jupyter server's logs will appear in the experiment logs — copy the token from them. To access the server, click the Jupyter button in the "View Dashboard" dropdown and paste the token.
Notes:
- The Jupyter server runs for at most 12 hours. When you are done debugging, cancel the experiment as you would any other.
- Access is secured by AIchor SSO.
- The Jupyter server is only usable from a browser via the web interface. The VSCode plugin is not currently supported, due to the SSO security layer.
- When spec.debug.jupyter.enabled is true, AIchor overwrites the manifest's command with a Jupyter server start command.
Customize ray start ... params
By default, Ray is started with these parameters.
Head:
--head
--resources='{"head":100000}' # custom resource you can target to schedule actors on a specific ray node
--include-dashboard=true
--dashboard-host=0.0.0.0
--dashboard-agent-listen-port=52365
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080
Workers:
--address=<head-address:head-port> # set by AIchor so the worker can connect to the head
--resources='{"<name of the worker group>":100000}'
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080
Additional ray start parameters can be added — or defaults overridden — from the manifest using rayStartParams:
kind: "AIchorManifest"
apiVersion: "0.2.3"
spec:
  ...
  types:
    Head:
      ...
      rayStartParams: # optional, can be empty
        ray-debugger-external: "true" # allow debugger
        object-store-memory: "3000000000" # https://docs.ray.io/en/master/cluster/cli.html#cmdoption-ray-start-object-store-memory
        num-cpus: "0" # useful to force actors to schedule on workers
      ...
    Workers:
      - name: "small-cpus"
        rayStartParams: # optional, can be empty
          ray-debugger-external: "true" # allow debugger
          object-store-memory: "1000000000"
        ...
      - name: "big-cpus"
        rayStartParams: # optional, can be empty
          ray-debugger-external: "true" # allow debugger
          object-store-memory: "5000000000"
        ...
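One way to picture how rayStartParams interacts with the defaults is a dictionary merge in which manifest values win. This is only a sketch of that idea, not AIchor's actual merge logic, which is not shown here; the helper names are hypothetical and only a subset of the head defaults is included.

```python
def merge_params(defaults, overrides):
    """Manifest values replace defaults with the same name; new names are added."""
    return {**defaults, **overrides}


def render_flags(params):
    """Render a parameter mapping as ray start flags; None means a bare flag."""
    return [
        f"--{name}" if value is None else f"--{name}={value}"
        for name, value in params.items()
    ]


# A subset of the head defaults listed above (num-cpus as if 4 were requested).
head_defaults = {"num-cpus": "4", "log-style": "pretty", "block": None}
# Values taken from the rayStartParams example for the head.
head_overrides = {"num-cpus": "0", "object-store-memory": "3000000000"}

print(render_flags(merge_params(head_defaults, head_overrides)))
```

Here num-cpus ends up as 0, which matches the comment in the manifest about forcing actors to schedule on workers.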
Demo project
The AIchor team maintains a demo project that can be cloned and used as a starting point for KubeRay experiments on AIchor: