Operators
In AIchor an operator is a Kubernetes operator, which can be seen as a plugin injected into a Kubernetes cluster.
The main goal of the operators is distributed computing. Operators add no value when running a single container.
To do distributed computing, a single experiment might need to schedule a master (or head, …) process, then some worker processes and/or other kinds of processes; everything has to be interconnected, a role has to be assigned to every container, and so on.
To do that we need a component that deploys everything, assigns each container its role, informs the workers of the master process's IP/port, etc.
How does it work?
Operators are containerized programs running inside the Kubernetes cluster. An operator is plugged into the Kubernetes API and watches for the creation of resources; when a new resource is detected it creates the containers and passes each one its role and the master's IP/port (usually through environment variables). For example with the Jax-Operator:
- Jax-Operator detects a new jaxjob with 3 desired replicas
- Creates the 3 replicas
- container 0: role master, addr-coordinator 0.0.0.0 (or localhost) and port 1234
- container 1: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234
- container 2: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234
NOTE: In the graph below, Jax-Operator is responsible only for interactions in black, not the red ones.
KubeRay Operator
Requirements
The experiment image requires:
- ray executable (either from the base image or a python package)
- wget (apt install wget)
Using virtual envs in your image is not recommended: when installing python packages in a virtual env you need to activate this env to have the installed binaries in your PATH env var. Doing that from the Dockerfile seems like a solution, but ray workers are started with bash -l, which can overwrite the PATH. We recommend installing python packages globally in the image (or with --system if you are using uv).
How to use KubeRay on AIchor?
AIchor users can use the KubeRay operator by specifying it in the “operator” field of the manifest.
kind: AIchorManifest
apiVersion: 0.2.2
...
spec:
operator: kuberay
command: "python3 -u main.py"
...
types:
Head:
resources:
cpus: 1
ramRatio: 2
accelerators:
gpu:
count: 1
product: NVIDIA-A100-SXM4-80GB
type: gpu
Workers:
- name: cpu-small
count: 8
resources:
cpus: 6
ramRatio: 2
- name: cpu-big
count: 8
resources:
cpus: 12
ramRatio: 2
Available types
KubeRay Operator comes with 2 types: Head and Workers. Both are required.
Head
The other components connect to the head. It hosts the GCS, the Ray dashboard, ...
Worker(s)
Although the head and the workers are scheduled at the same time, the workers depend on the head and need to connect to it. Thanks to an init container, the workers wait for the head to be up and running before starting their main process. The operator takes care of starting the ray processes on the workers and connecting them to the head.
Submitter
Once the head and the workers are up and connected, KubeRay will spawn a third type of component: the submitter.
This container runs ray submit with your manifest's spec.command. The experiment's logs will be made available in this container.
The head and the workers should not print many logs after startup.
Setting up the distribution
Distribution is usually set up through environment variables; that is NOT the case here.
The distribution is initialized before your script starts, through the ray start commands executed on every container of the experiment, so you don't need to interpret any environment variables to set up the distribution.
Init containers on workers produce logs like this:
8 seconds elapsed: Waiting for GCS to be ready.
15 seconds elapsed: Waiting for GCS to be ready.
25 seconds elapsed: Waiting for GCS to be ready.
116 seconds elapsed: Waiting for GCS to be ready.
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 3272, in ray._raylet.check_health
File "python/ray/_raylet.pyx", line 583, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.43.84.237:6379: Failed to connect to remote host: Connection refused
129 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
GCS is ready. Any error messages above can be safely ignored.
Injected environment variables inside the containers
On the head container, this set of environment variables is created by the operator.
CURRENT_POD_IP: (v1:status.podIP) # the ip of the pod
CPU_REQUEST: 12 # CPU requested
RAY_CLUSTER_NAME: experiment-6f4a850d-2281-raycluster-q5wqj
On the worker containers, this set of environment variables is created by the operator.
CURRENT_POD_IP: (v1:status.podIP) # the ip of the pod
CPU_REQUEST: 12 # CPU requested
RAY_CLUSTER_NAME: experiment-6f4a850d-2281-raycluster-q5wqj
RAY_NODE_TYPE_NAME: cpu-workers # name of the workers group
KUBERAY_GEN_RAY_START_CMD: ray start ... # Ray start command for this worker
RAY_ADDRESS: ... # head address; setting the RAY_ADDRESS env var allows connecting to Ray using ray.init() from within the cluster
RAY_PORT: 6379
In your code:
import os

import ray

def main():
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))
Ray Dashboard
When your experiment is up and running (at least the head) you can access the Ray Dashboard by clicking on the dashboard link (top right of AIchor’s UI).
Explicitly spawn an actor on a particular component
KubeRay head and workers are started with logical resources corresponding to their names.
From the example above:
- the head will be started with --resources='{"head":100000}'
- worker group "cpu-small" will be started with --resources='{"cpu-small":100000}'
- worker group "cpu-big" will be started with --resources='{"cpu-big":100000}'
Knowing that, here's a ray decorator requesting this logical resource:
@ray.remote(num_cpus=2, resources={"cpu-small": 1})
In the example above the actor requires 2 CPUs and 1 'cpu-small', which is a logical resource. This means the actor will run on one of the cpu-small workers and will have 2 CPUs reserved.
A more realistic use case would be, for example, to define a parameter server as one of your worker groups and then use its logical resource to force the actor to run on the desired worker, as sketched below.
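The following is a minimal sketch only: the ParameterServer class is hypothetical, and the "cpu-big" group name simply reuses the manifest example above.

import ray

ray.init(address="auto")  # connect to the running KubeRay cluster

# Hypothetical parameter-server actor pinned to the "cpu-big" worker group
# by requesting its logical resource alongside regular CPUs.
@ray.remote(num_cpus=4, resources={"cpu-big": 1})
class ParameterServer:
    def __init__(self):
        self.weights = {}

    def push(self, key, value):
        self.weights[key] = value

    def pull(self, key):
        return self.weights.get(key)

ps = ParameterServer.remote()
ray.get(ps.push.remote("layer0", [0.0, 0.1]))
print(ray.get(ps.pull.remote("layer0")))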
Jupyter debug mode
Enable it:
...
spec:
...
debug:
jupyter: true
...
You also need to provide the Jupyter Server binary inside your experiment's image. In this example we simply added this to our requirements.txt:
...
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
...
Then wait for your experiment to be running. In the logs you should see the Jupyter server's output; find and copy the token. To access the server, click on the Jupyter button in the "View Dashboard" dropdown and paste the token.
Additional notes:
- The Jupyter server will run for 12 hours max. When you are done debugging your code, cancel it like a regular experiment.
- The access to this Jupyter server is secured by AIchor SSO.
- The Jupyter server will only be usable from the browser using the web interface. For now, the vscode plugin is not supported because of the SSO security layer.
- When spec.debug.jupyter is enabled, AIchor will overwrite the manifest's command with a Jupyter server start command.
Customize ray start ... params
By default, Ray is started with these parameters:
Head:
--head
--resources='{"head":100000}' # custom resource that you can use to schedule your actors on specific ray node
--include-dashboard=true
--dashboard-host=0.0.0.0
--dashboard-agent-listen-port=52365
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080
Worker(s):
--address=<head-address:head-port> # set by the operator to connect to head
--resources='{"<name of the worker group>":100000}'
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080
From the manifest, ray start params can be added (or can overwrite the defaults), for example:
kind: AIchorManifest
apiVersion: 0.2.2
...
spec:
...
types:
Head:
...
rayStartParams: # optional, can be empty
ray-debugger-external: "true" # allow debugger
object-store-memory: "3000000000" # https://docs.ray.io/en/master/cluster/cli.html#cmdoption-ray-start-object-store-memory
num-cpus: "0" # useful to force actors to schedule on workers
...
Workers:
- name: small-cpus
rayStartParams: # optional, can be empty
ray-debugger-external: "true" # allow debugger
object-store-memory: "1000000000"
...
- name: big-cpus
rayStartParams: # optional, can be empty
ray-debugger-external: "true" # allow debugger
object-store-memory: "5000000000"
...
Sample project
The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the KubeRay operator.
https://gitlab.com/instadeep/infra/ichor-demo-raytune
TensorFlow
TensorFlow (TF) is an open-source machine learning framework developed by Google Brain. It offers a comprehensive ecosystem of tools, libraries, and community resources for building and deploying machine learning models.
How to use TF on AIchor
AIchor users can use the TF operator by specifying it in the “operator” field of the manifest.
spec:
operator: tf
image: efficientnet
command: "python train.py"
types:
Worker:
count: 1
resources:
cpus: 5
ramRatio: 2
shmSizeGB: 0
accelerators:
gpu:
count: 0
...
Available types
TF Operator comes with 4 types: Chief, PS (parameter server), Worker, and Evaluator.
A TF experiment typically contains 0 or more replicas of each of these types.
Distribution and connection
When starting an experiment with more than 1 container (sum of all types and counts), the operator injects the TF_CONFIG environment variable into all of the experiment's containers.
This variable contains the addresses and ports of all the experiment's components as a stringified JSON.
Example value from the PS:
{
"cluster": {
"chief": ["experiment-e1dc568c-e211-chief-0.adr-smoke-de-3b4c7b78df264d12.svc:2222"],
"ps": ["experiment-e1dc568c-e211-ps-0.adr-smoke-de-3b4c7b78df264d12.svc:2222"],
"worker": [
"experiment-e1dc568c-e211-worker-0.adr-smoke-de-3b4c7b78df264d12.svc:2222",
"experiment-e1dc568c-e211-worker-1.adr-smoke-de-3b4c7b78df264d12.svc:2222"
]
},
"task": {
"type": "ps",
"index": 0
},
"environment": "cloud"
}
- cluster shows all of the types with their addresses and ports.
- task shows the current role and index; here the TF_CONFIG was retrieved from the PS container.
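For illustration, here is a minimal sketch of consuming TF_CONFIG from Python; note that tf.distribute strategies such as MultiWorkerMirroredStrategy read TF_CONFIG from the environment on their own, so manual parsing is only needed for custom logic:

import json
import os

import tensorflow as tf

# TF_CONFIG is injected by the operator; parse it only if you need custom logic.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
print(f"running as {task.get('type')} #{task.get('index')}")

# tf.distribute strategies discover the cluster from TF_CONFIG themselves.
strategy = tf.distribute.MultiWorkerMirroredStrategy()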
Sample projects
The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the TF operator.
https://gitlab.com/instadeep/infra/ichor-demo-efficientnet/
Documentation
The operator that manages TF experiments is the Kubeflow Training operator
XGBoost
XGBoost (Extreme Gradient Boosting) is a powerful machine learning framework for gradient boosting.
How to use XGBoost on AIchor
AIchor users can use the XGBoost operator by specifying it in the “operator” field of the manifest.
spec:
operator: xgboost
image: xgboost-demo
command: "python src/train.py"
types:
Master:
count: 1
resources:
cpus: 4
ramRatio: 2
Worker:
count: 2
resources:
cpus: 8
ramRatio: 2
...
Available types
XGBoost Operator comes with 2 types: Master and Worker.
At least 1 worker must be set.
Environment variables injected by the operator
The following environment variables are injected to set up the distribution between the different containers; a minimal example of reading them follows the list below.
MASTER_PORT: 9999
MASTER_ADDR: xgboost-dist-demo-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of current container (from 0 to $WORLD_SIZE - 1)
WORKER_PORT: 9999
WORKER_ADDRS: xgboost-dist-demo-worker-0,xgboost-dist-demo-worker-1 # comma-separated values
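As an illustration only (not part of the demo project), a minimal sketch of reading these variables to determine the role of the current container:

import os

# All values below are injected by the operator (see the list above).
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
master_addr = os.environ["MASTER_ADDR"]
master_port = int(os.environ["MASTER_PORT"])
worker_addrs = os.environ["WORKER_ADDRS"].split(",")

if rank == 0:
    print(f"master at {master_addr}:{master_port}, expecting {len(worker_addrs)} workers")
else:
    print(f"worker {rank}/{world_size - 1}, connecting to {master_addr}:{master_port}")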
Sample projects
The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the XGBoost operator.
https://gitlab.com/instadeep/infra/aichor/xgboost-demo
Documentation
The operator that manages XGBoost experiments is the Kubeflow Training operator
This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.
PyTorch
PyTorch is a popular open-source deep learning framework primarily developed by Facebook's AI Research lab (FAIR). It provides a flexible and dynamic approach to building neural networks and conducting deep learning research.
How to use PyTorch Operator on AIchor
AIchor users can use the PyTorch operator by specifying it in the “operator” field of the manifest.
spec:
operator: pytorch
image: image
command: "python3 src/main.py"
types:
Master:
count: 1
resources:
cpus: 4
ramRatio: 2
accelerators:
gpu:
count: 1
type: gpu
product: Quadro-RTX-4000
Worker:
count: 2
resources:
cpus: 4
ramRatio: 3
shmSizeGB: 0
accelerators:
gpu:
count: 1
type: gpu
product: Quadro-RTX-4000
Available types
PyTorch Operator comes with 2 types: Master and Worker.
At least 1 worker must be set.
Environment variables injected by the operator
Environment variables are injected to set up the distribution between the different containers; a minimal usage sketch follows the list below.
MASTER_PORT: 23456
MASTER_ADDR: pytorch-dist-cifar-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of container (from 0 to $WORLD_SIZE - 1)
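As an illustration (a minimal sketch, not the demo project's code), torch.distributed can consume these variables directly when initialized with the env:// method:

import torch
import torch.distributed as dist

def setup_distributed():
    # init_method="env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment variables injected by the operator.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, init_method="env://")
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")

if __name__ == "__main__":
    setup_distributed()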
Sample projects
The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the PyTorch operator.
https://gitlab.com/instadeep/infra/aichor/pytorch-demo
Documentation
The operator that manages PyTorch experiments is the Kubeflow Training operator
operator docs
pytorch docs
pytorch docs about communication
examples
This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.
Jax operator
JAX is an open-source library developed by Google Research that provides a way to perform numerical computing and machine learning tasks with high performance and automatic differentiation.
How to use Jax on AIchor
AIchor users can use the Jax operator by specifying it in the “operator” field of the manifest.
spec:
operator: jax
image: image
command: "python3 -u main.py --operator=jax --sleep=300 --tb-write=True"
tensorboard:
enabled: true
types:
Worker:
count: 2
resources:
cpus: 1
ramRatio: 2
shmSizeGB: 0
accelerators: # optional
gpu:
count: 1
type: gpu
product: NVIDIA-A100-SXM4-80GB
...
Available types
Jax Operator comes with 1 type: Worker.
At least 1 worker must be set.
Environment variables injected by the operator
The following environment variables are injected to set up the distribution between the different containers; a minimal usage sketch follows the list below.
JAXOPERATOR_COORDINATOR_ADDRESS: 0.0.0.0:1234
JAXOPERATOR_NUM_PROCESSES: 0
JAXOPERATOR_PROCESS_ID: 1
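As an illustration (a minimal sketch, not taken from the demo project), these variables can be forwarded to jax.distributed.initialize:

import os

import jax

# Forward the operator-injected variables to JAX's distributed runtime.
jax.distributed.initialize(
    coordinator_address=os.environ["JAXOPERATOR_COORDINATOR_ADDRESS"],
    num_processes=int(os.environ["JAXOPERATOR_NUM_PROCESSES"]),
    process_id=int(os.environ["JAXOPERATOR_PROCESS_ID"]),
)

print(f"process {jax.process_index()} of {jax.process_count()}, "
      f"local device count: {jax.local_device_count()}")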
Sample projects
The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the Jax operator.
https://gitlab.com/instadeep/infra/aichor/smoke-test-any-operator