
Operators

In AIchor, an operator is a Kubernetes operator, which can be seen as a plugin injected into the Kubernetes cluster.

The main purpose of operators is distributed computing. Operators add no value when running a single container.

For distributed computing, a single experiment might need to schedule a master (or head, …) process, then some worker processes and/or other kinds of processes. Everything has to be interconnected, roles must be assigned to every container, and so on.

To do that, we need a component that deploys everything, assigns each container its role, informs workers of the master's IP/port, etc.

How does it work?

Operators are containerized programs running inside the Kubernetes cluster. The operator hooks into the Kubernetes API and watches for the creation of resources; when a new resource is detected, it creates the containers and passes them their role and the master’s IP/port (usually through environment variables). For example, with the Jax-Operator:

  • Jax-Operator detects a new jaxjob with 3 desired replicas
  • It creates the 3 replicas:
    • container 0: role master, addr-coordinator 0.0.0.0 (or localhost) and port 1234
    • container 1: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234
    • container 2: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234

NOTE: In the graph below, Jax-Operator is responsible only for interactions in black, not the red ones.

[Diagram: Jax-Operator interactions inside the Kubernetes cluster]

KubeRay Operator

Requirements

The experiment image requires:

  • ray executable (either from base image or python package)
  • wget (apt install wget)

Using virtual envs in your image is not recommended: when installing Python packages in a virtual env, you need to activate that env to have the installed binaries in your PATH env var. Doing that from the Dockerfile seems like a solution, but Ray workers are started with bash -l, which can overwrite the PATH. We recommend installing Python packages globally in the image (or with --system if you are using uv).

How to use KubeRay on AIchor?

AIchor users can use the KubeRay operator by specifying it in the “operator” field of the manifest.

kind: AIchorManifest
apiVersion: 0.2.2

...
spec:
  operator: kuberay
  command: "python3 -u main.py"
  ...

  types:
    Head:
      resources:
        cpus: 1
        ramRatio: 2
        accelerators:
          gpu:
            count: 1
            product: NVIDIA-A100-SXM4-80GB
            type: gpu
    Workers:
      - name: cpu-small
        count: 8
        resources:
          cpus: 6
          ramRatio: 2
      - name: cpu-big
        count: 8
        resources:
          cpus: 12
          ramRatio: 2

Available types

KubeRay Operator comes with 2 types: Head and Workers. Both are required.

Head

Other components connect to the head. It hosts the GCS, the Ray dashboard, ...

Worker(s)

Although the head and the workers are scheduled at the same time, workers depend on the head and need to connect to it. Workers therefore wait for the head to be up and running before starting their main process, thanks to an init container. The operator takes care of starting the Ray processes on the workers and connecting them to the head.

Submitter

Once the head and workers are up and connected, KubeRay spawns a third component type: the submitter.
This container runs ray submit with your manifest's spec.command. The experiment's logs are made available in this container.
The head and workers should not print many logs after startup.

Setting up the distribution

Distribution is usually set up using environment variables; that is NOT the case here.

The distribution is initialized before your script starts, through the ray start commands executed on every container of the experiment, so you don’t need to interpret any environment variables to set up the distribution.

Init containers on workers produce logs like this:

8 seconds elapsed: Waiting for GCS to be ready.
15 seconds elapsed: Waiting for GCS to be ready.
25 seconds elapsed: Waiting for GCS to be ready.
116 seconds elapsed: Waiting for GCS to be ready.
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 3272, in ray._raylet.check_health
File "python/ray/_raylet.pyx", line 583, in ray._raylet.check_status
ray.exceptions.RpcError: failed to connect to all addresses; last error: UNKNOWN: ipv4:10.43.84.237:6379: Failed to connect to remote host: Connection refused
129 seconds elapsed: Still waiting for GCS to be ready. For troubleshooting, refer to the FAQ at https://github.com/ray-project/kuberay/blob/master/docs/guidance/FAQ.md.
GCS is ready. Any error messages above can be safely ignored.

Injected environment variables inside the containers

On the head container, this set of environment variables is created by the operator.

CURRENT_POD_IP: (v1:status.podIP) # the ip of the pod
CPU_REQUEST: 12 # CPU requested
RAY_CLUSTER_NAME: experiment-6f4a850d-2281-raycluster-q5wqj

On worker containers, this set of environment variables is created by the operator.

CURRENT_POD_IP: (v1:status.podIP) # the ip of the pod
CPU_REQUEST: 12 # CPU requested
RAY_CLUSTER_NAME: experiment-6f4a850d-2281-raycluster-q5wqj
RAY_NODE_TYPE_NAME: cpu-workers # name of the worker group
KUBERAY_GEN_RAY_START_CMD: ray start ... # Ray start command for this worker
RAY_ADDRESS: ... # head address: setting the RAY_ADDRESS env var allows connecting to Ray with ray.init() from within the cluster
RAY_PORT: 6379

In your code:

import os
import ray

def main():
    # connect to the existing Ray cluster via the injected RAY_ADDRESS
    ray.init(address=os.environ.get("RAY_ADDRESS", "auto"))

Ray Dashboard

When your experiment is up and running (at least the head) you can access the Ray Dashboard by clicking on the dashboard link (top right of AIchor’s UI).

Explicitly spawn an actor on a particular component

KubeRay head and workers are started with some logical resources corresponding to their name.

From the example above:

  • head will be started with --resources='{"head":100000}'
  • worker group "cpu-small" will be started with --resources='{"cpu-small":100000}'
  • worker group "cpu-big" will be started with --resources='{"cpu-big":100000}'

Knowing that, here's a Ray decorator requesting this logical resource:

@ray.remote(num_cpus=2, resources={"cpu-small": 1})

In the example above, the actor requires 2 CPUs and 1 'cpu-small', which is a logical resource. This means the actor will run on one of the cpu-small workers and will have 2 CPUs reserved.

A more realistic use case would be, for example, to define a parameter server on one of your workers and use the logical resource to force the actor to run on the desired worker, as sketched below.
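
For illustration, here is a minimal sketch of such a parameter-server actor pinned to the "cpu-big" worker group from the manifest example above (the class and method names are hypothetical):

import ray

ray.init(address="auto")

# Hypothetical parameter-server actor pinned to the "cpu-big" worker group
# via its logical resource.
@ray.remote(num_cpus=4, resources={"cpu-big": 1})
class ParameterServer:
    def __init__(self):
        self.weights = {}

    def push(self, updates):
        self.weights.update(updates)

    def pull(self):
        return self.weights

ps = ParameterServer.remote()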

Jupyter debug mode

Enable it:

...
spec:
  ...
  debug:
    jupyter: true
  ...

You also need to provide the Jupyter server binary inside your experiment’s image. In this example, we simply added this to our requirements.txt:

...
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_terminals==0.4.4
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
...

Then wait for your experiment to be running. In the logs you should see the Jupyter server's output. Find and copy the token. To access the server, click on the Jupyter button in the "View Dashboard" dropdown and paste the token.

Additional notes:

  • The Jupyter server will run for 12 hours max. When you are done debugging your code, cancel it like a regular experiment.
  • The access to this Jupyter server is secured by AIchor SSO.
  • The Jupyter server will only be usable from the browser using the web interface. For now, the vscode plugin is not supported because of the SSO security layer.
  • When spec.debug.jupyter is enabled, AIchor overwrites the manifest's command with a Jupyter server start command.

Customize ray start ... params

By default, ray is started with these parameters:

Head:

--head
--resources='{"head":100000}' # custom resource that you can use to schedule your actors on specific ray node
--include-dashboard=true
--dashboard-host=0.0.0.0
--dashboard-agent-listen-port=52365
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080

Worker(s):

--address=<head-address:head-port> # set by the operator to connect to head
--resources='{"<name of the worker group>":100000}'
--memory=<amount of memory requested in the manifest>
--num-cpus=<amount of CPUs requested in the manifest>
--num-gpus=<amount of GPUs requested in the manifest> # unset if no GPU is requested
--log-color=true
--log-style=pretty
--block
--metrics-export-port=8080

From the manifest, ray start params can be added (or can overwrite the defaults), for example:

kind: AIchorManifest
apiVersion: 0.2.2

...
spec:
  ...
  types:
    Head:
      ...
      rayStartParams: # optional, can be empty
        ray-debugger-external: "true" # allow debugger
        object-store-memory: "3000000000" # https://docs.ray.io/en/master/cluster/cli.html#cmdoption-ray-start-object-store-memory
        num-cpus: "0" # useful to force actors to schedule on workers
      ...
    Workers:
      - name: small-cpus
        rayStartParams: # optional, can be empty
          ray-debugger-external: "true" # allow debugger
          object-store-memory: "1000000000"
        ...
      - name: big-cpus
        rayStartParams: # optional, can be empty
          ray-debugger-external: "true" # allow debugger
          object-store-memory: "5000000000"
        ...

Sample project

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the KubeRay operator.

https://gitlab.com/instadeep/infra/ichor-demo-raytune

TensorFlow

TensorFlow (TF) is an open-source machine learning framework developed by Google Brain. It offers a comprehensive ecosystem of tools, libraries, and community resources for building and deploying machine learning models.

How to use TF on AIchor

AIchor users can use the TF operator by specifying it in the “operator” field of the manifest.

spec:
  operator: tf
  image: efficientnet
  command: "python train.py"

  types:
    Worker:
      count: 1
      resources:
        cpus: 5
        ramRatio: 2
        shmSizeGB: 0
        accelerators:
          gpu:
            count: 0
  ...

Available types

TF Operator comes with 4 types: Chief, PS (parameter server), Worker, and Evaluator. A TF experiment typically contains zero or more replicas of each of these types.

Distribution and connection

When starting an experiment with more than 1 container (sum of all types and counts), the operator injects the TF_CONFIG environment variable into all of the experiment's containers. This variable contains the addresses and ports of each of the experiment's components as stringified JSON.

Example value from the PS:

{
  "cluster": {
    "chief": ["experiment-e1dc568c-e211-chief-0.adr-smoke-de-3b4c7b78df264d12.svc:2222"],
    "ps": ["experiment-e1dc568c-e211-ps-0.adr-smoke-de-3b4c7b78df264d12.svc:2222"],
    "worker": [
      "experiment-e1dc568c-e211-worker-0.adr-smoke-de-3b4c7b78df264d12.svc:2222",
      "experiment-e1dc568c-e211-worker-1.adr-smoke-de-3b4c7b78df264d12.svc:2222"
    ]
  },
  "task": {
    "type": "ps",
    "index": 0
  },
  "environment": "cloud"
}
  • cluster lists all of the types with their addresses and ports.
  • task shows the current role and index; here the TF_CONFIG was retrieved from the PS container.
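
As a minimal sketch (assuming a standard TensorFlow setup), you can either let a tf.distribute strategy read TF_CONFIG automatically or inspect it yourself:

import json
import os

import tensorflow as tf

# TF_CONFIG is injected by the operator; inspect it to see this container's role.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
task = tf_config.get("task", {})
print(f"running as {task.get('type')} #{task.get('index')}")

# MultiWorkerMirroredStrategy picks up the cluster layout from TF_CONFIG.
strategy = tf.distribute.MultiWorkerMirroredStrategy()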

Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the TF operator.

https://gitlab.com/instadeep/infra/ichor-demo-efficientnet/

Documentation

The operator that manages TF experiments is the Kubeflow Training operator.

operator docs
examples

XGBoost

XGBoost (Extreme Gradient Boosting) is a powerful machine learning framework for gradient boosting.

How to use XGBoost on AIchor

AIchor users can use the XGBoost operator by specifying it in the “operator” field of the manifest.

spec:
  operator: xgboost
  image: xgboost-demo
  command: "python src/train.py"

  types:
    Master:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2
    Worker:
      count: 2
      resources:
        cpus: 8
        ramRatio: 2
  ...

Available types

XGBoost Operator comes with 2 types: Master and Worker. At least 1 worker must be set.

Environment variables injected by the operator

The following environment variables are injected to set up the distribution between the different containers.

MASTER_PORT: 9999
MASTER_ADDR: xgboost-dist-demo-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of the current container (from 0 to $WORLD_SIZE - 1)
WORKER_PORT: 9999
WORKER_ADDRS: xgboost-dist-demo-worker-0,xgboost-dist-demo-worker-1 # comma-separated values
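
As a minimal sketch (the tracker argument names below are the usual DMLC ones and may vary with your XGBoost version), these variables can be turned into the arguments expected by XGBoost's distributed tracker:

import os

# Variables injected by the operator (see the list above).
master_addr = os.environ["MASTER_ADDR"]
master_port = os.environ["MASTER_PORT"]
world_size = os.environ["WORLD_SIZE"]
rank = os.environ["RANK"]

# Assemble the DMLC-style arguments typically passed to XGBoost's
# rabit/collective initialization (the exact call depends on your XGBoost version).
tracker_env = [
    f"DMLC_TRACKER_URI={master_addr}",
    f"DMLC_TRACKER_PORT={master_port}",
    f"DMLC_NUM_WORKER={world_size}",
    f"DMLC_TASK_ID={rank}",
]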

Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the XGBoost operator.

https://gitlab.com/instadeep/infra/aichor/xgboost-demo

Documentation

The operator that manages XGBoost experiments is the Kubeflow Training operator.

docs
more docs
examples

This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.

PyTorch

PyTorch is a popular open-source deep learning framework primarily developed by Facebook's AI Research lab (FAIR). It provides a flexible and dynamic approach to building neural networks and conducting deep learning research.

How to use PyTorch Operator on AIchor

AIchor users can use the PyTorch operator by specifying it in the “operator” field of the manifest.

spec:
  operator: pytorch
  image: image
  command: "python3 src/main.py"

  types:
    Master:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2
        accelerators:
          gpu:
            count: 1
            type: gpu
            product: Quadro-RTX-4000
    Worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 3
        shmSizeGB: 0
        accelerators:
          gpu:
            count: 1
            type: gpu
            product: Quadro-RTX-4000

Available types

PyTorch Operator comes with 2 types: Master and Worker. At least 1 worker must be set.

Environment variables injected by the operator

Environment variables are injected to set up the distribution between the different containers.

MASTER_PORT: 23456
MASTER_ADDR: pytorch-dist-cifar-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of the current container (from 0 to $WORLD_SIZE - 1)
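
As a minimal sketch (assuming a standard torch.distributed setup), the default env:// initialization reads these variables directly:

import torch
import torch.distributed as dist

# The default env:// init method reads MASTER_ADDR, MASTER_PORT,
# WORLD_SIZE and RANK injected by the operator.
dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
print(f"rank {dist.get_rank()}/{dist.get_world_size()} initialized")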

Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the PyTorch operator.

https://gitlab.com/instadeep/infra/aichor/pytorch-demo

Documentation

The operator that manages PyTorch experiments is the Kubeflow Training operator.

operator docs
pytorch docs
pytorch docs about communication
examples

This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.

Jax operator

JAX is an open-source library developed by Google Research that provides a way to perform numerical computing and machine learning tasks with high performance and automatic differentiation.

How to use Jax on AIchor

AIchor users can use the Jax operator by specifying it in the “operator” field of the manifest.

spec:
  operator: jax
  image: image
  command: "python3 -u main.py --operator=jax --sleep=300 --tb-write=True"

  tensorboard:
    enabled: true

  types:
    Worker:
      count: 2
      resources:
        cpus: 1
        ramRatio: 2
        shmSizeGB: 0
        accelerators: # optional
          gpu:
            count: 1
            type: gpu
            product: NVIDIA-A100-SXM4-80GB
  ...

Available types

Jax Operator comes with 1 type: Worker. At least 1 worker must be set.

Environment variables injected by the operator

The following environment variables are injected to set up the distribution between the different containers.

JAXOPERATOR_COORDINATOR_ADDRESS: 0.0.0.0:1234
JAXOPERATOR_NUM_PROCESSES: 0
JAXOPERATOR_PROCESS_ID: 1
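
As a minimal sketch (using the public jax.distributed API), these variables can be wired into the distributed initialization:

import os

import jax

# Wire the operator-injected variables into JAX's distributed initialization.
jax.distributed.initialize(
    coordinator_address=os.environ["JAXOPERATOR_COORDINATOR_ADDRESS"],
    num_processes=int(os.environ["JAXOPERATOR_NUM_PROCESSES"]),
    process_id=int(os.environ["JAXOPERATOR_PROCESS_ID"]),
)

print(f"process {jax.process_index()} sees {jax.local_device_count()} local devices")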

Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the Jax operator.

https://gitlab.com/instadeep/infra/aichor/smoke-test-any-operator

Documentation

Docs
Examples