Set up your code
To get started with AIchor, your code must be hosted on a VCS (GitHub, GitLab, Bitbucket, or Azure DevOps).

Once the code is on a VCS, running experiments takes just two steps:
- Preparing a Dockerfile
- Configuring a YAML file: manifest.yaml
Below is an example of a basic manifest file that should be in the root of the repository:
kind: AIchorManifest
apiVersion: 0.2.3
builder:
  image: image
  context: smoke-test # smoke-test folder
  dockerfile: ./build/Dockerfile
spec:
  operator: jobset # operator
  image: image
  command: "python3 -u main.py --operator=jobset --sleep=300 --tb-write=True" # command to be executed
  tensorboard:
    enabled: true
  types:
    Worker:
      count: 1 # Number of workers
      resources: # Minimum resources required to run the training
        cpus: 1 # Number of CPUs per worker
        ramRatio: 2 # CPU x ramRatio = RAM in GiB, here 2 GiB
        shmSizeGB: 0
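The resource comments in the manifest imply simple arithmetic: the RAM per worker is cpus x ramRatio, in GiB. The sketch below works through that calculation on a plain dictionary that mirrors the types section. The `worker_ram_gib` helper and the dictionary are illustrative assumptions, not part of AIchor, and the total is simply the per-worker figure multiplied by count.

```python
# Illustrative only: the "types" section of the manifest, written as a plain dict.
manifest_types = {
    "Worker": {
        "count": 1,
        "resources": {"cpus": 1, "ramRatio": 2, "shmSizeGB": 0},
    }
}


def worker_ram_gib(type_spec):
    """Return (RAM per worker, RAM summed over all workers) in GiB."""
    resources = type_spec["resources"]
    # CPU x ramRatio = RAM in GiB, as stated in the manifest comment.
    per_worker = resources["cpus"] * resources["ramRatio"]
    return per_worker, per_worker * type_spec["count"]


per_worker, total = worker_ram_gib(manifest_types["Worker"])
print(f"RAM per worker: {per_worker} GiB, total: {total} GiB")
```

With the values from the manifest above, each worker gets 1 x 2 = 2 GiB.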
Let's use an example to describe the next steps.
The example we refer to is the following project, hosted on GitHub:
https://github.com/instadeepai/aichor-demo
Note: The only skill required to use AIchor is familiarity with Docker, so that you can dockerize your code if needed.
Note 2: The manifest above can be found in the demo repository.
You can fork and then clone this repository as it is to get started with AIchor.
Below is a description of the code.
Sample script description
The Python script executed in the repository provides three configurable flags:
- the --operator flag depends on the operator you are using
- the --sleep flag takes a number of seconds to sleep before exiting
- the --tb-write flag takes a boolean; if True, the script writes a text message to TensorBoard. The message is the commit message used to submit the experiment (VCS_COMMIT_MESSAGE).
You can edit these flags in the manifest under spec.command.
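To make the flag semantics concrete, here is a minimal sketch of how such flags could be parsed with argparse. It is a hypothetical stand-in, not the demo repository's actual main.py: the helper names (`str_to_bool`, `build_parser`, `tensorboard_text`) and the defaults are assumptions, and the TensorBoard write itself is left out.

```python
import argparse
import os


def str_to_bool(value):
    """Parse 'True'/'False' strings; argparse's type=bool would treat 'False' as True."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical stand-in for the demo flags")
    parser.add_argument("--operator", default="jobset", help="operator in use")
    parser.add_argument("--sleep", type=int, default=0, help="seconds to sleep before exiting")
    parser.add_argument("--tb-write", type=str_to_bool, default=False,
                        help="write the commit message to TensorBoard")
    return parser


def tensorboard_text(args):
    """Return the text that would be logged, or None when --tb-write is False."""
    if not args.tb_write:
        return None
    # VCS_COMMIT_MESSAGE is the variable named in the flag description above.
    return os.environ.get("VCS_COMMIT_MESSAGE", "")


# Same flags as the command in spec.command, passed explicitly for illustration.
args = build_parser().parse_args(["--operator=jobset", "--sleep=300", "--tb-write=True"])
print(args.operator, args.sleep, args.tb_write)
```

The explicit boolean parser matters because the manifest passes the value as the string True; a plain bool conversion would treat any non-empty string, including False, as true.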
Depending on the operator selected, the script performs different tasks:
- KubeRay: For the KubeRay operator, the script connects to the Ray cluster and then prints the list of connected nodes.
Example:
connected nodes: [{'NodeID': 'f400d0bcbcfb962ca0a7c04fcc032f102d74354a6752420a0f05907c', 'Alive': True, 'NodeManagerAddress': '10.68.4.227', 'NodeManagerHostname': 'experiment-3e21cd11-06c0-job-fwwhf', 'NodeManagerPort': 12346, 'ObjectManagerPort': 12345, 'ObjectStoreSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/raylet', 'MetricsExportPort': 49427, 'NodeName': '10.68.4.227', 'alive': True, 'Resources': {'job': 100000.0, 'CPU': 1.0, 'object_store_memory': 573291724.0, 'node:10.68.4.227': 1.0, 'memory': 1337680692.0}}, ...}]
- Jax: For the Jax operator, the script prints environment variables injected by the operator:
  - JAXOPERATOR_COORDINATOR_ADDRESS
  - JAXOPERATOR_COORDINATOR_HOST
  - JAXOPERATOR_NUM_PROCESSES
  - JAXOPERATOR_PROCESS_ID
Example:
coordinator address: 0.0.0.0:1234
num processes: 1
process id: 0
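The Jax case above can be sketched as a small lookup of the injected environment variables. This is an assumption about how one might read them, not the demo script's actual code. The variable names come from the list above, while `read_jax_operator_env` is a hypothetical helper. Outside a Jax operator experiment the variables are unset, so the values print as None.

```python
import os

# Variable names as listed above; they are set only inside a Jax operator experiment.
JAX_ENV_VARS = (
    "JAXOPERATOR_COORDINATOR_ADDRESS",
    "JAXOPERATOR_COORDINATOR_HOST",
    "JAXOPERATOR_NUM_PROCESSES",
    "JAXOPERATOR_PROCESS_ID",
)


def read_jax_operator_env(environ=os.environ):
    """Return the injected values as a dict, with None for any unset variable."""
    return {name: environ.get(name) for name in JAX_ENV_VARS}


values = read_jax_operator_env()
print("coordinator address:", values["JAXOPERATOR_COORDINATOR_ADDRESS"])
print("num processes:", values["JAXOPERATOR_NUM_PROCESSES"])
print("process id:", values["JAXOPERATOR_PROCESS_ID"])
```

These three values line up with the coordinator_address, num_processes, and process_id parameters of jax.distributed.initialize, although the source does not state whether the demo script passes them on.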
More details can be found in the project's README.