
Set up your code

To get started with AIchor, first host your code on a version control system (VCS): GitHub, GitLab, Bitbucket, or Azure DevOps.


Once the code is on a VCS, running experiments on it takes two steps:

  • Preparing a Dockerfile
  • Configuring a YAML file: manifest.yaml

Below is an example of a basic manifest file that should be in the root of the repository:

kind: AIchorManifest
apiVersion: 0.2.3
builder:
  image: image
  context: smoke-test # smoke-test folder
  dockerfile: ./build/Dockerfile
spec:
  operator: jobset # operator
  image: image
  command: "python3 -u main.py --operator=jobset --sleep=300 --tb-write=True" # command to be executed
  tensorboard:
    enabled: true
  types:
    Worker:
      count: 1 # Number of workers
      resources: # Minimum resources required to run the training
        cpus: 1 # Number of CPUs per worker
        ramRatio: 2 # CPU x ramRatio = RAM in GiB, here 2 GiB
        shmSizeGB: 0
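The resources comment in the manifest states that RAM is derived as CPU x ramRatio. The arithmetic can be sketched as follows. The function names are hypothetical and are not part of AIchor, and the assumption that totals scale linearly with the worker count is ours, based on the "per worker" wording in the manifest comments.

```python
def worker_ram_gib(cpus: int, ram_ratio: float) -> float:
    """RAM per worker in GiB, following the manifest comment: CPU x ramRatio."""
    return cpus * ram_ratio


def total_resources(count: int, cpus: int, ram_ratio: float) -> dict:
    """Aggregate request for one worker type (assumes linear scaling with count)."""
    return {
        "cpus": count * cpus,
        "ram_gib": count * worker_ram_gib(cpus, ram_ratio),
    }


# Values from the manifest above: count=1, cpus=1, ramRatio=2.
print(worker_ram_gib(cpus=1, ram_ratio=2))
print(total_resources(count=1, cpus=1, ram_ratio=2))
```

With the manifest's values this gives 2 GiB per worker, which matches the "here 2 GiB" comment.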

Let's use an example to describe the next steps.

The example we refer to is the following project, hosted on GitHub:

https://github.com/instadeepai/aichor-demo

Note: As you may have noticed, the only skill required to use AIchor is familiarity with Docker, so that you can dockerize your code if needed.

Note 2: The manifest above can be found in the demo repository.

You can fork and then clone this repository as is to get started with AIchor.

Below is a description of the code.

Sample script description

The Python script executed in the repository provides three configurable flags:

  • the --operator flag takes the name of the operator you are using
  • the --sleep flag takes a number of seconds to sleep before exiting
  • the --tb-write flag takes a boolean. If True, the script writes a text message to TensorBoard. The message is the commit message used to submit the experiment (VCS_COMMIT_MESSAGE).

You can edit these flags in the manifest under spec.command.

Depending on the operator selected, the script performs different tasks:
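As an illustration of how such flags could be parsed, here is a minimal sketch using Python's standard argparse module. It is not the demo repository's actual code: the helper names (str_to_bool, build_parser, tensorboard_text) and the default values are assumptions, and the TensorBoard write itself is left out so the sketch stays free of third-party dependencies.

```python
import argparse
import os


def str_to_bool(value: str) -> bool:
    """Parse 'True'/'False' strings; argparse's type=bool would treat 'False' as True."""
    normalized = value.strip().lower()
    if normalized in {"true", "1", "yes"}:
        return True
    if normalized in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical smoke-test entry point")
    parser.add_argument("--operator", required=True, help="operator name, e.g. jobset")
    parser.add_argument("--sleep", type=int, default=0, help="seconds to sleep before exiting")
    parser.add_argument("--tb-write", type=str_to_bool, default=False,
                        help="write the commit message to TensorBoard")
    return parser


def tensorboard_text(args, env=os.environ):
    """Return the text to log, or None when --tb-write is False."""
    if not args.tb_write:
        return None
    # VCS_COMMIT_MESSAGE is the variable named in the flag description above.
    return env.get("VCS_COMMIT_MESSAGE", "")


# Same flags as the spec.command in the manifest.
args = build_parser().parse_args(["--operator=jobset", "--sleep=300", "--tb-write=True"])
print(args.operator, args.sleep, args.tb_write)
```

The explicit str_to_bool converter is a deliberate choice: passing type=bool to argparse would turn the string "False" into True, because any non-empty string is truthy.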

  • KubeRay: With the KubeRay operator, the script connects to the Ray cluster and then prints the list of connected nodes.

Example:

connected nodes: [{'NodeID': 'f400d0bcbcfb962ca0a7c04fcc032f102d74354a6752420a0f05907c', 'Alive': True, 'NodeManagerAddress': '10.68.4.227', 'NodeManagerHostname': 'experiment-3e21cd11-06c0-job-fwwhf', 'NodeManagerPort': 12346, 'ObjectManagerPort': 12345, 'ObjectStoreSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/raylet', 'MetricsExportPort': 49427, 'NodeName': '10.68.4.227', 'alive': True, 'Resources': {'job': 100000.0, 'CPU': 1.0, 'object_store_memory': 573291724.0, 'node:10.68.4.227': 1.0, 'memory': 1337680692.0}}, ...}]
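The node list above is verbose, so a short summary can be easier to read. The sketch below condenses dictionaries of that shape into one line per node. The summarize_nodes helper is hypothetical, and the sample entry copies only a few fields from the output above. In a live Ray cluster the list would presumably come from ray.nodes() after ray.init(), but Ray is not imported here so that the sketch runs on its own.

```python
def summarize_nodes(nodes):
    """Return (address, alive, cpu_count) for each node dictionary."""
    summary = []
    for node in nodes:
        resources = node.get("Resources", {})
        summary.append((
            node.get("NodeManagerAddress"),
            node.get("Alive", False),
            resources.get("CPU", 0.0),
        ))
    return summary


# Sample entry reusing field names and values from the output above.
sample_nodes = [
    {
        "NodeManagerAddress": "10.68.4.227",
        "Alive": True,
        "Resources": {"CPU": 1.0, "memory": 1337680692.0},
    }
]

for address, alive, cpus in summarize_nodes(sample_nodes):
    print(f"{address} alive={alive} cpus={cpus}")
```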

  • Jax: With the Jax operator, the script prints environment variables injected by the operator:
    • JAXOPERATOR_COORDINATOR_ADDRESS
    • JAXOPERATOR_COORDINATOR_HOST
    • JAXOPERATOR_NUM_PROCESSES
    • JAXOPERATOR_PROCESS_ID

Example:

coordinator address: 0.0.0.0:1234
num processes: 1
process id: 0
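Reading these variables takes only the standard library. The following is a minimal sketch, not the demo script itself: the read_jax_operator_env helper and its fallback defaults (1 process, process id 0) are assumptions made so that it also runs outside a cluster.

```python
import os


def read_jax_operator_env(env=os.environ):
    """Collect the variables injected by the Jax operator into a dictionary."""
    return {
        "coordinator_address": env.get("JAXOPERATOR_COORDINATOR_ADDRESS"),
        "coordinator_host": env.get("JAXOPERATOR_COORDINATOR_HOST"),
        # Fallback defaults are assumptions for running outside the operator.
        "num_processes": int(env.get("JAXOPERATOR_NUM_PROCESSES", "1")),
        "process_id": int(env.get("JAXOPERATOR_PROCESS_ID", "0")),
    }


config = read_jax_operator_env()
print("coordinator address:", config["coordinator_address"])
print("num processes:", config["num_processes"])
print("process id:", config["process_id"])
```

In a multi-process job, these values are the kind of arguments that jax.distributed.initialize accepts (coordinator_address, num_processes, process_id). Whether the demo script calls it is not stated here.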

More details can be found in the project's README.