Set up your code

To get started with AIchor, first host your code on a version control system (VCS): GitHub, GitLab, or Bitbucket.

Once the code is on a VCS, running experiments takes just two steps:

Preparing a Dockerfile
Configuring a YAML file: manifest.yaml

Let's use an example to describe the next steps.

The example we will refer to is the following project, which is hosted on GitHub:

https://github.com/instadeepai/aichor-demo

Note: As you may notice, the only skill required to use AIchor is familiarity with Docker, so that you can dockerize your code if needed.

You can fork and then clone this repository as is to get started with AIchor.

Below is a description of the code.

Sample script description

The Python script executed in the repository provides three configurable flags:

the --operator flag depends on the operator you are using
the --sleep flag takes a number of seconds to sleep before exiting
the --tb-write flag takes a boolean; if True, the script writes a text message to TensorBoard. The message is the commit message used to submit the experiment (VCS_COMMIT_MESSAGE).

You can edit these flags in the manifest, under spec.command.
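
As an illustration of the three flags described above, here is a minimal sketch of how such a script might parse them with argparse. It is not the code from the repository: the operator spellings ("ray", "tf", "jax"), the default values, and the str_to_bool helper are assumptions made for this example.

```python
import argparse
import time


def str_to_bool(value: str) -> bool:
    """Parse common textual booleans such as "True" or "false"."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the demo script flags")
    # Operator spellings are assumed for this example, not taken from the repository.
    parser.add_argument("--operator", choices=["ray", "tf", "jax"], required=True)
    parser.add_argument("--sleep", type=int, default=0,
                        help="seconds to sleep before exiting")
    parser.add_argument("--tb-write", type=str_to_bool, default=False,
                        help="if True, write the commit message to TensorBoard")
    return parser


# Parse an explicit argument list, similar to what spec.command might contain.
args = build_parser().parse_args(
    ["--operator", "jax", "--sleep", "0", "--tb-write", "True"]
)
print(args.operator, args.sleep, args.tb_write)
time.sleep(args.sleep)  # sleep for the requested number of seconds before exiting
```

Passing the argument list explicitly keeps the sketch independent of how the command is actually launched; a real script would typically call parse_args() without arguments to read the command line.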

Depending on the operator selected, the script performs a different task:

Ray:
For the Ray operator, the script connects to the Ray cluster and then prints the list of connected nodes.

Example:

connected nodes: [{'NodeID': 'f400d0bcbcfb962ca0a7c04fcc032f102d74354a6752420a0f05907c', 'Alive': True, 'NodeManagerAddress': '10.68.4.227', 'NodeManagerHostname': 'experiment-3e21cd11-06c0-job-fwwhf', 'NodeManagerPort': 12346, 'ObjectManagerPort': 12345, 'ObjectStoreSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/raylet', 'MetricsExportPort': 49427, 'NodeName': '10.68.4.227', 'alive': True, 'Resources': {'job': 100000.0, 'CPU': 1.0, 'object_store_memory': 573291724.0, 'node:10.68.4.227': 1.0, 'memory': 1337680692.0}}, ...}]
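
The Ray behaviour described above could be sketched roughly as follows. This is an assumption about how such a script might be written, not the repository's code: in particular, connecting with address="auto" and the summarize_nodes helper are choices made for this example. The field names used by the helper are the ones visible in the sample output.

```python
from typing import Any


def summarize_nodes(nodes: list[dict[str, Any]]) -> list[tuple[str, bool]]:
    """Reduce Ray node records to (hostname, alive) pairs."""
    return [(node["NodeManagerHostname"], node["Alive"]) for node in nodes]


def list_connected_nodes() -> list[dict[str, Any]]:
    """Connect to an already running Ray cluster and return its node records."""
    import ray  # imported lazily so summarize_nodes works without Ray installed

    ray.init(address="auto")  # attach to the existing cluster (assumed setup)
    try:
        return ray.nodes()
    finally:
        ray.shutdown()


# Usage with one trimmed record copied from the sample output:
sample = [{"NodeManagerHostname": "experiment-3e21cd11-06c0-job-fwwhf", "Alive": True}]
print(summarize_nodes(sample))
```

Inside a running experiment, print(f"connected nodes: {list_connected_nodes()}") would produce output in the same shape as the example above.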

TF:
For the TF operator, the script prints the TF_CONFIG environment variable. Note that if spec.Worker.count is 1, TF_CONFIG is not set and the script prints None:

Example:

tf_config: None
tf_config is None because worker count = 1
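
A minimal sketch of reading TF_CONFIG, assuming the script simply looks the variable up in its environment. The read_tf_config helper is hypothetical; when the variable is present, TensorFlow expects it to hold a JSON document, so the sketch decodes it.

```python
import json
import os
from typing import Optional


def read_tf_config() -> Optional[dict]:
    """Return the decoded TF_CONFIG value, or None when the variable is not set."""
    raw = os.environ.get("TF_CONFIG")
    if raw is None:
        return None
    return json.loads(raw)


tf_config = read_tf_config()
print(f"tf_config: {tf_config}")
if tf_config is None:
    # Expected when the worker count is 1, as described above.
    print("TF_CONFIG is not set")
```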

Jax:
For the Jax operator, the script prints three environment variables injected by the operator:
- JAXOPERATOR_COORDINATOR_ADDRESS
- JAXOPERATOR_NUM_PROCESSES
- JAXOPERATOR_PROCESS_ID

Example:

coordinator address: 0.0.0.0:1234
num processes: 1
process id: 0
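
Reading these three variables could look like the sketch below. The JaxOperatorEnv tuple and the helper functions are hypothetical names introduced for this example; only the environment variable names come from the description above.

```python
import os
from typing import NamedTuple, Optional


class JaxOperatorEnv(NamedTuple):
    coordinator_address: Optional[str]
    num_processes: Optional[int]
    process_id: Optional[int]


def _optional_int(name: str) -> Optional[int]:
    """Return the variable as an int, or None when it is not set."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else None


def read_jax_operator_env() -> JaxOperatorEnv:
    """Collect the variables injected by the Jax operator."""
    return JaxOperatorEnv(
        coordinator_address=os.environ.get("JAXOPERATOR_COORDINATOR_ADDRESS"),
        num_processes=_optional_int("JAXOPERATOR_NUM_PROCESSES"),
        process_id=_optional_int("JAXOPERATOR_PROCESS_ID"),
    )


env = read_jax_operator_env()
print(f"coordinator address: {env.coordinator_address}")
print(f"num processes: {env.num_processes}")
print(f"process id: {env.process_id}")
```

The three values line up with the coordinator_address, num_processes, and process_id parameters of jax.distributed.initialize, which is presumably where a multi-process training script would pass them.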

More details can be found in the README of the project.
