Set up your code
To get started with AIchor, your code must be hosted on a VCS (GitHub, GitLab, or Bitbucket).
Once the code is on a VCS, running experiments on it takes just two steps:
- Preparing a Dockerfile
- Configuring a YAML file: manifest.yaml
Let's use an example to describe the next steps.
The example we will be referring to is this project which is stored on GitHub:
https://github.com/instadeepai/aichor-demo
Note: As you may notice, the only skill required to use AIchor is familiarity with Docker, so that you can dockerize your code if it is not already.
You can fork and then clone this repository as is to get started with AIchor.
Below is a description of the code.
Sample script description
The Python script executed in the repository provides 3 configurable flags:
- the --operator flag depends on the operator you are using
- the --sleep flag takes a number of seconds to sleep before exiting
- the --tb-write flag takes a boolean; if True, the script writes a text message to TensorBoard. The message is the commit message used to submit the experiment (VCS_COMMIT_MESSAGE).
You can edit these flags in the manifest at the spec.command location.
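As a rough illustration of how three flags like these could be declared, here is a minimal sketch using Python's standard argparse module. It is not the demo repository's actual code: the helper names (str_to_bool, build_parser), the defaults, and the example operator value "jax" are assumptions made for illustration only.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse common textual spellings of a boolean flag value."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags described above (hypothetical defaults)."""
    parser = argparse.ArgumentParser(description="Sketch of the demo script flags")
    # The accepted operator names are an assumption; check the repository README.
    parser.add_argument("--operator", type=str, required=True,
                        help="operator the experiment runs with")
    parser.add_argument("--sleep", type=int, default=0,
                        help="number of seconds to sleep before exiting")
    parser.add_argument("--tb-write", type=str_to_bool, default=False,
                        help="if True, write the commit message to TensorBoard")
    return parser


# Parse an explicit argument list, as the manifest command might pass it.
args = build_parser().parse_args(
    ["--operator", "jax", "--sleep", "5", "--tb-write", "True"]
)
print(args.operator, args.sleep, args.tb_write)
```

Accepting the boolean as text rather than as a bare switch matches the description of --tb-write as a flag that "takes a boolean".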
Depending on the operator selected, the script actually does different tasks:
- KubeRay:
For the KubeRay operator the script connects to the Ray cluster, then prints the list of connected nodes.
Example:
connected nodes: [{'NodeID': 'f400d0bcbcfb962ca0a7c04fcc032f102d74354a6752420a0f05907c', 'Alive': True, 'NodeManagerAddress': '10.68.4.227', 'NodeManagerHostname': 'experiment-3e21cd11-06c0-job-fwwhf', 'NodeManagerPort': 12346, 'ObjectManagerPort': 12345, 'ObjectStoreSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2023-07-21_02-29-33_623409_1/sockets/raylet', 'MetricsExportPort': 49427, 'NodeName': '10.68.4.227', 'alive': True, 'Resources': {'job': 100000.0, 'CPU': 1.0, 'object_store_memory': 573291724.0, 'node:10.68.4.227': 1.0, 'memory': 1337680692.0}}, ...}]
- TF:
For the TF operator the script prints the TF_CONFIG env var. Note that if spec.Worker.count: 1, the TF_CONFIG env var does not exist and the script prints None:
Example:
tf_config: None
tf_config is None because worker count = 1
- Jax:
For the Jax operator the script prints 3 env vars injected by the operator:
- JAXOPERATOR_COORDINATOR_ADDRESS
- JAXOPERATOR_NUM_PROCESSES
- JAXOPERATOR_PROCESS_ID
Example:
coordinator address: 0.0.0.0:1234
num processes: 1
process id: 0
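The TF and Jax cases above both come down to reading environment variables set by the operator. The following is a minimal sketch of that idea using only the standard library; it is not the demo script itself. The function names (read_tf_config, read_jax_settings) are hypothetical, and decoding TF_CONFIG as JSON is an assumption based on how TensorFlow normally encodes that variable.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional


def read_tf_config(env: Mapping[str, str] = os.environ) -> Optional[Any]:
    """Return the decoded TF_CONFIG value, or None when the variable is not set."""
    raw = env.get("TF_CONFIG")
    # Assumption: when present, TF_CONFIG holds a JSON document.
    return json.loads(raw) if raw else None


def read_jax_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Collect the three variables the Jax operator is described as injecting."""
    return {
        "coordinator address": env.get("JAXOPERATOR_COORDINATOR_ADDRESS"),
        "num processes": env.get("JAXOPERATOR_NUM_PROCESSES"),
        "process id": env.get("JAXOPERATOR_PROCESS_ID"),
    }


# Outside an experiment these variables are usually unset, so None is printed.
print(f"tf_config: {read_tf_config()}")
for label, value in read_jax_settings().items():
    print(f"{label}: {value}")
```

Passing the environment in as a mapping keeps the functions easy to exercise locally with a plain dict, without having to launch an experiment.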
More details can be found in the README of the project.