
Manifest File

Please refer to the last section of this page to enable auto-completion of your manifest file in your IDE.

Below is the list of fields available to build the manifest file:

  • kind: the only valid value is AIchorManifest;
  • apiVersion: the current version is 0.2.2; apiVersion: 0.2.1 can be used as well;
  • builder: the docker image, dockerfile and context are specified in this section;
kind: AIchorManifest
apiVersion: 0.2.2

builder:
  image: myimage
  dockerfile: ./Dockerfile
  context: .
...
  • spec:
    • operator: valid values so far are jax, tf and ray;
    • image: the docker image used;
    • command: the command or script to be executed;
    • tensorboard (optional for all operators):
      • enabled: valid values are true or false;
    • storage: two types of storage can be defined here:
      • Persistent: mounts already existing PVC(s) to all pods of the experiment;
      • Ephemeral: AIchor creates a shared volume, mounts it to all pods of the experiment, and deletes it at the end of the experiment;
        • NOTE: the shared volume configuration goes under the storage key.
kind: AIchorManifest
apiVersion: 0.2.2
...
spec:
  ...
  storage: # optional
    sharedVolume: # optional
      mountPoint: "/mnt/shared"
      sizeGB: 16
    attachExistingPVCs: # optional, array
      - name: "my-awesome-pvc"
        mountPoint: "/mnt/my-60tib-dataset"
...

Still under spec:

  • spec ...

    • types: resources used;

      • workers: this type is required for the Ray operator (Workers) and the Jax operator (Worker), but is optional for tf;

        - `count` is the number of workers;
        - `resources`:
          - `cpus` is the number of CPUs per worker;
          - `cpuLimitRatio`: the default value is 2 (the CPU limit is then 200%, i.e. CPU usage is allowed to burst up to 2 x the CPUs requested). The CPU limit can be fixed to the exact value requested in the manifest by setting this value to 1; in that case, CPU usage on the running pod will not exceed the requested value.
          - `ramRatio` is multiplied by the number of CPUs to get the RAM in GB; for example, for 2 CPUs and ramRatio 3, the RAM is 2x3 = 6 GB;
          - `shmSizeGB` is optional and is an integer.

        If you provide `shmSizeGB` > 0: the memory request and limit will be increased by `shmSizeGB`.

        Example: if you set `cpus = 16`, `ramRatio = 3` and `shmSizeGB = 10`, then the memory request and limit will be set to `58G` (16*3+10) and a shm volume of `10G` will be created and mounted.

        If you provide `shmSizeGB = 0`: no shm volume will be created or mounted, and the memory request and limit will remain unchanged (cpus*ramRatio).

        If you don't provide any `shmSizeGB` (`shmSizeGB = null`): the memory request and limit will remain unchanged (cpus*ramRatio), but a shm volume will still be created, with a size of 10% of the memory limit.

        Example: if you set `cpus = 16`, `ramRatio = 3` and `shmSizeGB = null`, then the memory request and limit will be set to `48G` (16*3) and the shm volume (counted inside memory) will have a size of `4800M` (10% of 48G). The shm volume is mounted at /dev/shm.

        - `accelerators` is optional, contains information related to non-CPU hardware, and is extendable.

        The examples below show a GPU accelerator and what a TPU accelerator could look like (TPUs are still not supported):

resources:
  cpus: 5
  ramRatio: 3

  machineName: "<optional>"
  shmSizeGB: 10 # optional

  accelerators: # optional
    gpu: # optional
      count: 1
      type: "gpu" # choices can be: gpu, mig-1g.10gb, mig-3g.20gb, mig-3g.40gb
      product: Tesla-V100-SXM3-32GB

accelerators: # what a TPU accelerator could look like (not supported yet)
  tpu:
    type: "v4-128"
    gcp:
      project: "a-project-with-tpus"
      zone: "us-central1-b"
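The memory and shm sizing rules described above can be sketched in Python (an illustrative sketch only; AIchor performs this computation itself, and the helper name is made up):

```python
def memory_and_shm(cpus, ram_ratio, shm_size_gb=None):
    """Return (memory request/limit in GB, shm volume size in MB),
    following the shmSizeGB rules described above. Illustrative only."""
    mem_gb = cpus * ram_ratio
    if shm_size_gb is None:
        # shmSizeGB omitted: memory unchanged, shm defaults to 10% of it.
        return mem_gb, mem_gb * 100  # 10% of (mem_gb * 1000) MB
    if shm_size_gb == 0:
        # shmSizeGB = 0: no shm volume is created or mounted.
        return mem_gb, 0
    # shmSizeGB > 0: memory grows by shmSizeGB; shm volume of that size.
    return mem_gb + shm_size_gb, shm_size_gb * 1000

print(memory_and_shm(16, 3, 10))  # (58, 10000) -> 58G memory, 10G shm
print(memory_and_shm(16, 3))      # (48, 4800)  -> 48G memory, 4800M shm
print(memory_and_shm(16, 3, 0))   # (48, 0)     -> 48G memory, no shm
```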

In the case of the tf operator

  • Any of the TF types "PS", "Chief" or "Evaluator" can also be used together with "Worker" (cf. documentation). They are optional, but at least one type has to be used.
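For illustration, a tf spec.types section combining Worker with Chief and PS could look like this (the counts and resource values here are made up):

```yaml
types:
  Chief:
    count: 1
    resources:
      cpus: 4
      ramRatio: 2
  PS:
    count: 2
    resources:
      cpus: 2
      ramRatio: 2
  Worker:
    count: 4
    resources:
      cpus: 8
      ramRatio: 3
```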

In the case of the Ray operator

  • in addition to the workers, Head and Job nodes have to be specified in terms of resources (Workers, Head and Job are all required in the manifest). For Ray, the workers entry is actually an array of worker pools;
  • objectStoreMemorySizeGB (this field can only be used with apiVersion 0.2.2) is another optional integer that lets Ray users customize the value of Ray's --object-store-memory flag on a specific spec.types entry (Head, Job, or a worker pool). Below is an example of using this field on Head.
kind: AIchorManifest
apiVersion: 0.2.2
...

spec:
  operator: ray
  ...

  types:
    Head:
      objectStoreMemorySizeGB: 3 # current field here
      resources:
        ...

    Job:
      resources:
        ...

    Workers:
      - name: cpu-workers
        count: 24
        resources:
          ...

As you may have noticed in the Ray documentation, Ray expects this value in bytes, whereas in the manifest you provide it in GB.

AIchor will inject the flag with your value scaled to bytes: --object-store-memory=<manifest.spec.types.[Head|Job|Workers].objectStoreMemorySizeGB * 10^9>.

Example:

If you set objectStoreMemorySizeGB: 3 then AIchor will inject the flag --object-store-memory=3000000000.
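The scaling can be sketched with a small helper (a hypothetical name; AIchor performs this injection itself):

```python
def object_store_memory_flag(size_gb):
    """Build the --object-store-memory flag that Ray receives,
    scaling the manifest's GB value to bytes (x 10^9)."""
    return f"--object-store-memory={size_gb * 10**9}"

print(object_store_memory_flag(3))  # --object-store-memory=3000000000
```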

Please refer to the Ray documentation for more details:

https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html

https://docs.ray.io/en/master/cluster/cli.html#cmdoption-ray-start-object-store-memory

In the case of the Jax operator

  • the type Worker is required;
  • Note: for Ray experiments, the value of shmSizeGB needs to be slightly greater than objectStoreMemorySizeGB. This is due to Redis consuming part of the shm, although the exact size is not clear in the Ray documentation. For instance, for an objectStoreMemorySizeGB of 32, we would recommend 34 for shmSizeGB. Please refer to the Ray documentation for more details:

https://docs.ray.io/en/latest/ray-core/scheduling/memory-management.html

https://docs.ray.io/en/master/cluster/cli.html#cmdoption-ray-start-object-store-memory

Below are examples of manifests for different operators.

Examples

Ray

kind: AIchorManifest
apiVersion: 0.2.2

builder:
  image: melqart
  dockerfile: ./Dockerfile
  context: .

spec:
  operator: ray
  image: melqart
  command: "python train.py"
  rayVersion: "v2.2"

  tensorboard: # optional, disabled by default
    enabled: true

  storage: # optional
    sharedVolume: # optional
      mountPoint: "/mnt/shared"
      sizeGB: 16
    attachExistingPVCs: # optional, array
      - name: "my-awesome-pvc"
        mountPoint: "/mnt/my-60tib-dataset"

  # Ray types are: Head, Job, Workers
  # They are all required
  # At least one worker must be set
  types:
    Head:
      ports: [] # optional
      resources:
        cpus: 10
        ramRatio: 2

        # machineName: "dgx" # optional
        shmSizeGB: 48 # optional

        accelerators: # optional
          gpu:
            count: 2
            product: Tesla-V100-SXM3-32GB
            type: gpu

    Job:
      ports: [] # optional
      resources:
        cpus: 10
        ramRatio: 2

        # machineName: "node007" # optional
        shmSizeGB: 0 # optional

    Workers:
      - name: "cpu-workers"
        count: 2
        ports: [] # optional
        resources:
          cpus: 1
          ramRatio: 2
          # machineName: "node007" # optional
          shmSizeGB: 0 # optional

TF

kind: AIchorManifest
apiVersion: 0.2.2

builder:
  image: ridl
  dockerfile: ./Dockerfile
  context: .
  buildArgs:
    USE_CUDA: "true"

spec:
  operator: tf
  image: ridl
  command: python examples/train.py

  tensorboard: # optional, disabled by default
    enabled: true

  storage: # optional
    sharedVolume: # optional
      mountPoint: "/mnt/shared"
      sizeGB: 16
    attachExistingPVCs: # optional, array
      - name: "my-awesome-pvc"
        mountPoint: "/mnt/my-60tib-dataset"

  # At least one type is required.
  # If you are not using Evaluator for example, you can set its count to 0
  # or not write it at all.
  # Available types are: Worker, Master, PS, Chief, Evaluator
  types:
    Worker:
      count: 1
      resources:
        cpus: 20
        ramRatio: 3

        # machineName: "dgx" # optional
        shmSizeGB: 10 # optional

        accelerators: # optional
          gpu:
            count: 1
            # options: gpu, mig-1g.10gb, mig-3g.20gb, mig-3g.40gb
            type: gpu
            # options: Tesla-V100-SXM3-32GB, A100-SXM4-40GB, A100-SXM-80GB
            product: Tesla-V100-SXM3-32GB

Jax

kind: AIchorManifest
apiVersion: 0.2.2

builder:
  image: ridl
  dockerfile: ./Dockerfile
  context: .

spec:
  operator: jax
  image: ridl
  command: python examples/train.py

  tensorboard: # optional, disabled by default
    enabled: true

  storage: # optional
    sharedVolume: # optional
      mountPoint: "/mnt/shared"
      sizeGB: 16
    attachExistingPVCs: # optional, array
      - name: "my-awesome-pvc"
        mountPoint: "/mnt/my-60tib-dataset"

  types:
    Worker:
      count: 2
      resources:
        cpus: 16
        ramRatio: 3

        # machineName: "dgx" # optional
        shmSizeGB: 48 # optional

        accelerators: # optional
          gpu:
            count: 1
            product: Tesla-V100-SXM3-32GB
            type: gpu

Write a manifest

To make editing manifest.yaml easier, the AIchor team maintains a JSON Schema that enables auto-completion, syntax validation and documentation self-discovery within your preferred IDE.

Also, the same schema will be used by AIchor to validate your manifest before an experiment so that it fails early when there is any misconfiguration.

All in all, we wish to lower the barrier of entry to the world of AIchor as much as possible.


You can see the raw JSON schema here (best opened in Firefox or an IDE): https://instadeep.aichor.ai/schema/latest/manifest.schema.json

  • IDE setup

    • VSCode

      • Install the YAML extension

      • Add at the top of your manifest.yaml file

        • # yaml-language-server: $schema=https://instadeep.aichor.ai/schema/latest/manifest.schema.json

        Alternatively, you can automatically assign this schema URL to any file named manifest.yaml via your workspace or global settings.

    • PyCharm

      • You can assign a schema in the bottom right of the screen when editing a YAML file and fill in the pop-up window accordingly.


  • Writing the manifest

    Hitting Ctrl+Space will pop up suggestions for filling in fields, sometimes with descriptions (in VS Code).
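Putting it together, the top of a manifest.yaml wired up for auto-completion could look like this (the schema URL is the one given above):

```yaml
# yaml-language-server: $schema=https://instadeep.aichor.ai/schema/latest/manifest.schema.json
kind: AIchorManifest
apiVersion: 0.2.2
```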