PyTorch

PyTorch is a popular open-source deep learning framework originally developed by Facebook's AI Research lab (FAIR). It provides a flexible, dynamic approach to building neural networks and conducting deep learning research. Choose the pytorch operator on AIchor when you want distributed training across multiple worker pods using the standard PyTorch distributed environment variables.

How to use

Select PyTorch by setting spec.operator: pytorch in your manifest. The full field-by-field specification lives in the Manifest Reference, and more complete examples are in the Manifest Examples.

spec:
  operator: "pytorch"
  image: "image"
  command: "python3 src/main.py"

  types:
    worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 3
        accelerators:
          gpu:
            count: 1
            type: "gpu"
            product: "Quadro-RTX-4000"

Injected environment variables

The following environment variables are injected into every worker container so that the containers can find each other and set up distributed communication:

| Variable | Description | Example |
| --- | --- | --- |
| `MASTER_PORT` | Port the master listens on for the distributed rendezvous. | `23456` |
| `MASTER_ADDR` | Address of the master container that the others connect to. | `pytorch-dist-cifar-master-0` |
| `WORLD_SIZE` | Total number of containers in the run. | `3` |
| `RANK` | Rank of the current container, from `0` to `WORLD_SIZE - 1`. | `1` |
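As a minimal sketch of how a training script might consume these variables, the snippet below reads and validates them, then hands off to `torch.distributed`. The names `DistEnv`, `read_dist_env`, and `init_distributed` are hypothetical helpers introduced here for illustration, not part of AIchor or PyTorch. The `gloo` default backend and the treatment of rank 0 as the master are assumptions. With `init_method="env://"`, PyTorch reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` directly from the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


@dataclass(frozen=True)
class DistEnv:
    """Parsed view of the injected distributed-training variables (hypothetical helper)."""
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self) -> bool:
        # Assumption: rank 0 is the container that hosts the rendezvous.
        return self.rank == 0


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read and validate the four variables from the given mapping."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    cfg = DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )
    # RANK must lie in the range 0 .. WORLD_SIZE - 1.
    if not 0 <= cfg.rank < cfg.world_size:
        raise ValueError(f"RANK {cfg.rank} is outside 0..{cfg.world_size - 1}")
    return cfg


def init_distributed(backend: str = "gloo") -> DistEnv:
    """Join the process group using the injected variables."""
    # Imported lazily so the parsing helper above works without PyTorch installed.
    import torch.distributed as dist

    cfg = read_dist_env()
    # "env://" tells PyTorch to read MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK itself.
    dist.init_process_group(backend=backend, init_method="env://")
    return cfg
```

In a script such as `src/main.py`, calling `cfg = init_distributed("nccl")` near the top would be a typical choice for GPU workers, after which `cfg.is_master` can gate work that should happen once per run, such as logging or writing checkpoints.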

External documentation

Demo projects

The AIchor team has shared the following demo projects that can be cloned and used for AIchor experiments using PyTorch:
