PyTorch

PyTorch is a popular open-source deep learning framework originally developed by Facebook's AI Research lab (FAIR). It offers a flexible, dynamic approach to building neural networks and running deep learning research. Choose the pytorch operator on AIchor when you want distributed training across multiple worker pods using the standard PyTorch distributed environment variables.

How to use

Select PyTorch by setting spec.operator: pytorch in your manifest. The full field-by-field specification lives in the Manifest Reference, and more complete examples are in the Manifest Examples.

spec:
  operator: "pytorch"
  image: "image"
  command: "python3 src/main.py"

  types:
    worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 3
        accelerators:
          gpu:
            count: 1
            type: "gpu"
            product: "Quadro-RTX-4000"

Injected environment variables

The following environment variables are injected into every worker container so that the containers can find each other and set up distributed communication:

| Variable | Description | Example |
| --- | --- | --- |
| MASTER_PORT | Port the master listens on for the distributed rendezvous. | 23456 |
| MASTER_ADDR | Address of the master container that the others connect to. | pytorch-dist-cifar-master-0 |
| WORLD_SIZE | Total number of containers in the run. | 3 |
| RANK | Rank of the current container, from 0 to WORLD_SIZE - 1. | 1 |
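
As a rough illustration of how a training script might consume these variables, here is a minimal sketch. The names `DistEnv`, `read_dist_env`, and `init_distributed` are hypothetical helpers invented for this example, not part of AIchor or PyTorch. The `gloo` backend default is also an assumption; GPU jobs would typically use `nccl`. The only PyTorch call used is `torch.distributed.init_process_group` with `init_method="env://"`, which reads the four variables from the environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the four injected variables."""
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(environ=os.environ) -> DistEnv:
    """Parse the injected variables; raises KeyError if one is missing."""
    return DistEnv(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )


def init_distributed(backend: str = "gloo") -> DistEnv:
    """Initialise the default process group from the environment.

    The backend default is an assumption made for this sketch.
    """
    # Imported lazily so read_dist_env() works without torch installed.
    import torch.distributed as dist

    env = read_dist_env()
    # "env://" makes PyTorch read MASTER_ADDR, MASTER_PORT, WORLD_SIZE
    # and RANK from the environment itself.
    dist.init_process_group(backend=backend, init_method="env://")
    return env
```

Calling `init_distributed()` near the top of the script named in `command` would then give every container the same process group. `read_dist_env()` on its own is handy for logging the rank or for choosing which container writes checkpoints.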

External documentation

Demo projects

The AIchor team has shared the following demo projects that can be cloned and used for AIchor experiments using PyTorch: