PyTorch
PyTorch is a popular open-source deep learning framework originally developed by Facebook's AI Research lab (FAIR), now part of Meta AI. It offers a flexible, dynamic approach to building neural networks and conducting deep learning research. Choose the pytorch operator on AIchor when you want distributed training across multiple worker pods, configured through the standard PyTorch distributed environment variables.
How to use
Select PyTorch by setting spec.operator: pytorch in your manifest. The full field-by-field specification lives in the Manifest Reference, and more complete examples are in the Manifest Examples.
spec:
  operator: "pytorch"
  image: "image"
  command: "python3 src/main.py"
  types:
    worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 3
        accelerators:
          gpu:
            count: 1
            type: "gpu"
            product: "Quadro-RTX-4000"
Injected environment variables
The following environment variables are injected into every worker container so that the containers can find each other and coordinate distributed training:
| Variable | Description | Example |
|---|---|---|
| MASTER_PORT | Port the master listens on for the distributed rendezvous. | 23456 |
| MASTER_ADDR | Address of the master container that the others connect to. | pytorch-dist-cifar-master-0 |
| WORLD_SIZE | Total number of containers in the run. | 3 |
| RANK | Rank of the current container, from 0 to WORLD_SIZE - 1. | 1 |
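
As an illustration of how a training script might consume these variables, here is a minimal sketch. The names DistEnv, read_dist_env and init_distributed are hypothetical helpers, not part of AIchor or PyTorch, and the gloo backend is an assumption (nccl is the usual choice on GPU workers). The only PyTorch call used is torch.distributed.init_process_group with init_method="env://", which reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK from the environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the four injected variables."""
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(env=None) -> DistEnv:
    """Parse the injected variables; raises KeyError if one is missing."""
    env = os.environ if env is None else env
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )


def init_distributed(backend: str = "gloo") -> DistEnv:
    """Join the process group using the environment-variable rendezvous.

    The backend default is an assumption made for this sketch.
    """
    # Imported lazily so that read_dist_env works without PyTorch installed.
    import torch.distributed as dist

    cfg = read_dist_env()
    # "env://" tells PyTorch to take MASTER_ADDR, MASTER_PORT, WORLD_SIZE
    # and RANK from the process environment.
    dist.init_process_group(backend=backend, init_method="env://")
    return cfg
```

A script such as src/main.py could call init_distributed() once at startup and then use the returned rank, for example to restrict logging or checkpoint writing to rank 0.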
External documentation
Demo projects
The AIchor team has shared the following demo projects that can be cloned and used for AIchor experiments using PyTorch: