What is AIchor?
AIchor is a platform for running Machine Learning workloads at scale. It abstracts away the management of compute and storage infrastructure and helps automate every non-AI-related task so AI engineers can focus on their experiments.
AIchor supports both custom hardware infrastructure and various cloud providers, allowing users to avoid vendor lock-in.
How does it work?
The first step is to set up an engine, which represents the compute infrastructure where workloads run. An engine can be a Kubernetes cluster or an AWS ParallelCluster cluster, and it can either be created through AIchor or imported from existing infrastructure.
Once an engine is available, projects can be attached to it. A project links a VCS repository (GitHub, GitLab, Bitbucket, or Azure DevOps) to an engine, and defines which team members have access.
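The relationship between engines, projects, and team members can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, and the example URL are hypothetical and are not part of AIchor's API.

```python
from dataclasses import dataclass, field
from enum import Enum


class EngineKind(Enum):
    # The two engine types named above; the string values are made up.
    KUBERNETES = "kubernetes"
    AWS_PARALLELCLUSTER = "aws-parallelcluster"


@dataclass(frozen=True)
class Engine:
    # Hypothetical representation of a compute backend.
    name: str
    kind: EngineKind


@dataclass
class Project:
    # A project ties one repository to one engine and lists who may use it.
    repository_url: str
    engine: Engine
    members: set[str] = field(default_factory=set)

    def has_access(self, user: str) -> bool:
        """Return True if the user is a member of this project."""
        return user in self.members


engine = Engine(name="demo-engine", kind=EngineKind.KUBERNETES)
project = Project(
    repository_url="https://example.com/team/repo.git",  # placeholder URL
    engine=engine,
    members={"alice"},
)
```

In this sketch, access is a property of the project rather than the engine, which mirrors the description above: several projects could share one engine while keeping separate member lists.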
With a project in place, workloads (also known as experiments) can be submitted. Experiments can be triggered in two ways:
- By pushing a commit with an EXP or exp prefix to the linked repository
- By using the AIchor CLI
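To make the commit-prefix trigger concrete, here is a minimal sketch of how such a check could look. It assumes the prefix is matched at the very start of the commit message and that only the two spellings EXP and exp count. The exact matching rule AIchor applies is not specified here, and the function name is hypothetical.

```python
def triggers_experiment(commit_message: str) -> bool:
    """Return True if the commit message starts with EXP or exp.

    Assumption: the check looks only at the start of the message and
    accepts exactly these two spellings (not mixed case such as "Exp").
    """
    return commit_message.startswith(("EXP", "exp"))


# Example commit messages (invented for illustration).
messages = [
    "EXP: try a larger batch size",
    "exp: lower the learning rate",
    "Fix typo in README",
]
triggering = [m for m in messages if triggers_experiment(m)]
```

Under these assumptions, only the first two messages would start an experiment; an ordinary commit such as the third would be pushed without scheduling a workload.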
When an experiment is triggered, the following steps occur on the engine: AIchor clones the repository, builds a Docker image, and schedules the workload. Users have access to logs and resource utilisation metrics and, while the experiment is running, to the container itself for debugging.