AIchor Overview

The AIchor platform integrates with any Machine Learning project hosted in a Git repository via commit webhooks. AI engineers and researchers trigger Machine Learning pipelines simply by pushing code to their repository: each push automatically starts a centralized pipeline that handles every step required to run the experiment.
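
For context, the sketch below shows how such a commit webhook might be received, assuming a GitHub-style push event. The handler is purely illustrative and is not AIchor's actual implementation; only the payload fields (`ref`, `after`, `repository.clone_url`) follow the real GitHub push-event format.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class PushWebhookHandler(BaseHTTPRequestHandler):
    """Minimal receiver for a GitHub-style push event (illustrative only)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))

        # A push payload carries the branch ref, the new commit SHA, and the
        # repository clone URL -- enough information to start a pipeline run.
        branch = event["ref"].removeprefix("refs/heads/")
        commit = event["after"]
        clone_url = event["repository"]["clone_url"]

        print(f"push on {branch} @ {commit[:8]} from {clone_url}")
        self.send_response(202)  # accepted: the pipeline runs asynchronously
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), PushWebhookHandler).serve_forever()
```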

The centralized pipeline generically performs the following steps:

  • Creates an experiment entity in the database to allow for tracking;
  • Clones a fresh copy of the project's source code and checks out the correct branch and commit;
  • Fetches and parses the experiment manifest file provided by the user; the manifest tells the pipeline which artifacts to build for the experiment. The pipeline then builds the specified Docker image and pushes it to the appropriate container registry (a sketch of such a manifest follows this list);
  • Triggers the workload described by the experiment manifest.
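
AIchor's actual manifest schema is not reproduced here; the snippet below is a minimal sketch of the parsing step, assuming a YAML manifest whose field names (`builder`, `operator`, `resources`) are illustrative placeholders rather than the platform's real keys.

```python
import yaml  # pip install pyyaml

# Hypothetical manifest; field names are illustrative, not AIchor's schema.
MANIFEST = """
kind: experiment
builder:
  image: my-experiment:latest
  dockerfile: ./Dockerfile
operator: kuberay        # which scheduling operator runs the workload
resources:
  cpu: 8
  gpu: 1
  memory: 32Gi
"""

def parse_manifest(text: str) -> dict:
    """Parse the manifest and surface the artifacts the pipeline must build."""
    spec = yaml.safe_load(text)
    image = spec["builder"]["image"]
    operator = spec["operator"]
    print(f"build {image}, then hand the workload to the {operator} operator")
    return spec

if __name__ == "__main__":
    parse_manifest(MANIFEST)
```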

At this point, running the experiment is delegated to dedicated scheduling components, called operators, which acquire the required compute resources (CPU, GPU, memory, etc.). These operators allow the platform to run a variety of Machine Learning workloads, such as:

  • KubeRay Operator: distributed reinforcement learning experiments based on Ray (see the sketch after this list);
  • TFJob Operator: distributed supervised learning jobs based on TensorFlow;
  • PyTorch Operator: distributed training jobs based on PyTorch;
  • XGBoost Operator: distributed training jobs based on XGBoost;
  • Jax Operator: distributed training jobs based on JAX.
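
As an illustration of the kind of workload the KubeRay operator schedules, here is a minimal, generic Ray program, not an AIchor-specific example: it fans toy rollout tasks out across a cluster's workers and aggregates the results.

```python
import random
import ray

# On a KubeRay-managed cluster, ray.init() picks up the cluster address from
# the environment; run locally, it starts a single-node runtime instead.
ray.init()

@ray.remote
def rollout(seed: int) -> float:
    """Toy stand-in for one environment rollout returning an episode reward."""
    rng = random.Random(seed)
    return sum(rng.uniform(-1.0, 1.0) for _ in range(1000))

# Launch rollouts in parallel across the cluster and collect the rewards.
rewards = ray.get([rollout.remote(seed) for seed in range(16)])
print(f"mean episode reward: {sum(rewards) / len(rewards):.3f}")
```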