
Operators

In AIchor, an operator is a Kubernetes operator, which can be seen as a plugin injected into a Kubernetes cluster.

The main purpose of operators is distributed computing; they are not needed when running a single container.

For distributed computing, a single experiment might need to schedule a master (or head, …) process, then several worker processes and/or other kinds of processes. Everything has to be interconnected, a role has to be assigned to every container, etc.

To do that, we need a component that deploys everything, assigns a role to each container, communicates the IP/port of the master process, etc.

How does it work?

Operators are containerized programs running inside the Kubernetes cluster. The operator is plugged into the Kubernetes API and watches for the creation of resources; when a new resource is detected, it creates the containers and passes them their role and the master's IP/port (usually through environment variables). For example, with the Jax-Operator:

  • The Jax-Operator detects a new jaxjob with 3 desired replicas
  • It creates the 3 replicas:
    • container 0: role master, addr-coordinator 0.0.0.0 (or localhost) and port 1234
    • container 1: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234
    • container 2: role worker, addr-master my-jaxjob-0 (internal routing to master container) and port 1234

NOTE: In the diagram below, the Jax-Operator is responsible only for the interactions in black, not the red ones.

(Diagram: the Jax-Operator's interactions with the Kubernetes API and the jaxjob containers)

Ray Operator

The Ray-Operator is developed and maintained by InstaDeep’s MLOps team.

  • How to use the Ray-Operator on AIchor?

Example source code: https://gitlab.com/instadeep/infra/ichor-demo-raytune

All 3 types are required: head, job, and worker. At least one worker must be set.

  • Head

Usually, this is the type to which users assign a GPU for performance reasons, keeping the parameter updater on the head, as well as the Ray global object store. The head is also the container that hosts the Ray Dashboard.

  • Worker(s)

The worker pools are scheduled after the head is up and running because they depend on the head: they have to connect to it. The operator runs a ray start ... command in the container and automatically provides the IP/port of the head to connect to.

Under this key you can deploy different pools of workers:

kind: AIchorManifest
apiVersion: 0.2.2

spec:
  ...
  types:
    ...
    Workers:
      - name: small-cpu-workers
        count: 10
        resources:
          cpus: 8
          ramRatio: 2
      - name: big-cpu-workers
        count: 10
        resources:
          cpus: 48
          ramRatio: 2

Below we show how to spawn a Ray actor or task on a specific worker pool.
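As a sketch, one way this can work (an assumption on our part: it relies on the operator registering each pool name as a Ray custom resource, which you should verify in the operator docs) is to request the pool name as a custom resource when declaring an actor or task:

import ray

ray.init()

# Hypothetical: assumes the scheduler sees a custom resource named after
# the worker pool ("big-cpu-workers" from the manifest above).
@ray.remote(resources={"big-cpu-workers": 1})
def heavy_task():
    return "ran on a node of the big-cpu-workers pool"

print(ray.get(heavy_task.remote()))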

  • Job

The job connects to the head like a normal worker, and it is also the entry point of the experiment: this is where the script is executed.

The job is scheduled after the head and all the workers are up and running because it depends on them, as it is the component that runs the script.

  • Setting up the distribution

Distribution setup is usually done through environment variables; that is NOT the case here.

The distribution is initialized before your script starts, through the ray start commands executed on every container of the experiment, so you don't need to interpret any environment variables to set up the distribution.

The worker pods have an init-container that tries to connect to the head before the actual worker containers start.

The job also has an init-container that tries to connect to the head before the actual job container starts, but this one additionally makes sure that all of the workers are connected to the head.

This init phase produces logs like this:

ray_address=ray://my-super-experiment-head.my-super-project.svc.cluster.local:10001, node_count=1, expected_node_count=2
retrying in 2s
ray_address=ray://my-super-experiment-head.my-super-project.svc.cluster.local:10001, node_count=1, expected_node_count=2
retrying in 2s

In the example logs above, the init-container of the job pod is connected to the head and checks the number of nodes connected to it: node_count=1, expected_node_count=2. As long as the missing worker hasn't connected to the head, it will continue to loop like this.

  • Injected environment variables inside the containers

The operator creates the following set of environment variables in every container.

CURRENT_POD_IP: (v1:status.podIP) # the ip of the pod
REDIS_PASSWORD: string # The password used by redis
RAY_SERVER: <exp-name>-head.<project>.svc.cluster.local:6379 # the addr:port to interact with the Ray Head
  • in your code:

As the job is automatically connected to the head, no parameters or environment variables are required:

import ray

def main():
    ray.init()  # no address needed: the container is already connected to the cluster

if __name__ == "__main__":
    main()
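To sanity-check the cluster, here is a minimal sketch (illustrative, not taken from the demo project) that dispatches a few tasks across the cluster and collects the results:

import ray

ray.init()

@ray.remote
def square(x):
    # Executed on whichever node Ray schedules the task on.
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]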
  • Ray Dashboard

When your experiment is up and running (at least the head) you can access the Ray Dashboard by clicking on the dashboard link (top right of AIchor’s UI).

  • Supported versions of Ray

Note: the ray binary has to be installed in the image.

  • 1.6.0
  • 1.7.0 1.7.2
  • 1.8.0
  • 1.9.2
  • 1.10.0
  • 1.11.0 1.11.1
  • 1.12.0 1.12.1
  • 1.13.0
  • 2.0.0 2.0.1
  • 2.1.0
  • 2.2.0
  • 2.3.0
  • 2.4.0
  • 2.5.0
  • 2.6.0
  • 2.7.0
  • 2.8.0
  • 2.9.0

TensorFlow

TensorFlow (TF) is an open-source machine learning framework developed by Google Brain. It offers a comprehensive ecosystem of tools, libraries, and community resources for building and deploying machine learning models. Here are some key benefits of TensorFlow:

Scalability: TensorFlow is designed for scalability and can be used to train and deploy machine learning models across a wide range of platforms, including CPUs, GPUs, and TPUs (Tensor Processing Units). This scalability makes it suitable for both small-scale experiments and large-scale production deployments.

Flexibility: TensorFlow provides a flexible and modular architecture that allows users to build a wide variety of machine learning models, including deep learning models, traditional machine learning models, and custom models. It offers high-level APIs such as Keras for easy model building as well as low-level APIs for fine-grained control over model architecture and training process.

Production Readiness: TensorFlow offers tools and APIs for deploying machine learning models in production environments, including TensorFlow Serving for serving models over RESTful APIs, TensorFlow Lite for deploying models on mobile and embedded devices, and TensorFlow.js for deploying models in web browsers.

TensorBoard: TensorFlow includes TensorBoard, a visualization toolkit for visualizing and debugging machine learning models. TensorBoard provides interactive visualizations of model graphs, training metrics, and other useful information to help users understand and optimize their models.

Community and Ecosystem: TensorFlow has a large and active community of developers, researchers, and practitioners who contribute to the development of the framework and share resources such as documentation, tutorials, and code examples. TensorFlow also integrates with other popular machine learning libraries and frameworks, such as scikit-learn, PyTorch, and Keras.

Support for Research and Education: TensorFlow is widely used in both research and education due to its extensive documentation, tutorials, and community support. It is often used in academic settings for teaching machine learning concepts and conducting research in various domains.

Performance: TensorFlow is optimized for performance and can leverage hardware accelerators such as GPUs and TPUs to speed up training and inference tasks. It also includes features such as distributed training, which allows users to train models on multiple devices or machines simultaneously for faster training times.

Overall, TensorFlow's combination of scalability, flexibility, production readiness, visualization tools, and strong community support makes it a popular choice for building and deploying machine learning models in a wide range of applications.

  • How to use TF on AIchor

AIchor users can use the TF operator by specifying it in the “operator” field of the manifest.

spec:
  operator: tf
  image: efficientnet
  command: "python train.py"

  tensorboard: # optional, disabled by default
    enabled: true

  types:
    Worker:
      count: 1
      resources:
        cpus: 5
        ramRatio: 2
        shmSizeGB: 0
        accelerators: # optional
          gpu:
            count: 0
  ...
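Under the hood, the Kubeflow training operator sets the standard TF_CONFIG environment variable (the cluster spec plus this container's task) in each replica; here is a minimal sketch (illustrative, not from the demo project) of picking it up with tf.distribute:

import json
import os

import tensorflow as tf

# TF_CONFIG is set by the Kubeflow training operator and is read
# automatically by tf.distribute.
print(json.loads(os.environ.get("TF_CONFIG", "{}")))

# Sets up collective communication across all Worker replicas.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")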
  • Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the TF operator.

https://gitlab.com/instadeep/infra/ichor-demo-efficientnet/

  • Documentation

The operator that manages TF experiments is the Kubeflow Training operator.

operator docs
examples

XGBoost

XGBoost (Extreme Gradient Boosting) is a powerful machine learning framework that has gained widespread popularity due to several benefits it offers:

Accuracy: XGBoost is known for its high predictive accuracy. It utilizes a technique called gradient boosting, which builds multiple decision trees sequentially, with each tree correcting errors made by the previous one. This iterative approach often leads to superior performance compared to other machine learning algorithms.

Speed: XGBoost is optimized for speed and efficiency. It is implemented in C++, which makes it significantly faster than many other implementations of gradient boosting algorithms. This speed is particularly advantageous when dealing with large datasets or when training complex models.

Scalability: XGBoost can efficiently handle large datasets with a large number of features. It has mechanisms for parallel processing and distributed computing, which allow it to scale seamlessly to datasets that may not fit into memory on a single machine.

Flexibility: XGBoost supports a variety of objective functions and evaluation metrics, making it suitable for a wide range of machine learning tasks, including regression, classification, and ranking problems. It also provides options for fine-tuning model parameters, allowing users to optimize performance for specific applications.

Regularization: XGBoost includes built-in regularization techniques to prevent overfitting. Regularization helps in controlling the complexity of the learned model, thereby improving generalization performance on unseen data.

Feature Importance: XGBoost provides insights into feature importance, which can be valuable for understanding the underlying patterns in the data and for feature selection.

Active Community and Support: XGBoost has a large and active community of users and contributors. This means there are plenty of resources available, including documentation, tutorials, and community forums, making it easier for users to get help and support when needed.

Overall, the combination of accuracy, speed, scalability, flexibility, and support has made XGBoost a popular choice for machine learning practitioners across various domains.

  • How to use XGBoost on AIchor

AIchor users can use the XGBoost operator by specifying it in the “operator” field of the manifest.

spec:
  operator: xgboost
  image: xgboost-demo
  command: "python src/train.py"

  tensorboard:
    enabled: true

  types:
    Master:
      count: 1
      resources:
        cpus: 1
        ramRatio: 2
        shmSizeGB: 0
    Workers:
      ...
  • Environment variables injected by the operator

The following environment variables are injected to set up the distribution between the different containers.

MASTER_PORT: 9999
MASTER_ADDR: xgboost-dist-demo-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of the container (from 0 to $WORLD_SIZE - 1)
WORKER_PORT: 9999
WORKER_ADDRS: xgboost-dist-demo-worker-0,xgboost-dist-demo-worker-1 # comma-separated values
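A minimal sketch (illustrative only; the demo project shows the full training wiring) of reading these variables in your script:

import os

# Distribution variables injected by the operator.
master_addr = os.environ["MASTER_ADDR"]
master_port = int(os.environ["MASTER_PORT"])
world_size = int(os.environ["WORLD_SIZE"])
rank = int(os.environ["RANK"])

if rank == 0:
    print(f"master listening on {master_addr}:{master_port}")
else:
    print(f"worker {rank}/{world_size} connecting to {master_addr}:{master_port}")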
  • Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the XGBoost operator.

https://gitlab.com/instadeep/infra/aichor/xgboost-demo

  • Documentation

The operator that manages XGBoost experiments is the Kubeflow Training operator.

docs
more docs
examples

This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.

PyTorch

PyTorch is a popular open-source deep learning framework primarily developed by Facebook's AI Research lab (FAIR). It provides a flexible and dynamic approach to building neural networks and conducting deep learning research. Here are some key benefits of PyTorch:

Dynamic Computational Graphs: PyTorch uses dynamic computational graphs, meaning that the graph structure is created on-the-fly as operations are performed, rather than predefined like in static graph frameworks like TensorFlow. This dynamic nature makes it easier to debug models and write code more intuitively.

Pythonic: PyTorch is deeply integrated with Python, which makes it easy to write and debug code. Its syntax closely resembles NumPy, making it more accessible to those familiar with Python and scientific computing libraries.

Automatic Differentiation: PyTorch provides automatic differentiation capabilities through its autograd package. This allows users to compute gradients of tensors with respect to some objective function, which is crucial for training neural networks using gradient-based optimization algorithms.

Flexibility: PyTorch offers a high level of flexibility, allowing users to build and customize complex neural network architectures with ease. It supports dynamic neural networks, recurrent neural networks (RNNs), convolutional neural networks (CNNs), and more.

Large Ecosystem: PyTorch has a rich ecosystem with many libraries and tools built on top of it, such as torchvision for computer vision tasks, torchaudio for audio processing, and transformers for natural language processing (NLP). Additionally, it integrates well with other popular Python libraries such as NumPy, SciPy, and pandas.

GPU Acceleration: PyTorch supports seamless GPU acceleration, allowing users to train deep learning models efficiently on GPUs. This is particularly useful for speeding up training on large datasets and complex models.

Active Community and Support: PyTorch has a large and active community of developers and researchers, providing extensive documentation, tutorials, and online forums for support. The community also contributes to the development of PyTorch by adding new features, fixing bugs, and sharing best practices.

Overall, PyTorch's combination of flexibility, ease of use, and strong community support has made it a popular choice for both researchers and practitioners in the field of deep learning.

  • How to use PyTorch Operator on AIchor

AIchor users can use the PyTorch operator by specifying it in the “operator” field of the manifest.

spec:
  operator: pytorch
  image: image
  command: "python3 src/main.py"

  tensorboard:
    enabled: true

  types:
    Master:
      count: 1
      resources:
        cpus: 4
        ramRatio: 3
        shmSizeGB: 0
        accelerators: # optional
          gpu:
            count: 1
            type: gpu
            product: Quadro-RTX-4000
    Worker:
      count: 2
      resources:
        cpus: 4
        ramRatio: 3
        shmSizeGB: 0
        accelerators: # optional
          gpu:
            count: 1
            type: gpu
            product: Quadro-RTX-4000
  • Environment variables injected by the operator

Environment variables are injected to set up the distribution between the different containers.

MASTER_PORT: 23456
MASTER_ADDR: pytorch-dist-cifar-master-0
WORLD_SIZE: 3 # number of containers
RANK: 1 # rank of the container (from 0 to $WORLD_SIZE - 1)
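These are the standard variables consumed by torch.distributed; here is a minimal sketch (illustrative, not from the demo project) of initializing the process group with them:

import os

import torch.distributed as dist

# The default "env://" init method reads MASTER_ADDR and MASTER_PORT from
# the environment; RANK and WORLD_SIZE are passed explicitly here.
dist.init_process_group(
    backend="gloo",  # use "nccl" when training on GPUs
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")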
  • Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the PyTorch operator.

https://gitlab.com/instadeep/infra/aichor/pytorch-demo

  • Documentation

The operator that manages PyTorch experiments is the Kubeflow Training operator.

operator docs
pytorch docs
pytorch docs about communication
examples

This project is based on this example from the official Kubeflow training operator repo, even though most of the code has been removed or rewritten.

Jax operator

JAX is an open-source library developed by Google Research that provides a way to perform numerical computing and machine learning tasks with high performance and automatic differentiation. Here are some key benefits of JAX:

Automatic Differentiation: JAX provides automatic differentiation (AD) capabilities, allowing users to compute gradients of functions with respect to their inputs. This is essential for training neural networks and optimizing objective functions using gradient-based optimization algorithms.

Composable Function Transformations: JAX represents computations as pure functions and allows users to compose and transform these functions using higher-order functions. This enables users to build complex computational pipelines and apply various transformations such as batching, vectorization, and parallelization.

High Performance: JAX is built on top of the XLA (Accelerated Linear Algebra) compiler, which optimizes and accelerates numerical computations, especially on hardware accelerators such as GPUs and TPUs. This results in high performance for both training and inference tasks.

Functional Programming Paradigm: JAX promotes a functional programming paradigm, where computations are represented as pure functions without side effects. This makes it easier to reason about and debug code, as well as facilitating parallel and distributed execution.

Interoperability with NumPy and TensorFlow: JAX provides compatibility and interoperability with NumPy and TensorFlow, allowing users to seamlessly transition between these libraries. This makes it easier to leverage existing codebases and ecosystems when using JAX for numerical computing and machine learning tasks.

Support for Research and Experimentation: JAX is designed to be flexible and extensible, making it suitable for research and experimentation in machine learning and scientific computing. It provides a low-level API for building custom operations and higher-level abstractions for building and training neural networks.

Active Community and Development: JAX has a growing community of users and contributors who actively develop and maintain the library. The community provides documentation, tutorials, and code examples to help users get started with JAX and explore its capabilities.

Overall, JAX's combination of automatic differentiation, composable function transformations, high performance, and interoperability with existing libraries makes it a powerful tool for numerical computing and machine learning tasks, especially in research and experimentation settings.

  • How to use Jax on AIchor

AIchor users can use the Jax operator by specifying it in the “operator” field of the manifest.

spec:
  operator: jax
  image: image
  command: "python3 -u main.py --operator=jax --sleep=300 --tb-write=True"

  tensorboard:
    enabled: true

  types:
    Worker:
      count: 2
      resources:
        cpus: 1
        ramRatio: 2
        shmSizeGB: 0
        accelerators: # optional
          gpu:
            count: 0
            type: gpu
  ...
  • Environment variables injected by the operator

The following environment variables are injected to set up the distribution between the different containers.

JAXOPERATOR_COORDINATOR_ADDRESS: 0.0.0.0:1234
JAXOPERATOR_NUM_PROCESSES: 0
JAXOPERATOR_PROCESS_ID: 1
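A minimal sketch (illustrative; it assumes your script passes these variables to jax.distributed.initialize, whose parameter names below are its real ones):

import os

import jax

# Map the operator-injected variables onto JAX's multi-process setup.
jax.distributed.initialize(
    coordinator_address=os.environ["JAXOPERATOR_COORDINATOR_ADDRESS"],
    num_processes=int(os.environ["JAXOPERATOR_NUM_PROCESSES"]),
    process_id=int(os.environ["JAXOPERATOR_PROCESS_ID"]),
)
print("process", jax.process_index(), "of", jax.process_count())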
  • Sample projects

The AIchor team has shared this demo project, which can be cloned and used for AIchor experiments using the Jax operator.

https://gitlab.com/instadeep/infra/aichor/smoke-test-any-operator

  • Documentation

Docs
Examples