Maintenance and availability
This section describes AIchor's availability and the maintenance operations carried out in the event of a major disruption.
AIchor healthcheck
Two main pillars on the engine side may require troubleshooting:
AIchor Project: The status of a project is displayed on the UI (Admin view). If a specific project has issues and troubleshooting is needed, check the following:
- The project XRD (IaaC resource), which is InstaDeep's responsibility
- The webhook is created on the VCS and shows no errors, which is the administrator's responsibility (in liaison with the VCS administrator)
AIchor Engine: The status of the engine is displayed on the UI. If the status is ready but a problem is still suspected on a specific engine, the Crossplane XRD resource related to the engine needs to be checked by InstaDeep.
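The checks above can be summarized as a small lookup that pairs each item with its owner. This is only an illustrative sketch: the `Check` class, the `CHECKS` table, the `checks_for` helper and the `"project"` / `"engine"` keys are hypothetical names introduced here and are not part of AIchor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    item: str   # what to inspect
    owner: str  # who is responsible for inspecting it


# Items and owners are taken from the text above.
# The dictionary keys are hypothetical labels, not AIchor identifiers.
CHECKS = {
    "project": [
        Check("Project XRD (IaaC resource)", "InstaDeep"),
        Check("Webhook created on the VCS, no errors displayed",
              "Administrator (with the VCS administrator)"),
    ],
    "engine": [
        Check("Crossplane XRD resource related to the engine", "InstaDeep"),
    ],
}


def checks_for(component: str) -> list[Check]:
    """Return the checks to run for 'project' or 'engine'."""
    try:
        return CHECKS[component]
    except KeyError:
        raise ValueError(f"unknown component: {component!r}") from None


if __name__ == "__main__":
    for check in checks_for("project"):
        print(f"{check.item} -> {check.owner}")
```

Keeping the owner next to each item makes it easy to see which checks need to be escalated to InstaDeep and which can be handled by the administrator.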
AIchor highly available architecture
AIchor's architecture follows a control plane/data plane model, meaning that the control plane and data plane are largely independent of each other.
The control plane hosts central components such as the UI, API, and database, while the data plane handles the actual processing and workload.
This separation allows for efficient scalability, flexibility, and modularity in AIchor's system design, ensuring that each component can operate and evolve with minimal interdependence.
While the control plane is a Kubernetes cluster, the data plane (or engine) can be a Kubernetes cluster or another managed service capable of hosting workload execution.
Recovering AIchor services
In the event of a major incident or security breach, redeploying AIchor from Infrastructure as Code (IaaC) is designed to be swift and efficient, with the entire process typically taking up to 1 hour. This allows for rapid recovery and restoration of the system with minimal downtime.
Once the control plane is up and running, engines can be re-created or re-imported in the relevant organizations.
The database is backed up continuously and can be restored within 30 minutes.
AWS
On AWS, the control plane is EKS-based and is therefore highly available, since EKS is a managed service spanning multiple availability zones. Even in the case of a major issue on EKS, restoring the control plane, including a restore of the database, takes up to 1 hour.
The same applies to an EKS engine used as the data plane, since it is the same managed service. In case of a major issue, there are two cases:
- Imported EKS: If the EKS engine was imported into AIchor, it is the customer's responsibility to re-deploy it before importing it again.
- Created EKS: If the EKS engine was created from AIchor, a new engine can be deployed from the AIchor UI.
In both cases, running experiments are expected to fail. However, if the control plane was not affected, their logs and metrics remain persisted on AIchor.
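The two EKS engine cases can be sketched as a simple decision function. This is a minimal illustration of the recovery paths described above, not an AIchor API: `EngineOrigin`, `recovery_steps` and `logs_and_metrics_guaranteed` are hypothetical names, and the step descriptions paraphrase the text.

```python
from enum import Enum


class EngineOrigin(Enum):
    IMPORTED = "imported"  # EKS cluster deployed by the customer, then imported
    CREATED = "created"    # EKS cluster created from AIchor


def recovery_steps(origin: EngineOrigin) -> list[str]:
    """Return the recovery steps for an EKS engine after a major issue."""
    if origin is EngineOrigin.IMPORTED:
        return [
            "Customer re-deploys the EKS cluster",
            "Import the cluster into AIchor again",
        ]
    return ["Deploy a new engine from the AIchor UI"]


def logs_and_metrics_guaranteed(control_plane_affected: bool) -> bool:
    """Persistence is only stated for the case where the control plane was
    not affected; the text makes no claim about the other case."""
    return not control_plane_affected


if __name__ == "__main__":
    for origin in EngineOrigin:
        print(origin.value, "->", recovery_steps(origin))
```

The origin of the engine decides who acts first: the customer for an imported cluster, or the AIchor user for a created one.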