Experiment Skill

This skill enables Claude Code to submit and manage AIchor experiments on your behalf. Invoke it by describing what you want to do — e.g. "submit an experiment", "stream my logs", "cancel <experiment-id>".

Installing

Skills are installed by placing them in your Claude Code skills directory.

Add the following files to your ~/.claude/skills/ directory so that it looks like this:

~/.claude/skills/
├── aichor-experiment/
│  ├── SKILL.md
│  └── reference.md (detailed API docs - loaded when needed)
└── (other skills)

SKILL.md

---
name: aichor-experiment
description: Use when the user wants to submit experiments to the cloud, launch an experiment on AIchor, run something on GPUs, check experiment logs, monitor experiment status, or cancel a cloud experiment. Handles authentication, manifest validation, CLI submission or git commit fallback, and log streaming.
---

# AIchor Experiment Submission

Submit and manage experiments on the [AIchor](https://aichor.ai) cloud platform. Supports direct CLI submission as the primary path, with git-based triggering as a fallback when the CLI is unavailable. See the [AIchor documentation](https://docs.aichor.ai) for platform details.

## Prerequisites

Before using this skill, ensure:

1. **AIchor CLI is installed**: Check with `aichor --version`
2. **Environment variables are set** (in `.env` or shell):
   - `AICHOR_API_KEY` — Personal Access Token for API authentication
   - `AICHOR_PROJECT_NAME` — Name of the target AIchor project
   - `AICHOR_ENGINE_NAME` — Name of the target AIchor engine
3. **`manifest.yaml`** exists at the repository root with the experiment configuration
4. If **`manifest.yaml`** does not exist, check for a **`manifests/`** directory containing experiment-type subdirectories and/or YAML files that can be passed to `aichor experiments submit commit-sha` with the `--manifest-path` option
5. **Git remote** is configured (required for commit-sha submission and git fallback)

> If the CLI is not installed:

- Notify the user that the CLI is not installed and ask whether they want to install it
- If they accept, tell them that the installation requires `uv` and ask for permission to install it
- If they also accept, install `uv` and then install the CLI by running
  `uv tool install aichor-cli --index https://aichor-python-packages.aichor.ai`

> If the user declines the installation, fall back to git-based experiment triggering (see Path B below). Without the CLI, functionality is limited to experiment submission only: monitoring, log streaming, and experiment management all require the CLI.
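
The prerequisite checks above can be sketched as a small shell function. This is a minimal sketch, not part of the AIchor CLI: the name `check_prereqs` and its messages are hypothetical. It assumes it is run from the repository root, and it treats a missing CLI as a warning only, because Path B remains available.

```shell
# Hypothetical helper (not part of the AIchor CLI): report missing prerequisites.
# Prints one line per problem and returns non-zero if a required item is missing.
check_prereqs() {
  local missing=0 var value

  # Required environment variables, as listed in the prerequisites.
  for var in AICHOR_API_KEY AICHOR_PROJECT_NAME AICHOR_ENGINE_NAME; do
    # Indirect lookup of the variable named in $var (names come from the fixed list above).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing env var: $var"
      missing=1
    fi
  done

  # The CLI is optional because git-based triggering exists, so only warn.
  if ! command -v aichor >/dev/null 2>&1; then
    echo "warning: aichor CLI not found on PATH"
  fi

  # Either a root manifest.yaml or a manifests/ directory must exist.
  if [ ! -f manifest.yaml ] && [ ! -d manifests ]; then
    echo "missing manifest.yaml or manifests/ directory"
    missing=1
  fi

  return "$missing"
}
```

Running `check_prereqs` before any other step gives a single list of what to fix. It does not verify the Git remote or the contents of the manifest.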

## Cost and Safety Rule

**ALWAYS ask for explicit user approval before ANY experiment submission.**

Before executing ANY submission (CLI or git), you MUST:

1. Show the experiment details (script path, manifest command, resource allocation)
2. Ask: "Do you want me to trigger a cloud experiment? This will consume resources and cost money."
3. Wait for explicit "yes"
4. NEVER assume approval, even if the user says "run" or "submit"

This is non-negotiable. No exceptions.
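
If a submission is ever wrapped in a script rather than driven from the conversation, the same rule can be enforced with a strict prompt. The sketch below is illustrative only: `confirm_submission` is a hypothetical name, and it assumes the reply arrives on standard input. It accepts nothing but the exact word `yes`.

```shell
# Hypothetical helper: succeed only when the reply is exactly "yes".
# Any other reply ("run", "submit", "y", empty input, EOF) counts as a refusal.
confirm_submission() {
  local reply
  printf '%s\n' "Do you want me to trigger a cloud experiment? This will consume resources and cost money." >&2
  printf 'Type "yes" to continue: ' >&2
  # Treat EOF or an unterminated line as a refusal.
  IFS= read -r reply || return 1
  [ "$reply" = "yes" ]
}
```

A caller would guard the submit command with it, for example `confirm_submission && aichor experiments submit ...`, so that nothing is submitted unless the function returns success.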

## Workflow

### 1. Pre-flight Checks

Before submitting, read `manifest.yaml` and verify:

- `spec.command` points to the correct experiment script
- `spec.image` matches the compute profile required (CPU or GPU image)
- `spec.operator` matches the framework (e.g. `pytorch`, `tensorflow`)
- `spec.types.worker.resources` are appropriate for the workload
- Code is committed and pushed if using commit-sha submission

Also check git status:

```bash
git status
git log -1 --oneline
```
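
To surface the manifest fields listed above for review, a plain `grep` is often enough. This is a rough sketch under stated assumptions: `show_manifest_fields` is a hypothetical name, and it matches keys by name rather than parsing YAML. It will therefore also show the builder-level `image:` line and any `count:` under an accelerator.

```shell
# Hypothetical helper: print the manifest lines worth reviewing before submission.
# Matches keys by name with grep (no YAML parsing); line numbers are included.
show_manifest_fields() {
  local manifest="${1:-manifest.yaml}"
  grep -nE '^[[:space:]]*(operator|image|command|count|cpus|ramRatio|product):' "$manifest"
}
```

Called with no argument it reads `manifest.yaml` in the current directory. Pass a path to inspect another manifest.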

### 2. Authentication

Check whether the CLI is already authenticated, and authenticate if needed. Use `aichor auth key` for non-interactive authentication (preferred for automated workflows).

```bash
# Check current context/auth state
aichor projects list

# If not authenticated, use API key from environment:
aichor auth key --apikey "$AICHOR_API_KEY"

# Check that the current CLI context is correct
aichor context list

# Set default project and engine by name
aichor context set project "$AICHOR_PROJECT_NAME"
aichor context set engine "$AICHOR_ENGINE_NAME"
```

> Credentials are stored in `~/.aichor/aichor_config.json`. Re-authenticate if you get permission errors.
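
The authentication steps above can be combined into one function. This is a sketch, not documented behaviour: `ensure_aichor_auth` is a hypothetical name, and it assumes that `aichor projects list` exits non-zero when the CLI is not authenticated, which should be confirmed against the installed CLI version.

```shell
# Hypothetical helper: authenticate only when needed, then set the context.
# Assumption: `aichor projects list` exits non-zero when not authenticated.
ensure_aichor_auth() {
  if ! aichor projects list >/dev/null 2>&1; then
    aichor auth key --apikey "$AICHOR_API_KEY" || return 1
  fi
  # Set the default project and engine from the environment.
  aichor context set project "$AICHOR_PROJECT_NAME" &&
    aichor context set engine "$AICHOR_ENGINE_NAME"
}
```

The function returns non-zero as soon as any step fails, so a caller can stop before attempting a submission.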

### 3. Submit Experiment

#### Path A: AIchor CLI (preferred)

**From committed code** (recommended for reproducibility):

```bash
aichor experiments submit commit-sha $(git rev-parse HEAD) \
  --branch $(git branch --show-current)
```

**From local changes** (for quick iteration, does not require a commit):

```bash
aichor experiments submit local \
  --repo-dir . \
  --message "descriptive experiment message"
```

**Using a non-default manifest file**:

Use this when the user wants a manifest other than the default `manifest.yaml`.
Ensure the latest version of the manifest is pushed to the repository first.

```bash
aichor experiments submit commit-sha $(git rev-parse HEAD) \
  --branch $(git branch --show-current) \
  --manifest-path <manifest-file-name.yaml>
```

**Triggering multiple experiments each with their own manifest file**:

Use this when the user wants to trigger several experiments with different manifests quickly.
Ensure the latest versions of these manifests are pushed to the repository first.

⚠️ Submit the first experiment and wait for its build step to finish before submitting the rest. The first build populates the registry cache, so the remaining builds will be near-instant.

For each manifest `<manifest_name_X>`, run:
```bash
aichor experiments submit commit-sha $(git rev-parse HEAD) \
  --branch $(git branch --show-current) \
  --manifest-path <manifest_name_X.yaml>
```

These commands print an experiment ID on success. Capture it from the output for monitoring.
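
When several manifests are involved, it can help to print the full batch of commands for the user to review before anything is submitted. The sketch below only prints; it never calls the CLI. `print_submit_commands` is a hypothetical name, and the manifest paths passed to it are placeholders.

```shell
# Hypothetical helper: print (do not run) one submit command per manifest,
# so the whole batch can be shown to the user for approval first.
# Usage: print_submit_commands <sha> <branch> <manifest>...
print_submit_commands() {
  local sha="$1" branch="$2" manifest
  shift 2
  for manifest in "$@"; do
    printf 'aichor experiments submit commit-sha %s --branch %s --manifest-path %s\n' \
      "$sha" "$branch" "$manifest"
  done
}
```

For example, `print_submit_commands "$(git rev-parse HEAD)" "$(git branch --show-current)" <manifest_name_1.yaml> <manifest_name_2.yaml>` lists the two commands that would be run once the user approves.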


#### Path B: Git Commit Fallback (if CLI unavailable)

If the `aichor` CLI is not installed or authentication fails, trigger experiments via git push. This path only supports experiment submission — monitoring status, streaming logs, cancelling, and resubmitting experiments are not available without the CLI. Before using this path, ask the user to confirm that their AIchor project is configured with a webhook or CI integration that triggers experiments on push.

```bash
# Stage, commit with the project's trigger prefix, and push
git add manifest.yaml <experiment-directory>/
git commit -m "exp: descriptive experiment name"
git push origin <branch>
```

> The commit prefix that triggers experiments (e.g. `exp:`, `experiment:`) depends on the project's CI/webhook configuration. Check your project's conventions.

### 4. Monitor

```bash
# Stream logs continuously from the current step (blocks until experiment completes)
aichor experiments logs stream <EXPERIMENT_ID>

# Stream logs from a specific pod only
aichor experiments logs stream <EXPERIMENT_ID> --pod-id <POD_ID>

# Query logs from a specific step without blocking
aichor experiments logs query <EXPERIMENT_ID> --step run   # clone | build | submit | run

# List experiments (paginated)
aichor experiments list
aichor experiments list --page-number 1 --page-size 20

# Get details for a specific experiment
aichor experiments list <EXPERIMENT_ID>

# Get costs for a specific experiment. Note that costs are computed daily
# so the costs will not be available if the experiment was run the same day
aichor experiments cost <EXPERIMENT_ID>

# Cancel if needed
aichor experiments cancel <EXPERIMENT_ID>

# Resubmit a previous experiment
aichor experiments resubmit <EXPERIMENT_ID>

# List pods for an experiment
aichor experiments list-pods <EXPERIMENT_ID>
```

> For long-running experiments, prefer `logs query` with polling over `logs stream`, or run `logs stream` in the background.
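
Polling with `logs query` could look like the sketch below. `poll_logs`, its default interval, and its poll limit are hypothetical choices, not CLI features. The loop does not detect when the experiment finishes; it simply stops after a fixed number of polls, and each poll prints whatever `logs query` returns at that moment.

```shell
# Hypothetical helper: poll the run-step logs a fixed number of times.
# Usage: poll_logs <experiment-id> [interval-seconds] [max-polls]
poll_logs() {
  local id="$1" interval="${2:-60}" max_polls="${3:-10}" i=0
  while [ "$i" -lt "$max_polls" ]; do
    aichor experiments logs query "$id" --step run
    i=$((i + 1))
    # Skip the final sleep so the function returns promptly after the last poll.
    if [ "$i" -lt "$max_polls" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, `poll_logs <EXPERIMENT_ID> 120 5` would query the run step five times, two minutes apart.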


### 5. Summarize Experiments

```bash
# Get details for a specific experiment
aichor experiments list <EXPERIMENT_ID>

# Get costs for a specific experiment. Note that costs are computed daily
# so the costs will not be available if the experiment was run the same day
aichor experiments cost <EXPERIMENT_ID>

# List pods for an experiment
aichor experiments list-pods <EXPERIMENT_ID>
```

## manifest.yaml Quick Reference

### Full Example (CPU)

```yaml
kind: AIchorManifest
apiVersion: 0.2.2

builder:
  image: <your-cpu-image>
  target: <your-cpu-image>
  dockerfile: ./Dockerfile
  context: .

spec:
  operator: pytorch  # or tensorflow, kuberay, etc.
  image: <your-cpu-image>
  command: "<run command for your experiment>"

  storage:
    sharedVolume:
      mountPoint: "/mnt/storage"
      sizeGB: 2
      storageClass: cephfs
      accessMode: ReadWriteMany

  types:
    worker:
      count: 1
      resources:
        cpus: 4
        ramRatio: 2
```

### GPU Example (changes to the above)

```yaml
# Replace the spec section with GPU-specific configuration:
spec:
  operator: pytorch
  image: <your-gpu-image>
  command: "<run command for your experiment>"

  # ... same storage config as above ...

  types:
    worker:
      count: 1
      resources:
        cpus: 16
        ramRatio: 4
        accelerators:
          gpu:
            count: 1
            product: NVIDIA-H100-80GB-HBM3
            type: gpu
```

### Key Configuration Fields

| Field | Description |
|-------|-------------|
| `spec.operator` | Framework operator (`pytorch`, `tensorflow`, `kuberay`, etc.) |
| `spec.command` | Shell command to run the experiment |
| `spec.image` | Docker image target to use |
| `spec.types.worker.resources.cpus` | Number of CPUs |
| `spec.types.worker.resources.ramRatio` | RAM multiplier (2 = standard, 4 = high memory) |
| `spec.types.worker.resources.accelerators.gpu.count` | Number of GPUs |
| `spec.types.worker.resources.accelerators.gpu.product` | GPU model |

## CLI Command Reference

### Auth & Context

| Command | Description |
|---------|-------------|
| `aichor auth key --apikey TEXT` | Non-interactive API key login |
| `aichor context list` | Show current context (project, engine) |
| `aichor context set project NAME` | Set default project by name |
| `aichor context set engine NAME` | Set default engine by name |
| `aichor context reset project` | Clear default project |
| `aichor context reset engine` | Clear default engine |
| `aichor context reset all` | Clear entire context |

### Projects & Engines

| Command | Description |
|---------|-------------|
| `aichor projects list [NAME]` | List projects or show a specific one |
| `aichor projects costs [NAME]` | Show project costs (--gpu, --cpu, --memory, --pvc, --buckets, --accelerators, --total) |
| `aichor projects linked-engines [NAME]` | List engines linked to a project |
| `aichor engines list [NAME]` | List engines or show a specific one |

### Experiments

| Command | Description |
|---------|-------------|
| `aichor experiments submit commit-sha SHA --branch BRANCH` | Submit from commit |
| `aichor experiments submit local --repo-dir DIR --message MSG` | Submit from local repo |
| `aichor experiments list [EXPERIMENT_ID]` | List experiments or show a specific one |
| `aichor experiments cost EXPERIMENT_ID` | Show experiment costs (--cpu, --memory, --accelerators, --total) |
| `aichor experiments cancel EXPERIMENT_ID` | Cancel experiment |
| `aichor experiments resubmit EXPERIMENT_ID` | Resubmit experiment |
| `aichor experiments list-pods EXPERIMENT_ID` | List pods for an experiment |
| `aichor experiments logs stream EXPERIMENT_ID` | Stream logs from current step |
| `aichor experiments logs query EXPERIMENT_ID --step STEP` | Query logs from a step |

### Storage

| Command | Description |
|---------|-------------|
| `aichor storage cloud list` | List project buckets |
| `aichor storage cloud list-contents --storage-id ID` | List contents of a bucket |
| `aichor storage cloud upload --storage-id ID --remote-path PATH` | Upload file/dir to bucket |
| `aichor storage cloud download --storage-id ID --remote-path PATH` | Download from bucket |
| `aichor storage cloud cp --src-storage-id ID --src-path P --dst-storage-id ID --dst-path P` | Copy between buckets |
| `aichor storage cloud rm --storage-id ID --path PATH` | Delete file/dir in bucket |
| `aichor storage cloud mkdir --storage-id ID --path PATH` | Create directory in bucket |
| `aichor storage cloud storage-credentials --engine-name NAME` | Get bucket credentials |

### Local Repo

| Command | Description |
|---------|-------------|
| `aichor local-repo init` | Initialize repo with templated files (interactive) |
| `aichor local-repo generate-files --file FILE` | Generate specific files (dockerfile, pyproject, manifest, etc.) |

## Troubleshooting

| Issue | Fix |
|-------|-----|
| Auth expired / permission error | `aichor auth key --apikey $AICHOR_API_KEY` |
| "project not set" error | `aichor context set project <PROJECT_NAME>` |
| "engine not set" error | `aichor context set engine <ENGINE_NAME>` |
| Wrong experiment script runs | Update `spec.command` in `manifest.yaml` |
| Build failure | Check Dockerfile and dependency files |
| OOM / resource issues | Increase `ramRatio` in `manifest.yaml` |
| CLI not installed | Ask the user to install `uv` and run `uv tool install aichor-cli --index https://aichor-python-packages.aichor.ai`, or use the git commit fallback (Path B) |
| Logs not streaming | Try `aichor experiments logs query <ID> --step run` |

reference.md

## Additional resources

- For complete AIchor CLI details, see the [AIchor CLI documentation](https://docs.aichor.ai/category/aichor-cli)