Skip to main content

Bucket Operations

Operations on project buckets can be performed from the Web, the AIchor UI, via the AIchor CLI, the AWS CLI (for S3-compatible backends), or Python libraries. The AIchor CLI is the recommended approach for scripting as it handles authentication automatically.

Web

Bucket operations such as uploading, downloading, and deleting files and directories can be performed from the AIchor UI, in the Datasets tab of the experiments page.

Datasets webpage Datasets webpage

AIchor CLI

All storage commands live under aichor storage cloud. See the full command reference for all flags and options.

Get credentials

Fetch the current credentials for a project's buckets:

aichor storage cloud storage-credentials --project-name my-project

List buckets

aichor storage cloud list --project-name my-project

Browse bucket contents

aichor storage cloud list-contents --project-name my-project --storage-id <bucket-id>
aichor storage cloud list-contents --project-name my-project --storage-id <bucket-id> --path data/

Create a directory

aichor storage cloud mkdir --project-name my-project --storage-id <bucket-id> --path data/my-dataset/

Upload files

Upload a single file:

aichor storage cloud upload \
--project-name my-project \
--storage-id <bucket-id> \
--src path/to/local/file.csv \
--dst data/file.csv

Upload a directory recursively:

aichor storage cloud upload \
--project-name my-project \
--storage-id <bucket-id> \
--src path/to/local/dir/ \
--dst data/my-dataset/ \
--recursive

Download files

aichor storage cloud download \
--project-name my-project \
--storage-id <bucket-id> \
--src data/file.csv \
--dst ./local-output/

aichor storage cloud download \
--project-name my-project \
--storage-id <bucket-id> \
--src results/ \
--dst ./local-output/ \
--recursive

Copy between buckets

aichor storage cloud cp \
--project-name my-project \
--src-storage-id <input-bucket-id> \
--src-path data/file.csv \
--dst-storage-id <output-bucket-id> \
--dst-path archive/file.csv

Delete files

Use --dry-run to preview what will be deleted before committing:

aichor storage cloud rm \
--project-name my-project \
--storage-id <bucket-id> \
--path old-data/ \
--recursive \
--dry-run

# Remove the --dry-run flag to execute
aichor storage cloud rm \
--project-name my-project \
--storage-id <bucket-id> \
--path old-data/ \
--recursive

For a worked end-to-end example covering all common operations, see the Storage Operations example script.


AWS CLI

The AWS CLI works with all S3-compatible backends (GCP, AWS, and on-premises Ceph). Credentials must first be retrieved via the AIchor CLI or web interface and configured via aws configure.

Install the AWS CLI

Run aws configure and enter the access key and secret key. Leave the default region and output format empty. The bucket ID and endpoint URL are available from the Dataset tab in the AIchor web interface.

List objects in a bucket:

aws s3 ls s3://$BUCKET --endpoint-url $AWS_ENDPOINT_URL

Upload a file:

aws s3 cp file_to_upload.csv s3://$BUCKET/data/ --endpoint-url $AWS_ENDPOINT_URL

Download a file:

aws s3 cp s3://$BUCKET/data/file.csv . --endpoint-url $AWS_ENDPOINT_URL

Download a folder:

aws s3 cp s3://$BUCKET/data/ . --recursive --endpoint-url $AWS_ENDPOINT_URL

Python libraries

Inside experiments, the AICHOR_INPUT_PATH, AICHOR_OUTPUT_PATH, and AWS_ENDPOINT_URL environment variables are available for use directly in code.

TensorFlow

Install the tensorflow-io library for S3 support:

import os
import tensorflow as tf
import tensorflow_io # required in every file that reads/writes S3

INPUT_PATH = os.environ.get("AICHOR_INPUT_PATH", "input")
OUTPUT_PATH = os.environ.get("AICHOR_OUTPUT_PATH", "output")
LOGS_PATH = os.environ.get("AICHOR_TENSORBOARD_PATH", "logs")

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOGS_PATH, histogram_freq=1)

model.fit(
...,
callbacks=[tensorboard_callback]
)

Pandas and Dask

Both libraries support S3 natively via s3fs. Install s3fs for built-in support:

import os
import pandas as pd

file_path = os.path.join(os.environ["AICHOR_INPUT_PATH"], "my_dataset.csv")

df = pd.read_csv(
file_path,
storage_options={"client_kwargs": {"endpoint_url": os.environ.get("AWS_ENDPOINT_URL")}}
)

Other libraries (s3fs)

For libraries that do not have built-in S3 support, s3fs provides file-like Python objects:

import os
import numpy as np
from s3fs.core import S3FileSystem

s3 = S3FileSystem(client_kwargs={"endpoint_url": os.environ.get("AWS_ENDPOINT_URL")})

file_path = os.path.join(os.environ.get("AICHOR_INPUT_PATH"), "my_dataset.npy")

with s3.open(file_path) as f:
array = np.load(f)