Storage Management

To make data available to your AIchor experiments, the platform creates easy-to-use S3-style buckets for each of your projects:

  • Input bucket: named {project-name}-{uuid}-inputs, where you upload your datasets or artifacts once;
  • Output bucket: named {project-name}-{uuid}-outputs, where AIchor creates a folder for each of your experiments so your code can write files (models, etc.).

Important notes:

Note 1
Some special characters are not supported when naming files, folders and buckets; using them may cause issues when managing the buckets' contents.
The characters below are recommended when naming objects and are accepted by most third-party tools and providers.
To get the expected behaviour from the platform, please restrict file names, folder names and bucket names to the following characters (a small validation sketch follows the list):

  • 0-9
  • a-z
  • A-Z
  • Hyphen (-)
  • Underscore (_)
  • Period (.)
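
For illustration only, a small helper (hypothetical, not part of AIchor) can check an object name against this character set before uploading:

import re

# Allowed characters: digits, ASCII letters, hyphen, underscore and period.
ALLOWED_NAME = re.compile(r"^[0-9a-zA-Z._-]+$")

def is_safe_object_name(name: str) -> bool:
    """Return True if every path segment only uses the recommended characters."""
    return all(ALLOWED_NAME.match(segment) for segment in name.split("/") if segment)

is_safe_object_name("datasets/train_set-v1.0.csv")      # True
is_safe_object_name("datasets/train set (final).csv")   # False: space and parentheses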

Note 2
Secret and access keys rely on STS and are created dynamically (upon request).
There is no key-rotation system; instead, AIchor relies on temporary keys that expire after one hour (3600 seconds).
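
As an illustrative sketch (an assumption about the setup, not an AIchor-specific API), a boto3 client created inside an experiment would pick the temporary credentials up from the standard AWS environment variables and only needs the custom endpoint:

import os

import boto3

# Assumption: the temporary access/secret keys (and session token, if any) are exposed
# through the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_SESSION_TOKEN
# environment variables, which boto3 reads automatically.
s3_client = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

# The keys expire after 3600 seconds, so long-running jobs should recreate the client
# (or request fresh keys) rather than caching credentials once at startup.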

AWS CLI

To browse, upload, and download files/objects using the AWS CLI:

To list the contents of a bucket:
aws s3 ls s3://$BUCKET --endpoint-url https://storage.googleapis.com

To upload a file:
aws s3 cp file_to_be_uploaded s3://$BUCKET --endpoint-url https://storage.googleapis.com

To download a file:
aws s3 cp s3://$BUCKET/file . --endpoint-url https://storage.googleapis.com

To download a folder:
aws s3 cp s3://$BUCKET/folder . --recursive --endpoint-url https://storage.googleapis.com

In your code: S3 access

Once you have uploaded files to the buckets, you can access them from your code almost as if they were local files, using the following environment variables (see Environment Variables):

  • AICHOR_INPUT_PATH for s3://{project-name}-{uuid}-inputs/;
  • AICHOR_OUTPUT_PATH for s3://{project-name}-{uuid}-outputs/outputs/<current_experiment_id>/;
  • AICHOR_LOGS_PATH for s3://{project-name}-{uuid}-outputs/logs/<experiment_id>/ (used by TensorBoard).

Reading and writing files in the buckets via s3://... paths can be done in various ways, depending on the Python libraries you are using:

TensorFlow

For TensorFlow, you will additionally need to install the tensorflow-io library.

Example:

import os

import tensorflow as tf
import tensorflow_io  # required in every Python file that reads/writes s3:// paths

INPUT_PATH = os.environ.get("AICHOR_INPUT_PATH", "input")
OUTPUT_PATH = os.environ.get("AICHOR_OUTPUT_PATH", "output")
LOGS_PATH = os.environ.get("AICHOR_LOGS_PATH", "logs")

# TensorBoard writes its event files directly to the logs path on the output bucket.
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOGS_PATH, histogram_freq=1)

# ... build and compile your model ...

model.fit(
    # ... your datasets and training arguments ...
    callbacks=[
        tensorboard_callback,
    ],
)
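
In the same spirit, once tensorflow_io has registered the s3 filesystem, the trained model can in principle be written directly to the output bucket; the saved_model folder name below is just an example:

# Sketch: write the trained model into this experiment's output folder on the bucket.
model.save(os.path.join(OUTPUT_PATH, "saved_model"))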

Pandas, Dask

For these libraries, install the s3fs Python library to get built-in support for s3:// paths.

Example:

import os

import pandas as pd

file_path = os.path.join(os.environ["AICHOR_INPUT_PATH"], "my_dataset.csv")

# storage_options is forwarded to s3fs; the endpoint must point at the bucket provider.
df = pd.read_csv(file_path, storage_options={"client_kwargs": {"endpoint_url": os.environ.get("S3_ENDPOINT")}})
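
Dask accepts the same storage_options and forwards them to s3fs; a minimal sketch reading the same (example) my_dataset.csv could look like this:

import os

import dask.dataframe as dd

file_path = os.path.join(os.environ["AICHOR_INPUT_PATH"], "my_dataset.csv")

# storage_options is forwarded to s3fs, exactly as with pandas.
ddf = dd.read_csv(file_path, storage_options={"client_kwargs": {"endpoint_url": os.environ.get("S3_ENDPOINT")}})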

Other

Other libraries might not have built-in support for s3://… paths. In that case, the s3fs Python library also provides file-like objects that you can use as if you were manipulating local files with your libraries.

Example:

import os

import numpy as np
from s3fs.core import S3FileSystem

s3 = S3FileSystem(client_kwargs={"endpoint_url": os.environ.get("S3_ENDPOINT")})

file_path = os.path.join(os.environ.get("AICHOR_INPUT_PATH"), "my_dataset.npy")

# Read the numpy file through a file-like object opened by s3fs
with s3.open(file_path) as file:
    array = np.load(file)
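
The same file-like objects also work for writing; as a sketch (the results.npy name is just an example), an array can be saved to the experiment's output folder:

output_path = os.path.join(os.environ.get("AICHOR_OUTPUT_PATH"), "results.npy")

# Open the object in binary write mode and let numpy serialize the array into it.
with s3.open(output_path, "wb") as file:
    np.save(file, array)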