Storage Management
To make data available to your AIchor experiments, the platform creates easy-to-use S3-style buckets for each of your projects:
- Input bucket: named {project-name}-{uuid}-inputs, where you upload your datasets or artifacts once;
- Output bucket: named {project-name}-{uuid}-outputs, where AIchor creates a folder for each of your experiments so that your code can write files (models, etc.).
Important notes:
Note 1
When naming files, folders and buckets, some special characters are not supported.
If you use such characters, you might experience issues when managing buckets' contents.
Below is the list of recommended characters for naming objects; it is accepted by most third-party tools and providers.
To get the expected behaviour from the platform, please restrict file names, folder names and bucket names to the following characters (a small validation sketch follows the list):
- 0-9
- a-z
- A-Z
- Hyphen (-)
- Underscore (_)
- Period (.)
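For illustration, here is a minimal Python sketch (a hypothetical helper, not part of AIchor) that checks whether a name only uses the recommended characters:
import re

# Recommended character set: digits, letters, hyphen, underscore, period.
_ALLOWED_NAME = re.compile(r"^[0-9A-Za-z._-]+$")

def is_recommended_name(name: str) -> bool:
    """Return True if a file, folder or bucket name only uses recommended characters."""
    return bool(_ALLOWED_NAME.match(name))

# is_recommended_name("my_dataset-v1.csv")   -> True
# is_recommended_name("my dataset (v1).csv") -> False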
Note 2
Secret and access keys rely on STS and are created dynamically (upon request).
There is no key rotation system; instead, AIchor relies on temporary keys that expire after one hour (3600 seconds).
AWS CLI
To browse, upload, or download files/objects using the AWS CLI:
- Install the AWS CLI: https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html
- Run aws configure, then enter the generated access and secret keys (leave the default region name and output format empty). You can get your ACCESS and SECRET keys from AIchor's UI on the dataset tab.
- Get the bucket name from AIchor's UI.
To list the data in a bucket (where $BUCKET is the bucket name from the UI), execute:
aws s3 ls s3://$BUCKET --endpoint-url https://storage.googleapis.com
To upload a file, execute:
aws s3 cp file_to_be_uploaded s3://$BUCKET --endpoint-url https://storage.googleapis.com
To download a file:
aws s3 cp s3://$BUCKET/file . --endpoint-url https://storage.googleapis.com
To download a folder:
aws s3 cp s3://$BUCKET/folder . --recursive --endpoint-url https://storage.googleapis.com
In your code: S3 access
Once you have uploaded files to the buckets, you can access them from your code almost as if they were local files, thanks to the following environment variables (see Environment Variables):
- AICHOR_INPUT_PATH for s3://{project-name}-{uuid}-inputs/
- AICHOR_OUTPUT_PATH for s3://{project-name}-{uuid}-outputs/outputs/<current_experiment_id>/
- AICHOR_LOGS_PATH for s3://{project-name}-{uuid}-outputs/logs/<experiment_id>/ (used by TensorBoard)
Accessing the buckets to read and write files using s3://... paths can be done in various ways, depending on the Python libraries you are using:
TensorFlow
For TensorFlow, you will additionally need to install the tensorflow-io library.
Example:
import os

import tensorflow as tf
import tensorflow_io  # needed in every Python file involving s3 read/write

# Resolve the bucket paths injected by AIchor (with local fallbacks).
INPUT_PATH = os.environ.get("AICHOR_INPUT_PATH", "input")
OUTPUT_PATH = os.environ.get("AICHOR_OUTPUT_PATH", "output")
LOGS_PATH = os.environ.get("AICHOR_LOGS_PATH", "logs")

# Write TensorBoard logs directly to the logs path (an s3:// URI on AIchor).
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOGS_PATH, histogram_freq=1)

# ... build and compile your model ...
model.fit(
    # ... training data and other arguments ...
    callbacks=[
        # ... other callbacks ...
        tensorboard_callback,
    ],
)
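To write your own artifacts to the output bucket, one option is tf.io.gfile, which resolves s3:// paths once tensorflow_io is imported. A minimal sketch, assuming the variables above (the file name and content are illustrative):
# Write a small text artifact into this experiment's output folder.
with tf.io.gfile.GFile(os.path.join(OUTPUT_PATH, "notes.txt"), "w") as f:
    f.write("training finished")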
Pandas, dask
For these libraries, you need to install the s3fs Python library for built-in support.
Example:
import os
import pandas as pd
# Build the s3:// path to the dataset inside the input bucket.
file_path = os.path.join(os.environ["AICHOR_INPUT_PATH"], "my_dataset.csv")
# pandas delegates s3:// paths to s3fs; pass the endpoint through storage_options.
df = pd.read_csv(file_path, storage_options={"client_kwargs": {"endpoint_url": os.environ.get("S3_ENDPOINT")}})
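Writing back to the output bucket works the same way with pandas; a minimal sketch, assuming the same environment variables (the output file name is illustrative):
# Write the DataFrame into this experiment's output folder.
output_path = os.path.join(os.environ["AICHOR_OUTPUT_PATH"], "predictions.csv")
df.to_csv(output_path, index=False, storage_options={"client_kwargs": {"endpoint_url": os.environ.get("S3_ENDPOINT")}})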
Other
Other libraries might not have built-in support for s3://… paths.
In that case, the s3fs Python library also provides handy file-like Python objects that can be used as if you were manipulating local files with your libraries.
Example:
import os
import numpy as np
from s3fs.core import S3FileSystem
s3 = S3FileSystem(client_kwargs={"endpoint_url": os.environ.get("S3_ENDPOINT")})
file_path = os.path.join(os.environ.get("AICHOR_INPUT_PATH"), "my_dataset.npy")
# Read the numpy array through a file-like object opened by s3fs
with s3.open(file_path) as file:
    array = np.load(file)
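Writing works the same way by opening the destination object in binary write mode; a minimal sketch, again assuming the same environment variables (the output file name is illustrative):
# Save the array into this experiment's output folder.
output_path = os.path.join(os.environ.get("AICHOR_OUTPUT_PATH"), "results.npy")
with s3.open(output_path, "wb") as file:
    np.save(file, array)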