S3 - Accessing data in S3 quickly
The `S3` client is a wrapper over the standard AWS Python library, boto3. It contains enhancements that are relevant for data-intensive applications:
- Supports accessing large amounts of data quickly through parallel operations (functions with the `_many` suffix). You can download data at up to 20Gbps on a large EC2 instance.
- Improved error handling.
- Supports versioned data through `S3(run=self)` and `S3(run=Run)`.
- User-friendly API with minimal boilerplate.
- Convenient API for advanced features such as range requests (downloading partial files) and object headers.
For instructions on how to use the class, see Loading and Storing Data.
The `S3` client

```python
from metaflow import S3
```
The Metaflow S3 client.
This object manages the connection to S3 and a temporary directory that is used to download objects. Note that in most cases, when the data fits in memory, no local disk I/O is needed: operations are cached by the operating system, which makes them fast as long as enough memory is available.
The easiest way is to use this object as a context manager:
```python
with S3() as s3:
    data = [obj.blob for obj in s3.get_many(urls)]
    print(data)
```
The context manager takes care of creating and deleting a temporary directory automatically. Without a context manager, you must call `.close()` to delete the directory explicitly:
```python
s3 = S3()
data = [obj.blob for obj in s3.get_many(urls)]
s3.close()
```
You can customize the location of the temporary directory with `tmproot`. It defaults to the current working directory.
To make it easier to deal with object locations, the client can be initialized with an S3 path prefix. There are three ways to handle locations, illustrated in the sketch after this list:
- Use a `metaflow.Run` object or `self`, e.g. `S3(run=self)`, which initializes the prefix with the global `DATATOOLS_S3ROOT` path combined with the current run ID. This mode makes it easy to version data based on the run ID consistently. You can use the `bucket` and `prefix` arguments to override parts of `DATATOOLS_S3ROOT`.
- Specify an S3 prefix explicitly with `s3root`, e.g. `S3(s3root='s3://mybucket/some/path')`.
- Specify nothing, i.e. `S3()`, in which case all operations require a full S3 url prefixed with `s3://`.
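For illustration, a minimal sketch of the three modes; the bucket, path, and keys below are placeholders:

```python
from metaflow import S3

# Inside a step, version data by run ID ('self' is the FlowSpec instance):
with S3(run=self) as s3:
    s3.put('stats.json', '{"rows": 100}')

# Explicit prefix: keys are resolved relative to s3root.
with S3(s3root='s3://mybucket/some/path') as s3:
    obj = s3.get('stats.json')

# No prefix: every operation requires a full s3:// url.
with S3() as s3:
    obj = s3.get('s3://mybucket/some/path/stats.json')
```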
tmproot: str, default: '.'
Where to store the temporary directory.
bucket: str, optional
Override the bucket from `DATATOOLS_S3ROOT` when `run` is specified.
prefix: str, optional
Override the path from `DATATOOLS_S3ROOT` when `run` is specified.
run: FlowSpec or Run, optional
Derive path prefix from the current or a past run ID, e.g. `S3(run=self)`.
s3root: str, optional
If `run` is not specified, use this as the S3 prefix.
Downloading data
`get`: Get a single object from S3.
key: Union[str, S3GetObject], optional, default None
Object to download. It can be an S3 url, a path suffix, or an S3GetObject that defines a range of data to download. If None, or not provided, gets the S3 root.
return_missing: bool, default False
If set to True, do not raise an exception for a missing key but return it as an `S3Object` with `.exists == False`.
return_info: bool, default True
If set to True, fetch the content-type and user metadata associated with the object at no extra cost, included for symmetry with `get_many`.
S3Object
An S3Object corresponding to the object requested.
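A sketch of a single download (the url is a placeholder):

```python
from metaflow import S3

with S3() as s3:
    obj = s3.get('s3://mybucket/some/path/data.csv', return_missing=True)
    if obj.exists:
        print(obj.path)  # local temporary file holding the data
        print(obj.text)  # contents decoded as UTF-8
```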
`get_many`: Get many objects from S3 in parallel.
keys: Iterable[Union[str, S3GetObject]]
Objects to download. Each object can be an S3 url, a path suffix, or an S3GetObject that defines a range of data to download.
return_missing: bool, default False
If set to True, do not raise an exception for a missing key but return it as an `S3Object` with `.exists == False`.
return_info: bool, default True
If set to True, fetch the content-type and user metadata associated with the object at no extra cost, included for symmetry with `get`.
List[S3Object]
S3Objects corresponding to the objects requested.
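A sketch of a parallel download (urls are placeholders):

```python
from metaflow import S3

urls = ['s3://mybucket/some/path/a.csv',
        's3://mybucket/some/path/b.csv']
with S3() as s3:
    for obj in s3.get_many(urls, return_missing=True):
        print(obj.key, obj.exists, obj.size)
```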
`get_recursive`: Get many objects from S3 recursively in parallel.
keys: Iterable[str]
Prefixes to download recursively. Each prefix can be an S3 url or a path suffix that defines the root prefix under which all objects are downloaded.
return_info: bool, default False
If set to True, fetch the content-type and user metadata associated with the object.
List[S3Object]
S3Objects stored under the given prefixes.
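A sketch, assuming a hypothetical prefix with `models/` and `logs/` subtrees:

```python
from metaflow import S3

with S3(s3root='s3://mybucket/some/path') as s3:
    # Download everything under models/ and logs/ in parallel.
    objs = s3.get_recursive(['models', 'logs'])
    print(sum(obj.size for obj in objs), 'bytes downloaded')
```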
`get_all`: Get all objects under the prefix set in the `S3` constructor.
This method requires that the `S3` object is initialized either with `run` or `s3root`.
return_info: bool, default False
If set to True, fetch the content-type and user metadata associated with the object.
Iterable[S3Object]
S3Objects stored under the main prefix.
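A sketch, assuming it runs inside a step so that `self` refers to the FlowSpec instance:

```python
from metaflow import S3

with S3(run=self) as s3:
    # Download every object stored under this run's prefix.
    for obj in s3.get_all():
        print(obj.key, obj.size)
```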
Listing objects
`list_paths`: List the next level of paths in S3.
If multiple keys are specified, listings are done in parallel. The returned S3Objects have `.exists == False` if the path refers to a prefix, not an existing S3 object.
For instance, if the directory hierarchy is

```
a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt
```

the `list_paths(['a', 'f'])` call returns

```
a/0.txt (exists == True)
a/b/ (exists == False)
a/c/ (exists == False)
a/d/ (exists == False)
f/4.txt (exists == True)
```
keys: Iterable[str], optional, default None
List of paths.
List[S3Object]
S3Objects under the given paths, including prefixes (directories) that do not correspond to leaf objects.
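A sketch of listing the hierarchy above (the prefix is a placeholder):

```python
from metaflow import S3

with S3(s3root='s3://mybucket/some/path') as s3:
    for obj in s3.list_paths(['a', 'f']):
        # Prefixes ("directories") have exists == False.
        print(obj.key, obj.exists)
```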
`list_recursive`: List all objects recursively under the given prefixes.
If multiple keys are specified, listings are done in parallel. All objects returned have `.exists == True` as this call always returns leaf objects.
For instance, if the directory hierarchy is

```
a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt
```

the `list_recursive(['a', 'f'])` call returns

```
a/0.txt (exists == True)
a/b/1.txt (exists == True)
a/c/2.txt (exists == True)
a/d/e/3.txt (exists == True)
f/4.txt (exists == True)
```
keys: Iterable[str], optional, default None
List of paths.
List[S3Object]
S3Objects under the given paths.
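A sketch of a recursive listing (the prefix is a placeholder):

```python
from metaflow import S3

with S3(s3root='s3://mybucket/some/path') as s3:
    leaves = s3.list_recursive(['a', 'f'])
    print(sum(obj.size for obj in leaves), 'bytes in total')
```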
Uploading data
`put`: Upload a single object to S3.
key: Union[str, S3PutObject]
Object path. It can be an S3 url or a path suffix.
obj: PutValue
An object to store in S3. Strings are converted to UTF-8 encoding.
overwrite: bool, default True
Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.
content_type: str, optional, default None
Optional MIME type for the object.
metadata: Dict[str, str], optional, default None
A JSON-encodable dictionary of additional headers to be stored as metadata with the object.
str
URL of the object stored.
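A sketch of a single upload, assuming it runs inside a step:

```python
from metaflow import S3

with S3(run=self) as s3:
    url = s3.put('report.json',
                 '{"rows": 100}',  # strings are stored as UTF-8
                 content_type='application/json')
    print('stored at', url)
```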
`put_many`: Upload many objects to S3.
Each object to be uploaded can be specified in two ways:
- As a `(key, obj)` tuple, where `key` is a string specifying the path and `obj` is a string or a bytes object.
- As an `S3PutObject`, which contains additional metadata to be stored with the object.
key_objs: List[Union[Tuple[str, PutValue], S3PutObject]]
List of key-object pairs to upload.
overwrite: bool, default True
Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.
List[Tuple[str, str]]
List of `(key, url)` pairs corresponding to the objects uploaded.
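A sketch of a parallel upload, assuming it runs inside a step:

```python
from metaflow import S3

with S3(run=self) as s3:
    uploaded = s3.put_many([('alpha', 'data for alpha'),   # str value
                            ('beta', b'bytes for beta')])  # bytes value
    for key, url in uploaded:
        print(key, '->', url)
```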
`put_files`: Upload many local files to S3.
Each file to be uploaded can be specified in two ways:
- As a `(key, path)` tuple, where `key` is a string specifying the S3 path and `path` is the path to a local file.
- As an `S3PutObject`, which contains additional metadata to be stored with the file.
key_paths: List[Union[Tuple[str, PutValue], S3PutObject]]
List of files to upload.
overwrite: bool, default True
Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.
List[Tuple[str, str]]
List of `(key, url)` pairs corresponding to the files uploaded.
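A sketch of uploading local files, assuming it runs inside a step; the keys and local paths are hypothetical:

```python
from metaflow import S3

with S3(run=self) as s3:
    uploaded = s3.put_files([('train', '/tmp/train.csv'),
                             ('test', '/tmp/test.csv')])
```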
Querying metadata
`info`: Get metadata about a single object in S3.
This call makes a single `HEAD` request to S3, which can be much faster than downloading all data with `get`.
key: str, optional, default None
Object to query. It can be an S3 url or a path suffix.
return_missing: bool, default False
If set to True, do not raise an exception for a missing key but return it as an `S3Object` with `.exists == False`.
S3Object
An S3Object corresponding to the object requested. The object will have `.downloaded == False`.
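A sketch of a metadata query (the url is a placeholder):

```python
from metaflow import S3

with S3() as s3:
    obj = s3.info('s3://mybucket/some/path/data.csv', return_missing=True)
    if obj.exists:
        # Metadata only; no data was downloaded.
        print(obj.size, obj.content_type)
```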
`info_many`: Get metadata about many objects in S3 in parallel.
This call makes `HEAD` requests to S3 in parallel, which can be much faster than downloading all data with `get_many`.
keys: Iterable[str]
Objects to query. Each key can be an S3 url or a path suffix.
return_missing: bool, default False
If set to True, do not raise an exception for a missing key but return it as an `S3Object` with `.exists == False`.
List[S3Object]
A list of S3Objects corresponding to the paths requested. The objects will have `.downloaded == False`.
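A sketch of querying many objects at once (urls are placeholders):

```python
from metaflow import S3

urls = ['s3://mybucket/some/path/a.csv',
        's3://mybucket/some/path/b.csv']
with S3() as s3:
    sizes = {obj.key: obj.size
             for obj in s3.info_many(urls, return_missing=True)}
```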
Handling results with S3Object
Most operations above return `S3Object`s that encapsulate information about S3 paths and objects.
Note that the data itself is not kept in these objects; it is stored in a temporary directory that is accessible through the properties of this object.
This object represents a path or an object in S3, with an optional local copy.
`S3Object`s are not instantiated directly, but they are returned by many methods of the `S3` client.
`exists`: Does this key correspond to an object in S3?
bool
True if this object points at an existing object (file) in S3.
`downloaded`: Has this object been downloaded?
If True, the contents can be accessed through the `path`, `blob`, and `text` properties.
bool
True if the contents of this object have been downloaded.
`key`: The key given to the `get` call that produced this object. This may be a full S3 url or a suffix, based on what was requested.
str
Key requested.
`path`: Path to a local temporary file corresponding to the downloaded object.
This file is deleted automatically when the `S3` scope exits. Returns None if this S3Object has not been downloaded.
str
Local path, if the object has been downloaded.
`blob`: Contents of the object as a byte string, or None if the object hasn't been downloaded.
bytes
Contents of the object as bytes.
`text`: Contents of the object as a string, or None if the object hasn't been downloaded.
The object is assumed to contain UTF-8 encoded data.
str
Contents of the object as text.
`size`: Size of the object in bytes.
Returns None if the key does not correspond to an object in S3.
int
Size of the object in bytes, if the object exists.
`has_info`: Returns True if this `S3Object` contains the content-type MIME header or user-defined metadata.
If False, this means that `content_type`, `metadata`, `range_info`, and `last_modified` will return None.
bool
True if additional metadata is available.
`metadata`: Returns a dictionary of user-defined metadata, or None if no metadata is defined.
Dict
User-defined metadata.
`content_type`: Returns the content-type of the S3 object, or None if it is not defined.
str
Content type or None if the content type is undefined.
`range_info`: If the object corresponds to a partially downloaded object, returns information about what was downloaded.
The returned object has the following fields:
- `total_size`: Size of the object in S3.
- `request_offset`: The starting offset.
- `request_length`: The number of bytes downloaded.
`last_modified`: Returns the last modified unix timestamp of the object.
int
Unix timestamp corresponding to the last modified time.
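Tying the properties together, a sketch (the url is a placeholder):

```python
from metaflow import S3

with S3() as s3:
    obj = s3.get('s3://mybucket/some/path/data.csv')
    if obj.exists and obj.downloaded:
        print(obj.key)        # key as requested
        print(obj.path)       # local temporary file
        print(len(obj.blob))  # raw bytes
        print(obj.text[:80])  # UTF-8 decoded contents
        if obj.has_info:
            print(obj.content_type, obj.metadata, obj.last_modified)
```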
Helper Objects
These objects are simple containers that are used to pass information to `get_many`, `put_many`, and `put_files`. You may use your own objects instead of them, as long as they provide the same set of attributes.
```python
from metaflow.datatools.s3 import S3GetObject
```
Represents a chunk of an S3 object. A range query is performed to download only a subset of data, `object[key][offset:offset + length]`, from S3.
key: str
Key identifying the object. Works the same way as any `key` passed to `get` or `get_many`.
offset: int
A byte offset in the file.
length: int
The number of bytes to download.
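A sketch of a range request, assuming a hypothetical large object:

```python
from metaflow import S3
from metaflow.datatools.s3 import S3GetObject

with S3() as s3:
    # Download only the first kilobyte of the object.
    obj = s3.get(S3GetObject(key='s3://mybucket/some/path/big.bin',
                             offset=0,
                             length=1024))
    print(obj.range_info)
```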
```python
from metaflow.datatools.s3 import S3PutObject
```
Defines an object with metadata to be uploaded with `put_many` or `put_files`.
key: str
Key identifying the object. Works the same way as `key` passed to `put` or `put_many`.
value: str or bytes
Object to upload. Works the same way as `obj` passed to `put` or `put_many`.
path: str
Path to a local file. Works the same way as `path` passed to `put_files`.
content_type: str
Optional MIME type for the file.
metadata: Dict
A JSON-encodable dictionary of additional headers to be stored as metadata with the file.
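A sketch of an upload with metadata, assuming it runs inside a step and that unused fields of `S3PutObject` may be omitted:

```python
from metaflow import S3
from metaflow.datatools.s3 import S3PutObject

with S3(run=self) as s3:
    s3.put_many([S3PutObject(key='report.json',
                             value='{"rows": 100}',
                             content_type='application/json',
                             metadata={'author': 'me'})])
```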