S3 - Accessing data in S3 quickly

The S3 client is a wrapper over the standard AWS Python library, boto3. It contains enhancements that are relevant for data-intensive applications:

  • Supports accessing large amounts of data quickly through parallel operations (functions with the _many suffix). You can download at up to 20 Gbps on a large EC2 instance.
  • Improved error handling.
  • Supports versioned data through S3(run=self) and S3(run=Run).
  • User-friendly API with minimal boilerplate.
  • Convenient API for advanced features such as range requests (downloading partial files) and object headers.

For instructions on how to use the class, see Loading and Storing Data.

The S3 client

S3(tmproot='.', bucket=None, prefix=None, run=None, s3root=None)

[source]

from metaflow import S3

The Metaflow S3 client.

This object manages the connection to S3 and a temporary directory that is used to download objects. Note that in most cases, when the data fits in memory, no local disk I/O is needed, as operations are cached by the operating system. This makes operations fast as long as enough memory is available.

The easiest way is to use this object as a context manager:

with S3() as s3:
    data = [obj.blob for obj in s3.get_many(urls)]
print(data)

The context manager takes care of creating and deleting a temporary directory automatically. Without a context manager, you must call .close() to delete the directory explicitly:

s3 = S3()
data = [obj.blob for obj in s3.get_many(urls)]
s3.close()

You can customize the location of the temporary directory with tmproot. It defaults to the current working directory.

To make it easier to deal with object locations, the client can be initialized with an S3 path prefix. There are three ways to handle locations:

  1. Use a metaflow.Run object or self, e.g. S3(run=self) which initializes the prefix with the global DATATOOLS_S3ROOT path, combined with the current run ID. This mode makes it easy to version data based on the run ID consistently. You can use the bucket and prefix to override parts of DATATOOLS_S3ROOT.

  2. Specify an S3 prefix explicitly with s3root, e.g. S3(s3root='s3://mybucket/some/path').

  3. Specify nothing, i.e. S3(), in which case all operations require a full S3 url prefixed with s3://.

Parameters 

tmproot: str, default: '.'

Where to store the temporary directory.

bucket: str, optional

Override the bucket from DATATOOLS_S3ROOT when run is specified.

prefix: str, optional

Override the path from DATATOOLS_S3ROOT when run is specified.

run: FlowSpec or Run, optional

Derive path prefix from the current or a past run ID, e.g. S3(run=self).

s3root: str, optional

If run is not specified, use this as the S3 prefix.

S3.close(self)

[source]

Delete all temporary files downloaded in this context.

Downloading data

S3.get(self, key, return_missing, return_info)

[source]

Get a single object from S3.

Parameters 

key: Union[str, S3GetObject], optional, default None

Object to download. It can be an S3 url, a path suffix, or an S3GetObject that defines a range of data to download. If None, or not provided, gets the S3 root.

return_missing: bool, default False

If set to True, do not raise an exception for a missing key but return it as an S3Object with .exists == False.

return_info: bool, default True

If set to True, fetch the content-type and user metadata associated with the object at no extra cost, included for symmetry with get_many.

Returns 

S3Object

An S3Object corresponding to the object requested.

S3.get_many(self, keys, return_missing, return_info)

[source]

Get many objects from S3 in parallel.

Parameters 

keys: Iterable[Union[str, S3GetObject]]

Objects to download. Each object can be an S3 url, a path suffix, or an S3GetObject that defines a range of data to download.

return_missing: bool, default False

If set to True, do not raise an exception for a missing key but return it as an S3Object with .exists == False.

return_info: bool, default True

If set to True, fetch the content-type and user metadata associated with each object at no extra cost.

Returns 

List[S3Object]

S3Objects corresponding to the objects requested.

S3.get_recursive(self, keys, return_info)

[source]

Get many objects from S3 recursively in parallel.

Parameters 

keys: Iterable[str]

Prefixes to download recursively. Each prefix can be an S3 url or a path suffix that defines the root prefix under which all objects are downloaded.

return_info: bool, default False

If set to True, fetch the content-type and user metadata associated with the object.

Returns 

List[S3Object]

S3Objects stored under the given prefixes.

S3.get_all(self, return_info)

[source]

Get all objects under the prefix set in the S3 constructor.

This method requires that the S3 object is initialized either with run or s3root.

Parameters 

return_info: bool, default False

If set to True, fetch the content-type and user metadata associated with the object.

Returns 

Iterable[S3Object]

S3Objects stored under the main prefix.

Listing objects

S3.list_paths(self, keys)

[source]

List the next level of paths in S3.

If multiple keys are specified, listings are done in parallel. The returned S3Objects have .exists == False if the path refers to a prefix, not an existing S3 object.

For instance, if the directory hierarchy is

a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt

The list_paths(['a', 'f']) call returns

a/0.txt (exists == True)
a/b/ (exists == False)
a/c/ (exists == False)
a/d/ (exists == False)
f/4.txt (exists == True)

Parameters 

keys: Iterable[str], optional, default None

List of paths.

Returns 

List[S3Object]

S3Objects under the given paths, including prefixes (directories) that do not correspond to leaf objects.

S3.list_recursive(self, keys)

[source]

List all objects recursively under the given prefixes.

If multiple keys are specified, listings are done in parallel. All objects returned have .exists == True as this call always returns leaf objects.

For instance, if the directory hierarchy is

a/0.txt
a/b/1.txt
a/c/2.txt
a/d/e/3.txt
f/4.txt

The list_recursive(['a', 'f']) call returns

a/0.txt (exists == True)
a/b/1.txt (exists == True)
a/c/2.txt (exists == True)
a/d/e/3.txt (exists == True)
f/4.txt (exists == True)

Parameters 

keys: Iterable[str], optional, default None

List of paths.

Returns 

List[S3Object]

S3Objects under the given paths.

Uploading data

S3.put(self, key, obj, overwrite, content_type, metadata)

[source]

Upload a single object to S3.

Parameters 

key: Union[str, S3PutObject]

Object path. It can be an S3 url or a path suffix.

obj: PutValue

An object to store in S3. Strings are converted to UTF-8 encoding.

overwrite: bool, default True

Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.

content_type: str, optional, default None

Optional MIME type for the object.

metadata: Dict[str, str], optional, default None

A JSON-encodable dictionary of additional headers to be stored as metadata with the object.

Returns 

str

URL of the object stored.

S3.put_many(self, key_objs, overwrite)

[source]

Upload many objects to S3.

Each object to be uploaded can be specified in two ways:

  1. As a (key, obj) tuple where key is a string specifying the path and obj is a string or a bytes object.

  2. As an S3PutObject which contains additional metadata to be stored with the object.

Parameters 

key_objs: List[Union[Tuple[str, PutValue], S3PutObject]]

List of key-object pairs to upload.

overwrite: bool, default True

Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.

Returns 

List[Tuple[str, str]]

List of (key, url) pairs corresponding to the objects uploaded.

S3.put_files(self, key_paths, overwrite)

[source]

Upload many local files to S3.

Each file to be uploaded can be specified in two ways:

  1. As a (key, path) tuple where key is a string specifying the S3 path and path is the path to a local file.

  2. As an S3PutObject which contains additional metadata to be stored with the file.

Parameters 

key_paths: List[Union[Tuple[str, str], S3PutObject]]

List of files to upload.

overwrite: bool, default True

Overwrite the object if it exists. If set to False, the operation succeeds without uploading anything if the key already exists.

Returns 

List[Tuple[str, str]]

List of (key, url) pairs corresponding to the files uploaded.

Querying metadata

S3.info(self, key, return_missing)

[source]

Get metadata about a single object in S3.

This call makes a single HEAD request to S3 which can be much faster than downloading all data with get.

Parameters 

key: str, optional, default None

Object to query. It can be an S3 url or a path suffix.

return_missing: bool, default False

If set to True, do not raise an exception for a missing key but return it as an S3Object with .exists == False.

Returns 

S3Object

An S3Object corresponding to the object requested. The object will have .downloaded == False.

S3.info_many(self, keys, return_missing)

[source]

Get metadata about many objects in S3 in parallel.

This call makes a HEAD request for each object in parallel, which can be much faster than downloading all data with get_many.

Parameters 

keys: Iterable[str]

Objects to query. Each key can be an S3 url or a path suffix.

return_missing: bool, default False

If set to True, do not raise an exception for a missing key but return it as an S3Object with .exists == False.

Returns 

List[S3Object]

A list of S3Objects corresponding to the paths requested. The objects will have .downloaded == False.

Handling results with S3Object

Most operations above return S3Objects that encapsulate information about S3 paths and objects.

Note that the data itself is not kept in these objects but it is stored in a temporary directory which is accessible through the properties of this object.

S3Object()

[source]

This object represents a path or an object in S3, with an optional local copy.

S3Objects are not instantiated directly, but they are returned by many methods of the S3 client.

S3Object.exists

[source]

Does this key correspond to an object in S3?

Returns 

bool

True if this object points at an existing object (file) in S3.

S3Object.downloaded

[source]

Has this object been downloaded?

If True, the contents can be accessed through path, blob, and text properties.

Returns 

bool

True if the contents of this object have been downloaded.

S3Object.url

[source]

S3 location of the object.

Returns 

str

The S3 location of this object.

S3Object.prefix

[source]

Prefix requested that matches this object.

Returns 

str

Requested prefix

S3Object.key

[source]

The key given to the get call that produced this object.

This may be a full S3 URL or a suffix based on what was requested.

Returns 

str

Key requested.

S3Object.path

[source]

Path to a local temporary file corresponding to the object downloaded.

This file gets deleted automatically when the S3 context exits. Returns None if this S3Object has not been downloaded.

Returns 

str

Local path, if the object has been downloaded.

S3Object.blob

[source]

Contents of the object as a byte string or None if the object hasn't been downloaded.

Returns 

bytes

Contents of the object as bytes.

S3Object.text

[source]

Contents of the object as a string or None if the object hasn't been downloaded.

The object is assumed to contain UTF-8 encoded data.

Returns 

str

Contents of the object as text.

S3Object.size

[source]

Size of the object in bytes.

Returns None if the key does not correspond to an object in S3.

Returns 

int

Size of the object in bytes, if the object exists.

S3Object.has_info

[source]

Returns True if this S3Object contains the content-type MIME header or user-defined metadata.

If False, this means that content_type, metadata, range_info and last_modified will return None.

Returns 

bool

True if additional metadata is available.

S3Object.metadata

[source]

Returns a dictionary of user-defined metadata, or None if no metadata is defined.

Returns 

Dict

User-defined metadata.

S3Object.content_type

[source]

Returns the content-type of the S3 object or None if it is not defined.

Returns 

str

Content type or None if the content type is undefined.

S3Object.range_info

[source]

If the object corresponds to a partially downloaded object, returns information of what was downloaded.

The returned object has the following fields:

  • total_size: Size of the object in S3.
  • request_offset: The starting offset.
  • request_length: The number of bytes downloaded.

S3Object.last_modified

[source]

Returns the last modified unix timestamp of the object.

Returns 

int

Unix timestamp corresponding to the last modified time.

Helper Objects

These objects are simple containers that are used to pass information to get_many, put_many, and put_files. You may use your own objects instead of them, as long as they provide the same set of attributes.

S3GetObject(key, offset, length)

[source]

from metaflow.datatools.s3 import S3GetObject

Represents a chunk of an S3 object. A range query is performed to download only a subset of data, object[key][offset:offset + length], from S3.

Attributes 

key: str

Key identifying the object. Works the same way as any key passed to get or get_many.

offset: int

A byte offset in the file.

length: int

The number of bytes to download.

S3PutObject(key, value, path, content_type, metadata)

[source]

from metaflow.datatools.s3 import S3PutObject

Defines an object with metadata to be uploaded with put_many or put_files.

Attributes 

key: str

Key identifying the object. Works the same way as key passed to put or put_many.

value: str or bytes

Object to upload. Works the same way as obj passed to put or put_many.

path: str

Path to a local file. Works the same way as path passed to put_files.

content_type: str

Optional MIME type for the file.

metadata: Dict

A JSON-encodable dictionary of additional headers to be stored as metadata with the file.