`metaflow.S3` provides a way to load data directly from S3, bypassing any query engines such as Spark. Combined with a metadata catalog, it is easy to write shims on top of `metaflow.S3` to directly interface with the data files on S3 backing your tables. Since data is loaded directly from S3, there is no limit on the number of parallel processes. The size of data is only limited by the size of your instance, which can be easily controlled with the `@resources` decorator. The best part is that this approach is blazingly fast compared to executing SQL.
`model.save()` in a table.
When you assign an object to `self` in your Metaflow flow, the object gets automatically persisted in S3 as a Metaflow artifact. Hence, in most cases you do not need to worry about saving data or models to S3 explicitly. We recommend that you use Metaflow artifacts whenever possible, since they are easily accessible through the Client API by you, by other people, and by other workflows.
A benefit of `metaflow.S3` over Metaflow artifacts is that you get to see and control the S3 locations of your data. On the other hand, you must take care of object serialization yourself: `metaflow.S3` only deals with objects of type `str` or `bytes`.
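For example, a structured object can be serialized with `json` or `pickle` before `put` and deserialized after `get` (a minimal sketch; the variable names are illustrative):

```python
import json
import pickle

# metaflow.S3 stores and returns raw str/bytes, so serialize structured
# objects yourself before calling put (model_params is a made-up example):
model_params = {"alpha": 0.1, "layers": [64, 32]}

as_text = json.dumps(model_params)     # suitable for s3.put(key, as_text)
as_bytes = pickle.dumps(model_params)  # suitable for s3.put(key, as_bytes)

# After s3.get(key), reverse the serialization:
assert json.loads(as_text) == model_params
assert pickle.loads(as_bytes) == model_params
```

JSON keeps the data readable by external systems, while `pickle` handles arbitrary Python objects at the cost of portability.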
`metaflow.S3` provides two key benefits: first, when used in Metaflow flows, it can piggyback on Metaflow versioning, which makes it easy to track the lineage of an object back to the Metaflow run that produced it. Second, `metaflow.S3` provides better throughput than any other S3 client that we are aware of; in other words, it is very fast at loading and storing large amounts of data in S3.
There is no need to use `metaflow.S3` if you can use Metaflow artifacts instead. In contrast to Metaflow artifacts, `metaflow.S3` is more tedious to use, uses space more wastefully, and is less suitable for moving data between Metaflow steps reliably.
It is recommended that you use `metaflow.S3` inside a `with` scope in Python. Objects retrieved from S3 are stored in local temporary files for the lifetime of the `with` scope, not in memory. You can use `metaflow.S3` without `with`, but in this case you need to call `s3.close()` to get rid of the temporary files. See examples of this below.
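The two patterns can be sketched as follows (hypothetical helper functions; running them requires AWS access, so `metaflow` is imported lazily inside each function):

```python
def read_with_scope(url):
    # Recommended: the temporary file backing the object is removed
    # automatically when the `with` scope exits.
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3() as s3:
        return s3.get(url).text

def read_manual_close(url):
    # Without `with`, you are responsible for calling s3.close();
    # otherwise temporary files accumulate on local disk.
    from metaflow import S3
    s3 = S3()
    try:
        return s3.get(url).text
    finally:
        s3.close()
```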
When using `metaflow.S3`, you need to set your `@resources` properly. However, don't request more resources than what your workload actually needs.
You can tell `metaflow.S3` whether it is being used in the context of a Metaflow run. A run can refer to a currently running flow (`run=self`) or a past run identified by a `Run` object of the Client API. If `run` is not specified, `metaflow.S3` can be used to access data without versioning in arbitrary S3 locations.
A common use of `metaflow.S3` is to store auxiliary data in a Metaflow flow. Here is an example:
You can share the resulting URL, `s3://my-bucket/metaflow/userdata/v1/S3DemoFlow/3/example_object`, with external systems. Note that the URL includes both the flow name, `S3DemoFlow`, as well as its unique run ID, `3`, which allow us to track the lineage of the object back to the run that produced it.
`metaflow.S3` provides a default S3 location for storing data. You can change the location by passing `S3(bucket='my-bucket', prefix='/my/prefix')` to the constructor; Metaflow versioning information is then concatenated to the given prefix.
You can also fetch a past `Run` and use it to initialize an `S3` object. If `S3` is initialized without any arguments, all operations require a full S3 URL.
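The initialization options discussed above can be summarized in a small sketch (hypothetical helper; `S3DemoFlow/3` reuses the run discussed above):

```python
def make_client(mode, flow=None):
    # Three ways to initialize the client; `flow` would be a running FlowSpec.
    from metaflow import S3, Run  # lazy import; usage needs AWS access
    if mode == 'in_run':
        return S3(run=flow)                 # inside a flow step: S3(run=self)
    elif mode == 'past_run':
        return S3(run=Run('S3DemoFlow/3'))  # a past run via the Client API
    else:
        return S3()                         # no context: full s3:// URLs required
```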
If the requested URL does not exist, the `get` call will raise an exception. You can call `get` with `return_missing=True` if you want to return a missing URL as an ordinary result object, as described in the section below.
By default, `put_*` calls will overwrite existing keys in S3. To avoid this behavior, you can invoke the `put_*` calls with `overwrite=False`. Refer to this section for some of the pitfalls involved with overwriting keys in S3.
`get` operations return an `S3Object`, backed by a temporary file on local disk, which exposes a number of attributes about the object, such as `s3obj.url`, `s3obj.path` (the local temporary file), `s3obj.blob` (the contents as bytes), `s3obj.text` (the contents as a string), and `s3obj.exists`. Note that you should access the data inside the `with` scope, as the temporary file pointed at by `s3obj.path` gets deleted when the scope exits.
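For instance (a hypothetical helper; `metaflow` is imported lazily since running it needs AWS credentials):

```python
def describe(url):
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3() as s3:
        s3obj = s3.get(url)  # downloads the object to a temporary file
        # Read the data inside the `with` scope: the temporary file at
        # s3obj.path is deleted when the scope exits.
        return {
            'url': s3obj.url,
            'local_path': s3obj.path,
            'size_bytes': len(s3obj.blob),
            'first_chars': s3obj.text[:80],
        }
```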
An `S3Object` may also refer to an S3 URL that does not correspond to an object in S3. These objects have the `exists` property set to `False`. Non-existent objects may be returned by a `list_paths` call, if the result refers to an S3 prefix, not an object. Listing operations also set the `downloaded` property to `False`, to distinguish them from operations that download data locally. Also `get` and `get_many` may return non-existent objects if you call these methods with `return_missing=True`.
`put` operations work equally with and without a run context; the context is only used to construct an appropriate S3 URL. While you can store a single object with `.put()` as shown above, `metaflow.S3` really shines at operating on multiple files at once.
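For instance, a batch of objects can be stored in one parallel operation with `put_many` (a sketch; the bucket and prefix are assumptions):

```python
def store_batch(key_value_pairs):
    # key_value_pairs: a list of (key, str_or_bytes) tuples.
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3(s3root='s3://my-bucket/my/prefix') as s3:  # assumed location
        # Uploads in parallel; returns a list of (key, url) tuples.
        return s3.put_many(key_value_pairs)
```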
The list of `S3Object`s returned is always in the same order as long as the underlying data does not change. This can be important, e.g., if you use `metaflow.S3` to feed data to a model: the input data will be in a deterministic order, so results should be easily reproducible.
You can use `get_many()` to load arbitrarily many objects at once. `get_many()` loads objects in parallel, which is much faster than loading individual objects sequentially. You can achieve optimal throughput with S3 only when you operate on many files in parallel.
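A parallel load might be sketched like this (hypothetical helper; requires AWS access to run):

```python
def load_many(urls):
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3() as s3:
        # Downloads all URLs in parallel; the result order is deterministic
        # as long as the underlying data does not change.
        return [obj.blob for obj in s3.get_many(urls)]
```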
If any of the URLs is missing, the `get_many` call will raise an exception. If you don't want to fail all objects because of missing URLs, call `get_many` with `return_missing=True`. This will make `get_many` return missing URLs amongst the other results. You can distinguish between the found and not-found URLs using the `.exists` property of the returned objects.
`get_recursive` takes a list of prefixes. This is useful for achieving the maximum level of parallelism when retrieving data under multiple prefixes.
If you have initialized `S3` with a `run` or an `s3root`, you can use `get_all()` to get all files recursively under the given prefix.
To use `get` or `get_many`, you need to know the exact names of the objects to download. S3 is optimized for looking up specific names, so it is preferable to structure your code around known names. However, sometimes this is not possible and you need to check first what is available in S3.
For this purpose, `metaflow.S3` provides `list_paths` and `list_recursive`. The first method provides the next level of prefixes (directories) in S3, directly under the given prefix. The latter method provides all objects under the given prefix. Since `list_paths` returns a subset of the prefixes returned by `list_recursive`, it is typically a much faster operation.
Like the other operations, you can give `list_paths` a list of prefixes. You can check whether a listed entry is an actual object, rather than just a prefix, via the `.exists` property of the returned objects.
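A listing sketch (hypothetical helper; requires AWS access to run):

```python
def list_level(prefixes):
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3() as s3:
        # Returns the next level of prefixes/objects under each given prefix;
        # nothing is downloaded (each result has downloaded == False), and
        # .exists is True only for entries that are actual objects.
        return [(obj.url, obj.exists) for obj in s3.list_paths(prefixes)]
```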
`list_recursive` can also take a list of prefixes to process in parallel.
A common idiom is to list keys with `list_recursive`, filter out some keys from the listing, and provide the pruned list to `get_many` for fast, parallelized downloading.
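This idiom might look as follows (hypothetical helper; the `.csv` suffix is an assumption for illustration):

```python
def load_csv_files(prefix):
    from metaflow import S3  # lazy import; running this needs AWS access
    with S3() as s3:
        # List everything under `prefix`, keep only .csv keys,
        # then download the pruned list in parallel.
        urls = [obj.url
                for obj in s3.list_recursive([prefix])
                if obj.url.endswith('.csv')]
        return s3.get_many(urls)
```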
When using `metaflow.S3` in your Metaflow flows, make sure that every task and step writes to a unique key; otherwise you may find the results unpredictable and inconsistent.
Note that using `overwrite=False` in `put_*` calls changes the behavior of S3 slightly compared to the default mode of `overwrite=True`: there may be a small delay (typically on the order of milliseconds) before the key becomes available for reading.
An easy way to produce unique keys is to use the IDs provided by the `current` object as a part of your S3 keys. Alternatively, you can avoid managing keys altogether by assigning data to the `self` object in your flow.
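Building such task-unique keys with `current` can be sketched as follows (hypothetical helper; `current` is only populated inside a running Metaflow task):

```python
def task_unique_key(name):
    # Combine flow, run, step, and task IDs so that concurrent tasks
    # can never write to the same key.
    from metaflow import current  # only populated inside a running task
    return '/'.join([current.flow_name, current.run_id,
                     current.step_name, current.task_id, name])
```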