Managing External Libraries
What if your step code wants to import an external library? When you run Metaflow code locally, it behaves like any other Python code, so all libraries available to your Python interpreter can be imported in steps.
However, a core benefit of Metaflow is that the same code can be run in different environments without modifications. Clearly, this promise does not hold if your step code depends on locally installed libraries. The topic of providing isolated and encapsulated execution environments is a surprisingly complex one. We recommend the following approaches for handling external libraries, in order of preference:
If your code needs an additional Python module, for instance a library module that you wrote yourself, you can place the file in the same directory as your flow file. When Metaflow packages your code for remote execution, any `.py` files in the directory are included in the distribution automatically. In this case, you can simply `import mymodule` in the step code (see the sketch below). This works even for packages with multiple files, which can be included as subdirectories.

If you need a custom package that is too complex to include in the flow directory, one approach is to install it on the fly in your step code:

```python
os.system('pip install my_package')
import my_package
```

This approach is, however, not encouraged for multiple reasons:

- It makes the results harder to reproduce, since the version of `my_package` may change.
- Installing packages on the fly with `pip install` is brittle, especially if performed in tasks running in parallel.

Instead, to define external dependencies for a step, you can use the `@conda` decorator, which uses conda behind the scenes.
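To make the first approach concrete, here is a minimal, hypothetical sketch. It assumes a helper file `mymodule.py` with a `multiply` function sits in the same directory as the flow file, so Metaflow packages it automatically; the module, function, and flow names are illustrative.

```python
# mymodule.py (hypothetical helper, placed next to the flow file):
#
#     def multiply(x, y):
#         return x * y

from metaflow import FlowSpec, step

class HelperFlow(FlowSpec):

    @step
    def start(self):
        # mymodule is packaged together with the flow code,
        # so this import also works when the step runs remotely
        import mymodule
        self.result = mymodule.multiply(3, 4)
        self.next(self.end)

    @step
    def end(self):
        print('result: %s' % self.result)

if __name__ == '__main__':
    HelperFlow()
```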
Managing dependencies with the @conda decorator
Reproducibility is a core value of Machine Learning Infrastructure. It is hard to collaborate on data science projects effectively without being able to reproduce past results reliably. Metaflow tries to solve several questions related to reproducible research, principal among them dependency management: how can you, the data scientist, specify the libraries that your code needs so that the results are reproducible?
Note that reproducibility and dependency management are related but separate topics. We could solve either one individually. A classic `os.system('pip install pandas')` is an example of dependency management without reproducibility (what if the version of `pandas` changes?). On the other hand, we could make code perfectly reproducible by forbidding external libraries: reproducibility without dependency management.
Metaflow aims to solve both questions at once: how can we handle dependencies so that the results are reproducible? Specifically, it addresses the following three issues:
- How to make external dependencies available locally during development?
- How to execute code remotely on Kubernetes or AWS Batch with external dependencies?
- How to ensure that anyone can reproduce past results even months later?
Metaflow provides an execution environment context, `--environment=conda`, which runs every step in a separate environment that contains only the dependencies explicitly listed as requirements for that step. The solution relies on Conda, a language-agnostic open-source package manager by the authors of Numpy.
For instance:

```python
@conda(libraries={"pandas": "0.22.0"})
@step
def fit_model(self):
    ...
```
The above code snippet will execute the `fit_model` step in an automatically created conda environment that contains only specific pinned versions of Python, Pandas, and Metaflow (and its dependencies `boto3`, `click`, and `requests`). No unspecified libraries outside of the standard Python library would be available. This isolates your code from any external factors, such as automatically upgrading libraries.
Internally, Metaflow handles automatic dependency resolution, cross-platform support, environment snapshotting and caching in Amazon S3 (if enabled). We require that all dependencies are pinned to a specific version. This avoids any ambiguity about the version used and helps make deployments fully immutable; in other words, once you deploy a version in production, nothing will inadvertently change its behavior without explicit action.
Local Execution
Let's look at the `LinearFlow` from before:
```python
from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
```
You can execute this flow in a conda environment by executing:
```
$ python LinearFlow.py --environment=conda run
```
Metaflow will bootstrap a dedicated conda environment for each step of the workflow and execute each step in that isolated environment, resulting in output like this:
```
2019-11-27 20:04:27.579 Bootstrapping conda environment...(this could take a few minutes)
2019-11-27 20:05:13.623 Workflow starting (run-id 164):
2019-11-27 20:05:14.273 [164/start/4222215 (pid 14011)] Task is starting.
2019-11-27 20:05:16.426 [164/start/4222215 (pid 14011)] Task finished successfully.
2019-11-27 20:05:17.246 [164/a/4222216 (pid 14064)] Task is starting.
2019-11-27 20:05:19.014 [164/a/4222216 (pid 14064)] the data artifact is: hello world
2019-11-27 20:05:19.484 [164/a/4222216 (pid 14064)] Task finished successfully.
2019-11-27 20:05:20.192 [164/end/4222217 (pid 14117)] Task is starting.
2019-11-27 20:05:21.756 [164/end/4222217 (pid 14117)] the data artifact is still: hello world
2019-11-27 20:05:22.436 [164/end/4222217 (pid 14117)] Task finished successfully.
2019-11-27 20:05:22.512 Done!
```
You might notice some delay when you execute this flow for the first time, as Metaflow performs dependency resolution, creates the environments, and caches the results. Subsequent executions rely on this cache, reducing the overhead going forward.
Let's import the module `sklearn` in one of the steps:
```python
from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        import sklearn
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
```
Let's execute this flow in a conda environment again:
```
$ python LinearFlow.py --environment=conda run
```
You will notice that the execution progresses fairly quickly compared to before, since all the specified dependencies are already cached locally, but the flow fails at the `start` step with the error `ModuleNotFoundError: No module named 'sklearn'`, even though you might already have the module installed locally.
You can add an explicit dependency on the module `sklearn` by using the `@conda` decorator in the `start` step:
```python
from metaflow import FlowSpec, step, conda

class LinearFlow(FlowSpec):

    @conda(libraries={'scikit-learn': '0.21.1'})
    @step
    def start(self):
        import sklearn
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
```
Let's execute this flow in the conda environment again:
```
$ python LinearFlow.py --environment=conda run
```
You will notice that bootstrapping takes a little longer than before as we pull in the new set of dependencies (`scikit-learn` `0.21.1` and its dependencies), and the flow now succeeds. `scikit-learn 0.21.1` is available only to the `start` step and to no other step.
Every subsequent execution of your flow is guaranteed to execute in the same environment unless you explicitly make a change to the contrary. Behind the scenes, we resolve the dependencies you have specified in your steps and cache both the resolution order and the dependencies (stated and transitive) locally and in Amazon S3 for subsequent executions. We do this to isolate your code from changes unrelated to your code. This also allows for isolation between runs: you can use a different version of `tensorflow` in different flows, and even in different steps of the same flow, if that suits your use case.
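For example, here is a minimal sketch of what per-step isolation could look like; the flow name and the `tensorflow` versions below are illustrative and assume those versions are available in your conda channels:

```python
from metaflow import FlowSpec, step, conda

class MixedVersionsFlow(FlowSpec):

    # each step runs in its own isolated conda environment,
    # so the two pinned tensorflow versions never conflict
    @conda(libraries={'tensorflow': '2.3.0'})
    @step
    def start(self):
        import tensorflow as tf
        print('start uses tensorflow %s' % tf.__version__)
        self.next(self.end)

    @conda(libraries={'tensorflow': '2.4.1'})
    @step
    def end(self):
        import tensorflow as tf
        print('end uses tensorflow %s' % tf.__version__)

if __name__ == '__main__':
    MixedVersionsFlow()
```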
Remote Execution
You can execute your flow on Kubernetes or AWS Batch, like before:
```
$ python LinearFlow.py --environment=conda run --with kubernetes
$ python LinearFlow.py --environment=conda run --with batch
```
Since we cache the exact set of dependencies (stated and transitive) for your flow in Amazon S3, you are not at the mercy of an upstream package repository and you avoid overwhelming it, particularly while running many tasks in parallel. You are also guaranteed the same execution environment locally, on Kubernetes, and on AWS Batch.
Note that the exact set of dependencies and their behavior might differ between an execution on macOS (darwin) and one on Kubernetes/AWS Batch (linux).
@conda Tips and Tricks
Can I use an alternate dependency manager, given that conda can be slow at resolving dependencies?
By default, Metaflow relies on conda for dependency resolution, but for many data science packages conda can be quite slow for a variety of reasons. Mamba is another cross-platform package manager that is fully compatible with conda packages and offers better performance and reliability than conda. You can use mamba instead of conda by setting the environment variable `METAFLOW_CONDA_DEPENDENCY_RESOLVER=mamba`, either in your execution environment or inside your Metaflow config (usually located at `~/.metaflowconfig/`).
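For instance, one way to try this out is to set the variable for a single execution on the command line, reusing the `LinearFlow` example from above. This sketch assumes mamba is already installed in the environment from which you launch the flow:

```
$ METAFLOW_CONDA_DEPENDENCY_RESOLVER=mamba python LinearFlow.py --environment=conda run
```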
How do I specify the version of the Python interpreter?
By default, we take the version of the Python interpreter with which you invoke your flow. You can override it with whatever version you choose, e.g., `@conda(python='3.6.5')`.
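For instance, a step could pin the interpreter version alongside its libraries in a single `@conda` declaration; the step name and versions below are illustrative:

```python
@conda(python='3.6.5', libraries={'pandas': '0.22.0'})
@step
def fit_model(self):
    ...
```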
What about storing and retrieving data artifacts between steps in my flow?
Metaflow relies on pickle for object serialization and deserialization. Make sure that every step which interacts with an object artifact declares a compatible version (ideally the same version) of the module that defines the object as a dependency.
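For example, here is a minimal sketch of that pattern: one step stores a `pandas` DataFrame as an artifact and a later step reads it, so both steps pin the same `pandas` version (the flow name and version are illustrative):

```python
from metaflow import FlowSpec, step, conda

class ArtifactFlow(FlowSpec):

    @conda(libraries={'pandas': '0.22.0'})
    @step
    def start(self):
        import pandas as pd
        # the DataFrame is pickled and stored as an artifact by Metaflow
        self.df = pd.DataFrame({'x': [1, 2, 3]})
        self.next(self.end)

    # pin the same pandas version so the pickled DataFrame can be
    # deserialized without version-compatibility surprises
    @conda(libraries={'pandas': '0.22.0'})
    @step
    def end(self):
        print(self.df.describe())

if __name__ == '__main__':
    ArtifactFlow()
```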
How do I specify flow-level dependencies?
Using the flow-level `@conda_base` decorator, you can specify explicit library dependencies and the Python version for the whole flow, and also whether to exclude all steps from executing within a conda environment. Some examples:

```python
@conda_base(libraries={'numpy': '1.15.4'}, python='3.6.5')
class LinearFlow(FlowSpec):
    ...
```

```python
@conda_base(disabled=True)
class LinearFlow(FlowSpec):
    ...
```
Step-level `@conda` decorators will, for that step, directly update the explicit library dependencies, Python version, and conda environment exclusion specified by the flow-level `@conda_base` decorator. Some examples:
```python
from metaflow import FlowSpec, step, conda, conda_base

@conda_base(libraries={'numpy': '1.15.4'}, python='3.6.5')
class MyFlow(FlowSpec):

    @step
    def a(self):
        ...

    @conda(libraries={'numpy': '1.16.3'})
    @step
    def b(self):
        ...

    @conda(disabled=True)
    @step
    def c(self):
        ...
```
In this example, step `a` executes in the environment `libraries={'numpy': '1.15.4'}, python='3.6.5'`, step `b` executes in the environment `libraries={'numpy': '1.16.3'}, python='3.6.5'`, while step `c` executes outside the conda environment, in the user environment.