Managing Libraries
Whereas the previous page covered packaging of your own Python modules and packages, this page covers handling of 3rd party dependencies that are published as installable Python packages.
Metaflow supports installation of external packages from two Python package
repositories: PyPI and Conda,
using @pypi
and @conda
decorators. If you wonder why we need two
decorators and when to use which, see
Conda vs. PyPI. To learn why you
should use the decorators and not install packages
manually, see Packaging Internals.
The @pypi
and @conda
decorators
The @pypi
and @conda
decorators allow you to make arbitrary packages
available to Metaflow steps, as if you were installing them manually with
pip install
or conda install
. This functionality works in conjunction
with local code packaging, so
steps can execute in safely isolated, remote-execution friendly environments
that contain all dependencies they need.
Crucially, when using @pypi
or @conda, steps can only access packages that are explicitly included in the code package or specified in the decorator. System-wide packages are not available by design, which ensures that all dependencies are explicitly declared and managed, keeping the flow reproducible and stable.
All examples below work interchangeably with @pypi
and @conda
. The
examples include both @pypi
lines as well as @conda
lines commented out,
so you can easily test both decorators. In a real-life setting, you
would use either @conda
or @pypi
.
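For instance, a step annotated this way might look like the following sketch (a hypothetical flow, not one of the examples below; the pandas version pin is purely illustrative):
import pandas  # noqa: F401 -- not needed here; pandas is imported inside the step instead
from metaflow import FlowSpec, pypi, step  # add 'conda' to this import for the @conda variant

class SwitchDemoFlow(FlowSpec):

    @pypi(python='3.9.13', packages={'pandas': '2.1.1'})
    # @conda(python='3.9.13', packages={'pandas': '2.1.1'})
    @step
    def start(self):
        # the package is available only inside this step, in the pinned version
        import pandas as pd
        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    SwitchDemoFlow()
To switch from @pypi to @conda, you would simply comment out the @pypi line and uncomment the @conda line.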
Bleeding edge versions of the decorators
The default @pypi
and @conda
decorators provide basic functionality
that covers typical use cases. If you want to use more advanced features
that are not included in the default decorators yet, you can use
Netflix's Metaflow extensions.
These decorators are fully compatible with the built-in
@pypi
and @conda
but provide several additional features:
- Named environments, which make it easy to save and share environments.
- A more full-fledged
environment
command that lets you resolve environments from external requirements.txt
or environment.yml
files, as well as inspect and rehydrate environments used in any previously run step. - More extensive package support: you can mix and match Conda and PyPI packages, use more types of PyPI packages, etc.
- It is generally more efficient at resolving and caching environments, and provides options for faster performance.
To use, simply install this package. Documentation can be found here. Let us know on the Metaflow community Slack if you find these additional features useful!
@pypi
in action
This example demonstrates typical use of @pypi
(or @conda
). Save the flow as fractalflow.py
:
from metaflow import FlowSpec, card, pypi, step, current
from metaflow.cards import Image

class FractalFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.plot)

    @pypi(python='3.9.13',
          packages={'pyfracgen': '0.0.11',
                    'matplotlib': '3.8.0'})
    @card(type='blank')
    @step
    def plot(self):
        # pylint: disable=import-error,no-member
        import pyfracgen as pf
        from matplotlib import pyplot as plt

        string = "AAAAAABBBBBB"
        xbound = (2.5, 3.4)
        ybound = (3.4, 4.0)
        res = pf.lyapunov(
            string, xbound, ybound, width=4, height=3,
            dpi=300, ninit=2000, niter=2000
        )
        pf.images.markus_lyapunov_image(res, plt.cm.bone, plt.cm.bone_r, gammas=(8, 1))
        current.card.append(Image.from_matplotlib(plt.gcf()))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    FractalFlow()
Note the following details in the flow:
- We use the @pypi
decorator to make two additional libraries, pyfracgen
and matplotlib
, as well as their transitive dependencies, available in the plot
step.
- The @pypi
decorator requires that we specify the explicit versions of the packages we want to import. This ensures that the code executes predictably even when the packages change over time.
- We specify the Python version. Some packages containing non-Python code don't work with all Python versions, so fixing the Python version ensures reproducibility. Your colleague may have a different version installed locally than you do.
- We import
the packages inside the plot
step and not at the top of the file. The packages are only available in this step, so importing at the top level wouldn't work.
- The pylint
checker may get confused about packages that are not installed system-wide. The # pylint:
comment addresses the false alerts.
All these points apply to the @conda
decorator as well.
Execute the flow as follows:
$ python fractalflow.py --environment=pypi run
The --environment=pypi
option ensures that every step gets its own isolated environment.
This option is required whenever you use @pypi
or @conda
decorators.
Note that the --environment
option comes right after the Python file name, not after run
or other commands.
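For example, only the first of these two invocations is valid; the second fails because the run command does not recognize the option:
$ python fractalflow.py --environment=pypi run    # correct placement
$ python fractalflow.py run --environment=pypi    # fails: --environment must come before 'run'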
When you execute the flow for the first time, the environments need to be resolved, which can take several minutes (see Internals for more details). Subsequent runs will start much faster, as cached environments will be available.
This short clip shows a run in action, including the resulting card:
@conda
in action
The @conda
decorator works similarly to @pypi
. It supports the same python
and packages
arguments. For backwards compatibility, libraries
works as an alias for packages
.
from metaflow import FlowSpec, step, conda, card

class ScikitFlow(FlowSpec):

    @conda(python='3.9.13',
           packages={'scikit-learn': '1.3.1'})
    @card
    @step
    def start(self):
        # pylint: disable=import-error,no-member
        from sklearn.datasets import load_iris
        self.data = load_iris()
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    ScikitFlow()
Save the flow as scikitflow.py
and run it as follows:
$ python scikitflow.py --environment=conda run
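As noted above, libraries works as a backwards-compatible alias for packages, so the decorator in ScikitFlow could equivalently be written as in this sketch (packages is the preferred spelling):
@conda(python='3.9.13',
       libraries={'scikit-learn': '1.3.1'})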
You can search all available public Conda packages at anaconda.org.
By default, Conda packages are searched for in the default channel, i.e. conda-forge. However, some packages only exist in other channels.
To specify a custom channel for a particular package, use the following syntax: {channel_name}::{package_name}
.
Example: The package pytorch
exists in the pytorch
channel.
Thus, the dependency will be specified as follows:
@conda(
    python='3.9.13',
    packages={'pytorch::pytorch': '2.1.2'}
)
Using the same packages in all steps
Sometimes you may want to use the same set of packages in all steps. Repeating the same
@conda
or @pypi
line for every step would be redundant.
In this case, you can use the @conda_base
and @pypi_base
decorators that are defined
at the class level, indicating that all steps should use the same environment.
from metaflow import FlowSpec, step, conda_base

@conda_base(python='3.9.13',
            packages={'scikit-learn': '1.3.1'})
class AllScikitFlow(FlowSpec):

    @step
    def start(self):
        # pylint: disable=import-error,no-member
        from sklearn.datasets import load_iris
        self.data = load_iris()
        self.next(self.end)

    @step
    def end(self):
        print(self.data)

if __name__ == '__main__':
    AllScikitFlow()
Save the flow as allscikitflow.py
and run it as follows:
$ python allscikitflow.py --environment=conda run
Note that we access a scikit-learn
object stored in self.data
in the end
step. When accessing
package-specific objects through artifacts, a matching package must be available in every step
using the artifact. This is the case by default when using the class-level decorators.
If you tried to do the same in the original scikitflow.py
, it would fail due to a missing package
in the end
step.
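For instance, a minimal fix (a sketch based on the ScikitFlow code above) would be to repeat the @conda decorator on the end step, so that scikit-learn is available when the artifact is unpickled:
    @conda(python='3.9.13',
           packages={'scikit-learn': '1.3.1'})
    @step
    def end(self):
        print(self.data)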
Libraries in remote tasks
A major benefit of @pypi
and @conda
is that they allow you to define libraries that will be
automatically made available when you execute tasks remotely
on @kubernetes
or on @batch
. You don't need to do anything special to make this happen.
If you have @kubernetes
or @batch
configured to work with
Metaflow, you can try:
$ python fractalflow.py --environment=pypi run --with kubernetes
or --with batch
correspondingly. Everything should work exactly the same as with
local runs.
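Alternatively, you can attach the compute decorator directly in the code instead of using --with. The sketch below shows FractalFlow's plot step annotated with @kubernetes (assuming kubernetes is added to the metaflow import and configured in your deployment); the resource values are illustrative:
    @kubernetes(cpu=2, memory=4096)
    @pypi(python='3.9.13',
          packages={'pyfracgen': '0.0.11',
                    'matplotlib': '3.8.0'})
    @card(type='blank')
    @step
    def plot(self):
        ...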
Libraries in production deployments
Another major benefit of @pypi
and @conda
is that the environments they create are guaranteed
to be stable when deployed to production.
For instance, if you deployed FractalFlow
in production on Argo Workflows like this:
$ python fractalflow.py --environment=pypi argo-workflows create
all production runs would be guaranteed to always use exactly the same environment, even if new versions of the packages get released or even if old packages get removed. This is very desirable for production runs that should be maximally stable and predictable.
Disabling environments
When using --environment=conda
or --environment=pypi
all steps are executed in
isolated environments. As a result, the steps don't have access to packages that are
installed system-wide. This is desirable, as it makes the flow more reproducible: it
doesn't depend on packages that may exist only in your local environment.
However, sometimes you may need to access a system-wide package in one
step while using isolated environments in other steps. For instance, you may use
a custom Docker image in conjunction with @pypi
or @conda
, accessing packages directly from the image in a step.
To make this possible, you can set @conda(disabled=True)
or @pypi(disabled=True)
at
the step level. A step with PyPI/Conda disabled behaves as if the flow runs without
--environment
.
To demonstrate this, consider this flow, peekabooflow.py
, that prints out the path
of the Python interpreter used in each step:
import sys
from metaflow import FlowSpec, step, conda_base, conda

@conda_base(python='3.9.13')
class PeekabooFlow(FlowSpec):

    @step
    def start(self):
        print(sys.executable)
        self.next(self.peekaboo)

    @conda(disabled=True)
    @step
    def peekaboo(self):
        print(sys.executable)
        self.next(self.end)

    @step
    def end(self):
        print(sys.executable)

if __name__ == '__main__':
    PeekabooFlow()
Run the flow as usual:
$ python peekabooflow.py --environment=conda run
Notice how the path is the same in the start
and end
steps but different in the
peekaboo
step, which uses a system-wide Python installation.