Episode 6: Statistics Redux
Computing in the Cloud.
This example revisits Episode 02-statistics: Is this Data Science? With Metaflow, you don't need to make any code changes to scale up your flow by running on remote compute. In this example, we re-run the stats.py workflow, adding the --with kubernetes command line argument. This instructs Metaflow to run all your steps in the cloud without changing any code. You can control the behavior with additional arguments, like --max-workers. For this example, --max-workers is used to limit the number of parallel genre-specific statistics computations. You can then access the data artifacts (even the local CSV file) from anywhere because the data is being stored in the cloud-based datastore.
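As a rough illustration of accessing the data artifacts from anywhere, the sketch below reads one artifact from the most recent successful run through the Metaflow Client API. The helper name latest_artifact is hypothetical, and the flow name MovieStatsFlow and artifact name genre_stats mentioned afterwards are assumptions about what stats.py defines; they are not stated on this page. It also assumes Metaflow is installed and configured to reach the same cloud datastore the run used.

```python
def latest_artifact(flow_name, artifact_name):
    """Return one data artifact from the latest successful run of a flow.

    flow_name and artifact_name are supplied by the caller; this helper
    does not know what stats.py actually defines.
    """
    # Imported lazily so the helper can be defined even where Metaflow
    # is not installed; calling it does require Metaflow.
    from metaflow import Flow

    run = Flow(flow_name).latest_successful_run
    if run is None:
        raise RuntimeError("No successful run found for flow %r" % flow_name)

    # run.data exposes the artifacts stored by the run as attributes.
    return getattr(run.data, artifact_name)
```

Under those assumptions, calling latest_artifact("MovieStatsFlow", "genre_stats") from a laptop or a notebook would return the same object regardless of where the steps executed, since the artifact is read from the shared datastore rather than from local disk.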
You can find the tutorial code on GitHub
Showcasing:
- --with kubernetes command line option
- --max-workers command line option
- Accessing data locally or remotely
Before playing this episode:
- python -m pip install notebook
- python -m pip install matplotlib
- This tutorial requires access to compute and storage resources in the cloud, which can be configured by
To play this episode:
- cd metaflow-tutorials
- python 02-statistics/stats.py run --with kubernetes --max-workers 4
- jupyter-notebook 06-statistics-redux/stats.ipynb
- Open stats.ipynb in your notebook
Note for Python 2.7 users: when opening stats.ipynb in a SageMaker notebook, you will need to change the Python kernel by clicking Kernel -> Change Kernel -> conda_python2 in the pull-down menu. This ensures the Pandas dataframe will deserialize correctly.
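
If you want to launch the run command from the steps above out of a script, for example to try different --max-workers values, a minimal sketch could look like the following. The function names build_run_command and run_flow are hypothetical helpers introduced only for illustration; the command itself is the one shown in this episode, and the script is assumed to be started from the metaflow-tutorials directory.

```python
import shlex
import subprocess


def build_run_command(max_workers=4, script="02-statistics/stats.py"):
    """Assemble the episode's run command as an argument list."""
    if max_workers < 1:
        raise ValueError("max_workers must be a positive integer")
    return [
        "python", script, "run",
        "--with", "kubernetes",             # run every step on remote compute
        "--max-workers", str(max_workers),  # cap on parallel tasks
    ]


def run_flow(max_workers=4):
    """Print and execute the command; raises if the run fails."""
    cmd = build_run_command(max_workers)
    print("Running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and check=True surfaces a failed run as an exception instead of a silently ignored exit code.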