Episode 6: Statistics Redux: Computing in the Cloud.
This example revisits 'Episode 02-statistics: Is this Data Science?'. With Metaflow, you don't need to make any code changes to scale up your flow by running on remote compute. In this example, we re-run the `stats.py` workflow, adding the `--with batch` command-line argument. This instructs Metaflow to run all your steps on AWS Batch without changing any code. You can control the behavior with additional arguments, like `--max-workers`. For this example, `--max-workers` is used to limit the number of parallel genre-specific statistics computations. You can then access the data artifacts (even the local CSV file) from anywhere, because the data is stored in AWS S3.
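Under the hood, `--with batch` is the command-line equivalent of attaching Metaflow's `@batch` decorator to every step. As a point of reference, here is a minimal sketch of the in-code form on a simplified stand-in flow (`BatchDemoFlow` is illustrative only, not the tutorial's `stats.py`):

```python
from metaflow import FlowSpec, batch, step


class BatchDemoFlow(FlowSpec):
    """Illustrative stand-in flow; stats.py in Episode 02 is more involved."""

    @batch(cpu=1, memory=4000)  # run this step on AWS Batch with these resources
    @step
    def start(self):
        print("This step executes remotely on AWS Batch.")
        self.next(self.end)

    @step
    def end(self):
        # Steps without @batch run locally, unless you pass --with batch,
        # which applies the decorator to every step at launch time.
        print("Done.")


if __name__ == "__main__":
    BatchDemoFlow()
```

Passing `--with batch` on the command line applies the same decorator to all steps at once, which is why the tutorial needs no code changes.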
This tutorial uses `pandas`, which may not be available in your environment. Use the `conda` package manager with the conda-forge channel added to run this tutorial in any environment.
You can find the tutorial code on GitHub.
Showcasing:
- `--with batch` command-line option
- `--max-workers` command-line option
- Accessing data locally or remotely (see the client API sketch after this list)
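Because run data is persisted in S3, any machine with the Metaflow client library and the right AWS credentials can read the artifacts. Here is a minimal sketch; it assumes the Episode 02 flow is named `MovieStatsFlow` and stores its per-genre results in a `genre_stats` artifact, so adjust the names if your flow differs:

```python
from metaflow import Flow, get_metadata

# Show which metadata provider the client is using (local or remote service).
print("Current metadata provider: %s" % get_metadata())

# Assumed names: 'MovieStatsFlow' and its 'genre_stats' artifact come from
# Episode 02; substitute your own flow and artifact names if they differ.
run = Flow("MovieStatsFlow").latest_successful_run
print("Using run: %s" % str(run))

# Artifacts are fetched transparently from S3, so this works from a laptop,
# a SageMaker notebook, or any other machine with access to the datastore.
genre_stats = run.data.genre_stats
for genre, stats in genre_stats.items():
    print(genre, stats)
```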
Before playing this episode:
- `python -m pip install pandas`
- `python -m pip install notebook`
- `python -m pip install matplotlib`
- This tutorial requires access to compute and storage resources on AWS, which can be configured by following the Metaflow on AWS setup instructions in the Metaflow documentation.
- This tutorial requires the `conda` package manager to be installed with the conda-forge channel added:
  - Download Miniconda at https://docs.conda.io/en/latest/miniconda.html
  - `conda config --add channels conda-forge`
To play this episode:
- `cd metaflow-tutorials`
- `python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}"` (the conda options are explained in the sketch after this list)
- `jupyter-notebook 06-statistics-redux/stats.ipynb`
- Open stats.ipynb in your remote SageMaker notebook
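The `--environment conda` and `--with conda:...` options above pin the execution environment from the command line. Metaflow also lets you pin the environment in code with its conda decorators; the following is a minimal sketch on a hypothetical stand-in flow (`CondaDemoFlow` is not part of the tutorial), mirroring the versions pinned above:

```python
from metaflow import FlowSpec, conda_base, step


# Pin the Python version and libraries for every step of the flow,
# mirroring --with conda:python=3.7,libraries="{pandas:0.24.2}" above.
@conda_base(python="3.7", libraries={"pandas": "0.24.2"})
class CondaDemoFlow(FlowSpec):
    """Hypothetical stand-in flow, not the tutorial's stats.py."""

    @step
    def start(self):
        import pandas as pd  # resolved inside the pinned conda environment
        print("pandas version:", pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        print("Done.")


if __name__ == "__main__":
    CondaDemoFlow()
```

You would still launch such a flow with `--environment conda` (for example, `python conda_demo.py --environment conda run`), but the dependency pins now live alongside the flow code.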
caution
Note for Python 2.7 users: when opening stats.ipynb in a SageMaker notebook, you will need to change the Python kernel by clicking Kernel -> Change Kernel -> conda_python2 from the pull-down menu. This ensures the pandas DataFrame will deserialize correctly.