Episode 6: Statistics Redux

Computing in the Cloud.

This episode revisits Episode 02-statistics: Is this Data Science?. With Metaflow, you don't need to make any code changes to scale up your flow by running it on remote compute. In this example we re-run the stats.py workflow with the --with batch command line argument, which instructs Metaflow to run all of your steps on AWS Batch without changing any code. You can control the behavior with additional arguments, such as --max-workers; in this example it limits the number of genre-specific statistics computations that run in parallel. You can then access the data artifacts (even the local CSV file) from anywhere, because the data is stored in AWS S3.
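For instance, once the run has finished, its artifacts can be inspected from any machine using the Metaflow client API. Below is a minimal sketch, assuming the Episode 02 flow is named MovieStatsFlow and stores a genre_stats artifact, as in stats.py:

    # Minimal sketch: read artifacts from the latest successful run with the
    # Metaflow client API. Assumes the Episode 02 flow is named MovieStatsFlow
    # and that it stores a 'genre_stats' artifact, as in stats.py.
    from metaflow import Flow, get_metadata

    print("Current metadata provider: %s" % get_metadata())

    run = Flow('MovieStatsFlow').latest_successful_run
    print("Using run: %s" % str(run))

    # Artifacts are fetched from the datastore (AWS S3 for remote runs), so this
    # works the same on your laptop or in a SageMaker notebook.
    genre_stats = run.data.genre_stats
    print(genre_stats)

This is essentially what the stats.ipynb notebook in this episode does when it loads the results of the remote run.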

This tutorial uses pandas, which may not be available in your environment. Use the conda package manager with the conda-forge channel added to run this tutorial in any environment.

You can find the tutorial code on GitHub.

Showcasing:

  • --with batch command line option

  • --max-workers command line option

  • Accessing data locally or remotely

Before playing this episode:

  1. python -m pip install pandas

  2. python -m pip install notebook

  3. python -m pip install matplotlib

  4. This tutorial requires access to compute and storage resources on AWS, which can be configured by

    1. Following the instructions here or

    2. Requesting a sandbox.

  5. This tutorial requires the conda package manager to be installed with the conda-forge channel added.

    1. conda config --add channels conda-forge

To play this episode:

  1. cd metaflow-tutorials

  2. python 02-statistics/stats.py --environment conda run --with batch --max-workers 4 --with conda:python=3.7,libraries="{pandas:0.24.2}"

  3. jupyter-notebook 06-statistics-redux/stats.ipynb

  4. Open stats.ipynb in your remote SageMaker notebook.

Note for Python 2.7 users: when opening stats.ipynb in a SageMaker notebook you will need to change the Python kernel by selecting Kernel -> Change Kernel -> conda_python2 from the pull-down menu. This ensures the pandas DataFrame will deserialize correctly.