Episode 5: Statistics Redux
This example revisits Episode 02-statistics: Is this Data Science?
With Metaflow, you don't need to change any code to scale up your flow by running it on remote compute. In this example we re-run the stats.R workflow, adding the --with batch command line argument. This instructs Metaflow to run all your steps on AWS Batch without any code changes. You can control the behavior with additional arguments, such as --max-workers. In this example, --max-workers limits the number of genre-specific statistics computations that run in parallel. You can then access the data artifacts (even the local CSV file) from anywhere, because the data is stored in AWS S3.
- --with batch command line option
- --max-workers command line option
- Accessing data artifacts stored in AWS S3 from a local Markdown Notebook.
Before playing this episode:
Configure your sandbox.
To play this episode:
If you haven't yet pulled the tutorials to your current working directory, you can follow the instructions here.
Rscript stats.R --package-suffixes=.R,.csv run --with batch --max-workers 4
Open 02-statistics/stats.Rmd in your RStudio and re-run the cells. You can access the artifacts stored in AWS S3 from your local RStudio session.
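For orientation, reading an artifact from a local R session might look roughly like the sketch below. It assumes the Metaflow R client classes flow_client and run_client; the flow name "MovieStatsFlow" and the artifact name "stats" are assumptions made for illustration, not taken from this page, so check stats.R and stats.Rmd for the actual names.

```r
# Sketch only: the flow name and artifact name below are assumptions.
library(metaflow)

# Look up the flow by name (assumed here to be "MovieStatsFlow").
flow <- flow_client$new("MovieStatsFlow")

# Pick the most recent run that finished successfully.
run <- run_client$new(flow, flow$latest_successful_run)

# Load one artifact by name; "stats" is a hypothetical artifact name.
stats <- run$artifact("stats")
print(head(stats))
```

Because the run was executed with --with batch, the artifact is fetched from AWS S3 rather than from local disk, which is why the same lines work from any machine configured for the same sandbox.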