Scalable Compute and Data
After you have prototyped a flow locally and iterated with it for a while, you may face questions like these: How can I test the flow with more data without running out of memory? Or, how can I make the model train faster? You could try to optimize the code to work better on your laptop, but such premature optimization is probably not the best use of your time.
Instead, you can leverage the cloud to get a bigger laptop or more laptops (virtually, not literally). This is Stage II in Metaflow development: Scaling flows with the cloud. Metaflow makes this easy - no changes to your code required - after you have done the initial legwork to configure infrastructure for Metaflow.
Supersizing Flows
Here's how Metaflow can help make your project more scalable, both technically and organizationally:
You can make your existing flow scalable just by adding a line of code, `@resources`, which lets you request more CPUs, memory, or GPUs for your steps. Or, you can parallelize processing over multiple instances, even thousands of them.
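For illustration, here is a minimal sketch of `@resources` in action. The flow name, step logic, and resource amounts are hypothetical, and the request only takes effect when the step runs on a cloud backend (e.g. `--with kubernetes` or `--with batch`):

```python
from metaflow import FlowSpec, step, resources

class BigSumFlow(FlowSpec):

    @resources(memory=32000, cpu=8)
    @step
    def start(self):
        # With a cloud backend, this step is scheduled on an instance
        # with at least 32GB of RAM and 8 CPU cores.
        self.total = sum(range(100_000_000))
        self.next(self.end)

    @step
    def end(self):
        print('total:', self.total)

if __name__ == '__main__':
    BigSumFlow()
```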
Once your project starts showing promise with realistic-size workloads, it may attract interest from colleagues too. Metaflow contains a number of features, such as namespaces, which make collaboration smoother by allowing many people to contribute without accidentally interfering with each other's work.
Easy patterns for scalable, high-performance code
There is no single magic formula for scalability. Instead of proposing a novel paradigm to make your Python code faster, Metaflow provides a set of practical, commonly used patterns that help you make your code more scalable and performant, depending on your specific needs.
The patterns fall into three categories:
Performance Optimization: You can improve the performance of your code by utilizing off-the-shelf, high-performance libraries such as XGBoost or PyTorch. Or, if you need something more custom, you can leverage the vast landscape of data tools for Python, including compilers like Numba, to speed up your code (see the first sketch after this list).
Scaling Up: One should not underestimate the horsepower of a modern large server, especially one equipped with GPUs. Before considering anything else, you can simply run a step on a beefier cloud instance. Metaflow integrates with Kubernetes, which works on all major clouds, and with AWS Batch, both of which take care of provisioning such machines on demand.
Scaling Out: Besides executing code on a single instance, Metaflow makes it easy to parallelize steps over an arbitrarily large number of instances, leveraging Kubernetes and AWS Batch, giving you access to a virtually unlimited amount of computing power (see the foreach sketch after this list). Besides many independent tasks, you can also create large compute clusters on the fly, e.g. to train large (language) models.
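As a sketch of the first pattern, the snippet below uses Numba's `@njit` decorator to compile a numerical kernel to native machine code; the kernel itself is a made-up example, not taken from Metaflow:

```python
import numpy as np
from numba import njit

@njit
def pairwise_distance_sum(points):
    # Plain Python loops, JIT-compiled to fast machine code by Numba.
    total = 0.0
    n = points.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.0
            for k in range(points.shape[1]):
                diff = points[i, k] - points[j, k]
                d += diff * diff
            total += d ** 0.5
    return total

points = np.random.rand(2000, 3)
print(pairwise_distance_sum(points))
```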
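And as a sketch of scaling out, Metaflow's foreach fans a step out over the items of a list, so each task can run as its own cloud instance (e.g. with `run --with kubernetes`). The parameter grid and the "training" logic below are placeholders:

```python
from metaflow import FlowSpec, step

class ScaleOutFlow(FlowSpec):

    @step
    def start(self):
        # Each value in this list becomes its own parallel task.
        self.alphas = [0.01, 0.1, 1.0, 10.0]
        self.next(self.train, foreach='alphas')

    @step
    def train(self):
        # self.input holds this task's item from the foreach list.
        self.alpha = self.input
        self.score = 1.0 / self.alpha  # placeholder for real training
        self.next(self.join)

    @step
    def join(self, inputs):
        # Gather results from all parallel tasks.
        self.best = max(inputs, key=lambda t: t.score).alpha
        self.next(self.end)

    @step
    def end(self):
        print('best alpha:', self.best)

if __name__ == '__main__':
    ScaleOutFlow()
```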
Often, an effective recipe for scalability is a combination of these three techniques: start with high-performance Python libraries, run them on large instances, and, if necessary, parallelize processing as widely as needed. Besides compute, Metaflow can also help you load data efficiently.
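For example, Metaflow's built-in S3 client can download many objects in parallel, which often saturates the instance's network bandwidth. The bucket and keys below are hypothetical placeholders:

```python
from metaflow import S3

# 'my-bucket' and the keys are placeholders for illustration.
with S3() as s3:
    objs = s3.get_many(['s3://my-bucket/data/part1.csv',
                        's3://my-bucket/data/part2.csv'])
    for obj in objs:
        # Each object is downloaded in parallel to a local temp file.
        print(obj.url, '->', obj.path)
```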
Whether you use Metaflow or any other system, executing at scale comes with a few extra concerns: the larger the scale, the more likely you are to hit spurious failures. You also need to make sure your execution environments, including any libraries you depend on, are available consistently wherever the code gets executed. Since these issues are a common headache in all large-scale projects, Metaflow helps manage them too.
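To illustrate, Metaflow addresses both concerns with decorators: `@retry` re-runs a task after spurious failures, and `@conda` pins a step's dependencies so they are available wherever the task executes. The flow and the pinned version below are hypothetical, and running it requires the `--environment=conda` flag:

```python
from metaflow import FlowSpec, step, retry, conda

class RobustFlow(FlowSpec):

    @retry(times=3)                        # retry transient failures up to 3 times
    @conda(libraries={'pandas': '1.5.3'})  # version pinned for illustration
    @step
    def start(self):
        import pandas as pd
        self.rows = len(pd.DataFrame({'x': [1, 2, 3]}))
        self.next(self.end)

    @step
    def end(self):
        print('rows:', self.rows)

if __name__ == '__main__':
    RobustFlow()
```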
What You Will Learn
In this section, you will learn how to make your flows handle more data and execute faster. You will also learn how to scale projects across multiple people by organizing results better. We cover five topics:
- Scaling compute in the cloud
- Dealing with failures
- Managing execution environments
- Loading and storing data efficiently
- Organizing results for smoother collaboration
Before you proceed, make sure to configure infrastructure for Metaflow or sign up for a Metaflow Sandbox.