Production Deployments
What does production mean exactly? Surely the answer depends on who you ask and what application they are working on. There are so many different ways to produce business value with machine learning and data science that there can't be a single unambiguous definition, nor a single way to deploy projects to production.
However, there are characteristics that are common to all production deployments:
- Production deployments should run without human intervention. It is not very practical to use results that require you to execute `run` on your laptop to power serious business processes or products.
- Production deployments should run reliably in a highly available manner. Results should appear predictably, even if infrastructure encounters spurious failures.
Consider the Metaflow journey: thus far, the steps have involved a human in the loop, from local development to scalable flows. In contrast, a defining feature of production deployments is that they are fully automated. We achieve this by scheduling flows to run automatically on a production-grade workflow orchestrator, so you don't need to write `run` manually to produce the desired results.
Reliably Running, Automated Flows
What about the second characteristic of production deployments - reliability? Firstly, a big benefit of Stage II is that you can test your workflows at scale and add reliability-enhancing features such as retries (see the sketch after the list below), making sure your flows can cope with production-scale workloads. Secondly, your flow needs to be orchestrated by a system that itself runs reliably, which is harder than it sounds. Such a production-grade orchestrator needs to be:
Highly available: The orchestrator itself must not crash, even if a server it runs on hits a random failure.
Highly scalable: We shouldn't have to worry about the number of flows orchestrated by the system.
Capable of triggering flows based on different conditions: We should be able to automate flow execution flexibly.
Easy to monitor and operate: To minimize the time spent on occasional human interventions.
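As a concrete example of reliability-enhancing features on the flow side, here is a minimal sketch using Metaflow's `@retry`, `@timeout`, and `@catch` decorators. The flow and step contents are hypothetical, and the parameter values are illustrative rather than recommendations:

```python
from metaflow import FlowSpec, step, retry, timeout, catch

class RobustFlow(FlowSpec):
    # Hypothetical flow: the decorators below make individual tasks
    # tolerant of spurious infrastructure failures.

    @retry(times=3)           # rerun the task automatically on transient errors
    @timeout(minutes=30)      # fail the task if it hangs instead of blocking the run
    @step
    def start(self):
        self.result = "placeholder for real work"
        self.next(self.end)

    @catch(var="end_error")   # record an exception as an artifact instead of crashing the run
    @step
    def end(self):
        print("flow finished")

if __name__ == "__main__":
    RobustFlow()
```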
Fortunately, a few systems are able to fulfill these requirements, judging by their track record. Metaflow integrates with two of them: Argo Workflows, which runs on Kubernetes, and AWS Step Functions, a managed service by AWS. As of today, Argo Workflows is the only orchestrator that supports Metaflow's powerful event-triggering functionality, which makes it a good default choice.
In addition, Metaflow integrates with a popular open-source workflow orchestrator, Apache Airflow. While Airflow has more limitations than the two aforementioned orchestrators, it is a good choice if you have many DAGs deployed on it already and you don't want to introduce a new orchestrator in your environment.
While all of these systems are quite complex under the hood, Metaflow makes using them trivial: You can deploy your flows literally with a single command - no changes in the code required. Also, this means that you can switch between schedulers easily. For instance, you can start with Apache Airflow to stay compatible with your existing data pipelines and migrate to Argo Workflows over time without having to pay any migration tax.
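For instance, assuming your flow lives in a file called myflow.py (a hypothetical file name), each deployment is a single command along these lines:

```
# Deploy to Argo Workflows (running on Kubernetes)
python myflow.py argo-workflows create

# Deploy to AWS Step Functions
python myflow.py step-functions create

# Generate an Airflow DAG file to place in your Airflow DAGs folder
python myflow.py airflow create myflow_dag.py
```

Because the flow code itself doesn't change, switching orchestrators later is a matter of rerunning one of these commands.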
Patterns of production deployments
Once flows run reliably, you can leverage the results - like freshly trained models - on various systems:
- You can use deployed workflows as building blocks to compose larger systems using event triggering (see the sketch after this list).
- You can write fresh predictions or other results to a data warehouse, e.g. to power a dashboard.
- You can populate a cache with fresh results, e.g. for a recommendation system.
- You can deploy models on a model hosting platform of your choosing, e.g. Seldon or AWS SageMaker.
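As a minimal sketch of the first pattern, the flow below starts automatically whenever another deployed flow finishes successfully, using Metaflow's `@trigger_on_finish` decorator. TrainingFlow and PublishResultsFlow are hypothetical names, and the trigger takes effect once both flows are deployed to an orchestrator that supports events, such as Argo Workflows:

```python
from metaflow import FlowSpec, step, trigger_on_finish

# Run this flow whenever a deployed flow called TrainingFlow
# (a hypothetical upstream flow) completes successfully.
@trigger_on_finish(flow="TrainingFlow")
class PublishResultsFlow(FlowSpec):

    @step
    def start(self):
        # e.g. pick up the freshly trained model produced upstream
        self.next(self.end)

    @step
    def end(self):
        print("fresh results published")

if __name__ == "__main__":
    PublishResultsFlow()
```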
The exact pattern depends on your use case. Importantly, creating these integrations becomes much easier when you can trust your flows to run reliably. We are happy to help you on the Metaflow support Slack to find a pattern that works for your needs.
To Production And Back
While the journey illustration above looks like a linear path from prototype to production, realistically the picture should be full of loops. In particular, there is constant interaction between local development and production deployments, as you inevitably troubleshoot production issues and keep working on newer versions of flows.
When it comes to troubleshooting, a hugely convenient feature is the ability to `resume` failed production runs locally. Also, remember that you can inspect the state of any production run with cards and notebooks, in real time.
When it comes to working on newer versions, how do you know if a newer version performs better than the latest production deployment? Often, the best way to answer the question is to deploy the new version to run concurrently with the existing production version and compare the results, as in an A/B test. This pattern is enabled by the `@project` decorator. Also, Metaflow tags come in handy when designing processes around production deployments.
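As a rough sketch, the `@project` decorator below attaches a flow to a named project so that multiple deployments - for example a production branch and a candidate branch for an A/B-style comparison - can coexist without interfering with each other. The project and flow names are hypothetical:

```python
from metaflow import FlowSpec, project, step

@project(name="fraud_model")   # hypothetical project name shared by related flows
class TrainingFlow(FlowSpec):

    @step
    def start(self):
        self.model = "placeholder for a freshly trained model"
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TrainingFlow()
```

With `@project` in place, deploying the same flow under different `--branch` values (e.g. `python trainingflow.py --branch candidate argo-workflows create`, with a hypothetical file name) yields isolated, concurrently running deployments whose results you can compare.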
What You Will Learn
You can get a feel for all these concepts and test them hands-on, without having to install anything locally, by signing up for a Metaflow Sandbox!
In this section, you will learn how to make your flows run automatically without any human intervention.
The basics of scheduling Metaflow flows:
- Depending on the infrastructure you have installed, pick a section below:
  - Scheduling flows with Argo Workflows - choose this if running on Kubernetes.
  - Scheduling flows with AWS Step Functions - choose for minimal operational overhead.
  - Scheduling flows with Apache Airflow - choose to stay compatible with your existing Airflow deployment.
Coordinating larger Metaflow projects is a more advanced pattern that enables multiple parallel production deployments.
Connecting flows via events shows how you can make workflows start automatically based on real-time events. This pattern allows you to build reactive systems using flows as building blocks.
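As a minimal sketch of that pattern, the flow below reacts to a named event via Metaflow's `@trigger` decorator; the event name data_updated and the flow name are hypothetical, and the trigger takes effect once the flow is deployed to Argo Workflows:

```python
from metaflow import FlowSpec, step, trigger

# Start this flow whenever an event called 'data_updated'
# (a hypothetical event name) is published.
@trigger(event="data_updated")
class ReactiveFlow(FlowSpec):

    @step
    def start(self):
        # e.g. refresh features or predictions in response to new data
        self.next(self.end)

    @step
    def end(self):
        print("reacted to the event")

if __name__ == "__main__":
    ReactiveFlow()
```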