Deploying Infrastructure for Metaflow

Use the local dev stack to explore how Metaflow integrates with underlying infrastructure. When you are ready for a production deployment, you will need to set up infrastructure in your own cloud account, as detailed on this page. For further information, see Metaflow Resources for Engineers.

Supported infrastructure components

Since modern data science / ML applications are powered by a number of interconnected systems, it is useful to organize them as an infrastructure stack like the one illustrated below (Why? See here). You can see logos of all supported systems which you can use to enable each layer.

The table below explains the five major deployment options for Metaflow and what components of the stack are supported in each. You can choose to deploy Metaflow on:

Only local environment - just pip install metaflow on any workstation.
AWS, either on EKS as a Kubernetes platform or using AWS-managed services.
Azure, on AKS as a Kubernetes platform.
Google Cloud, on GKE as a Kubernetes platform.
Any Kubernetes cluster, including on-premises deployments.

| Layer | Component | Description | Only Local | AWS | Azure | GCP | K8s |
|---|---|---|---|---|---|---|---|
| Modeling | Python libraries | Any Python libraries | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| Deployment | Argo Workflows | Open-source production-grade workflow orchestrator | | 🟢 | 🟢 | 🟢 | 🟢 |
| Deployment | Step Functions | AWS-managed production-grade workflow orchestrator | | 🟢 | | | |
| Deployment | Apache Airflow | Popular open-source workflow orchestrator | | 🟢 | 🟢 | 🟢 | 🟢 |
| Versioning | Local Metadata | Metaflow's tracking in local files | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| Versioning | Metadata Service | Metaflow's tracking in a central database | | 🟢 | 🟢 | 🟢 | 🟢 |
| Orchestration | Local Orchestrator | Metaflow's local workflow orchestrator | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| Compute | Local Processes | Metaflow tasks as local processes | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| Compute | AWS Batch | AWS-managed batch compute service | | 🟢 | | | |
| Compute | Kubernetes | Open-source batch compute platform | | 🟢 | 🟢 | 🟢 | 🟢 |
| Data | Local Datastore | Metaflow artifacts in local files | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 |
| Data | AWS S3 | Metaflow artifacts in AWS-managed storage | | 🟢 | | | 🟢 |
| Data | Azure Blob Storage | Metaflow artifacts in Azure-managed storage | | | 🟢 | | 🟢 |
| Data | Google Cloud Storage | Metaflow artifacts in Google-managed storage | | | | 🟢 | 🟢 |

Note that fast prototyping with the Local Orchestrator is supported in all of these options. The local-only option, however, supports neither scaling out to an external compute layer nor production-grade deployments.
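To make the table easier to query, here is a minimal Python sketch that transcribes the support matrix into a dictionary and looks up which deployment options cover a given set of components. The names `SUPPORT` and `options_supporting` are made up for this illustration and are not part of Metaflow.

```python
# Support matrix transcribed from the table above.
# Keys are component names; values are the deployment options that support them.
OPTIONS = ("Only Local", "AWS", "Azure", "GCP", "K8s")
ALL = set(OPTIONS)
NON_LOCAL = {"AWS", "Azure", "GCP", "K8s"}

SUPPORT = {
    "Python libraries": ALL,
    "Argo Workflows": NON_LOCAL,
    "Step Functions": {"AWS"},
    "Apache Airflow": NON_LOCAL,
    "Local Metadata": ALL,
    "Metadata Service": NON_LOCAL,
    "Local Orchestrator": ALL,
    "Local Processes": ALL,
    "AWS Batch": {"AWS"},
    "Kubernetes": NON_LOCAL,
    "Local Datastore": ALL,
    "AWS S3": {"AWS", "K8s"},
    "Azure Blob Storage": {"Azure", "K8s"},
    "Google Cloud Storage": {"GCP", "K8s"},
}


def options_supporting(*components):
    """Return the deployment options (in table order) that support every named component."""
    unknown = [c for c in components if c not in SUPPORT]
    if unknown:
        raise KeyError(f"unknown component(s): {unknown}")
    return [o for o in OPTIONS if all(o in SUPPORT[c] for c in components)]


if __name__ == "__main__":
    print(options_supporting("Argo Workflows", "Azure Blob Storage"))
    print(options_supporting("Step Functions", "AWS Batch"))
```

For example, asking for Argo Workflows together with Azure Blob Storage narrows the choice to the Azure and generic Kubernetes options, while Step Functions with AWS Batch is available only on AWS.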

You can test the AWS/Azure/GCP/Kubernetes stack easily in your browser for free by signing up for a Metaflow Sandbox.

Example stacks

Here are some typical deployments that we have seen in action:

Local: Effortless prototyping

Just pip install metaflow to deploy this stack

This is the stack you get by default when you install Metaflow locally. Its main benefit is zero configuration and maintenance: it works out of the box. It is a great way to get started with Metaflow.

When you want to collaborate with multiple people (which requires a central metadata service), run larger-scale workloads, or deploy your workflows so that they run even when your laptop is asleep, look into the more fully featured stacks below.

Low-maintenance scalable prototyping, powered by AWS

Click here to deploy this stack

If you are looking for the easiest and the most affordable way to scale out compute to the cloud, including cloud-based GPUs, this stack is a great option. Consider the benefits:

Artifacts are stored in AWS S3, so you don't have to worry about running out of storage or losing data.
Scalability is managed by AWS Batch, which requires no maintenance after the initial setup.
AWS Batch is very cost-effective: you pay only for the EC2 instance time you use, billed by the second, with no additional costs. To reduce the cost of compute even further, you can leverage spot instances.

The main missing piece in this stack is a highly available workflow orchestrator, which you can easily add by upgrading to the option below. Also, larger teams with more involved compute needs may find AWS Batch limiting, in which case you can look into Kubernetes-based stacks.

Low-maintenance full stack, powered by AWS

Click here to deploy this stack

If you need the full stack of data science/ML infrastructure but want to spend minimal effort setting it up and managing it, choose this option. You get all the benefits of AWS Batch as described above, as well as production deployments on AWS Step Functions, a highly available, scalable workflow orchestrator managed by AWS. Metaflow tracks everything in a central metadata service, making collaboration straightforward.

Here are the main reasons for not using this stack:

You want to use another cloud besides AWS.
You need a more customizable workflow orchestrator and compute platform than the AWS-managed services can provide.
You need event triggering, which this stack doesn't support. If this feature is important to you, consider using one of the Kubernetes-based stacks.

Customizable full stack on AWS, powered by Kubernetes

Click here to deploy this stack

If your engineering team has prior experience with Kubernetes, they might prefer a familiar stack that works with their existing security policies, observability tools, and deployment mechanisms. In this case, this Kubernetes-native stack featuring compute on Kubernetes and deployments on reliable, scalable, open-source Argo Workflows is a good option.

This stack can be easily deployed on EKS on AWS, leveraging S3 as the datastore. Alternatively, some companies run this stack on-premises using MinIO as an S3-compatible datastore.

This stack requires more maintenance than the AWS-native stack above, although the basic setup is quite manageable if your organization is already familiar with Kubernetes.

Customizable full stack on Azure, powered by Kubernetes

Click here to deploy this stack

If you need a full-stack DS/ML platform on Azure, this Kubernetes-based stack is a good option. It is the same stack as the one running on EKS on AWS, with the S3-based datastore replaced with Azure Blob Storage.

This stack incurs the typical maintenance overhead of an AKS-based Kubernetes cluster, which shouldn't add much burden if your organization uses AKS already.

Customizable full stack on Google Cloud, powered by Kubernetes

Click here to deploy this stack

If you need a full-stack DS/ML platform on Google Cloud, this Kubernetes-based stack is a good option. It is the same stack as the one running on EKS on AWS, with the S3-based datastore replaced with Google Cloud Storage.

This stack incurs the typical maintenance overhead of a GKE-based Kubernetes cluster, which shouldn't add much burden if your organization uses GKE already.
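The three Kubernetes-based stacks above are described as the same stack with a different datastore. The sketch below illustrates that idea by listing, per cloud, the datastore-related Metaflow settings that would change. It assumes the configuration keys `METAFLOW_DEFAULT_DATASTORE` and `METAFLOW_DATASTORE_SYSROOT_*` as they are commonly used in Metaflow configurations; verify them against the Metaflow version you deploy. The bucket and container names are placeholders, and the helpers `datastore_env` and `changed_keys` are hypothetical names used only for this illustration. A real deployment needs further settings (credentials, service endpoints) that are not shown here.

```python
# Assumed datastore settings per cloud; all values are placeholders.
DATASTORE_SETTINGS = {
    "aws": {
        "METAFLOW_DEFAULT_DATASTORE": "s3",
        "METAFLOW_DATASTORE_SYSROOT_S3": "s3://example-bucket/metaflow",
    },
    "azure": {
        "METAFLOW_DEFAULT_DATASTORE": "azure",
        "METAFLOW_DATASTORE_SYSROOT_AZURE": "example-container/metaflow",
    },
    "gcp": {
        "METAFLOW_DEFAULT_DATASTORE": "gs",
        "METAFLOW_DATASTORE_SYSROOT_GS": "gs://example-bucket/metaflow",
    },
}


def datastore_env(cloud):
    """Return a copy of the assumed datastore settings for one cloud."""
    if cloud not in DATASTORE_SETTINGS:
        raise ValueError(f"unsupported cloud: {cloud!r}")
    return dict(DATASTORE_SETTINGS[cloud])


def changed_keys(cloud_a, cloud_b):
    """Return the sorted setting names that differ between two clouds."""
    a, b = datastore_env(cloud_a), datastore_env(cloud_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


if __name__ == "__main__":
    for name, value in datastore_env("azure").items():
        print(f"{name}={value}")
    print(changed_keys("aws", "gcp"))
```

Under these assumptions, moving the stack from AWS to Google Cloud changes only the default datastore type and its root location, while the compute and orchestration layers (Kubernetes and Argo Workflows) stay the same.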
