Skip to main content

Deploying Infrastructure for Metaflow

While you can get started with Metaflow easily on your laptop, the main benefits of Metaflow lie in its ability to scale out to external compute clusters and to deploy to production-grade workflow orchestrators. To benefit from these features, you need to configure Metaflow and the infrastructure behind it appropriately. A separate guide, Metaflow Resources for Engineers covers everything related to such deployments. This page provides a quick overview.

Supported infrastructure components

Since modern data science / ML applications are powered by a number of interconnected systems, it is useful to organize them as an infrastructure stack like the one illustrated below (Why? See here). You can see logos of all supported systems which you can use to enable each layer.

Consider this illustration as a menu that allows you to build your own pizza: You get to customize your own crust, sauce, toppings, and cheese. You can make the choices based on your existing business infrastructure and the requirements and preferences of your organization. Fortunately, Metaflow provides a consistent API for all these combinations, so you can even change the choices later without having to rewrite your flows.

The table below explains the four major deployment options for Metaflow and what components of the stack are supported in each. You can choose to deploy Metaflow on:

  1. Only local environment - just pip install metaflow on any workstation.
  2. AWS either on EKS as a Kubernetes platform or using AWS-managed services.
  3. Azure on AKS as a Kubernetes platform,
  4. Any Kubernetes cluster including on-premise deployments.
LayerComponentDescriptionOnly LocalAWSAzureK8s
Modeling Python librariesAny Python libraries🟢🟢🟢🟢
Deployment Argo WorkflowsOpen-source production-grade workflow orchestrator🟢🟢🟢
Deployment Step FunctionsAWS-managed production-grade workflow orchestrator🟢
Versioning Local MetadataMetaflow's tracking in local files🟢🟢🟢🟢
Versioning Metadata ServiceMetaflow's tracking in a central database🟢🟢🟢
Orchestration Local OrchestratorMetaflow's local workflow orchestrator🟢🟢🟢🟢
Compute Local ProcessesMetaflow tasks as local processes🟢🟢🟢🟢
Compute AWS BatchAWS-managed batch compute service🟢
Compute KubernetesOpen-source batch compute platform🟢🟢🟢
Data Local DatastoreMetaflow artifacts in local files🟢🟢🟢🟢
Data AWS S3Metaflow artifacts in AWS-managed storage🟢🟢
Data Azure Blob StorageMetaflow artifacts in Azure-managed storage🟢🟢

Note that fast prototyping with the Local Orchestrator is supported in all these options, but the only local option doesn't support scalability with an external compute layer, nor production-grade deployments.

info

You can test the AWS/Azure/Kubernetes stack easily in your browser for free by signing up for a Metaflow Sandbox.

Example stacks

Here are some typical deployments that we have seen in action:

Local: Effortless prototyping

Just pip install metaflow to deploy this stack

This is the stack you get by default when you install Metaflow locally. It's main benefit is zero configuration and maintenance - it works out of the box. It is a great way to get started with Metaflow.

When you want to start collaborating with multiple people which requires a central metadata service, or you want to start running larger-scale workloads, or you want to deploy your workflows so that they run even when your laptop is asleep, look into more featureful stacks below.

Low-maintenance scalable prototyping, powered by AWS

Click here to deploy this stack

If you are looking for the easiest and the most affordable way to scale out compute to the cloud, including cloud-based GPUs, this stack is a great option. Consider the benefits:

  • Artifacts are stored in AWS S3, so you don't have to worry about running out of storage or losing data.
  • Scalability is managed by AWS Batch which requires no maintenance after the initial setup.
  • AWS Batch is very cost-effective: You pay only for the EC2 instance time used by second, with no additional costs. To reduce the cost of compute even further, you can leverage spot instances.

In this stack, the main missing piece is a highly-available workflow orchestrator which you can easily add by upgrading to the option below. Also, larger teams with more involved compute needs may find AWS Batch limiting, in which case you can look into Kubernetes-based stacks.

Low-maintenance full stack, powered by AWS

Click here to deploy this stack

If you need the full stack of data science/ML infrastructure but want to spend a minimal amount of effort to set up and manage it, choose this option. You get all the benefits of AWS Batch as described above, as well as production deployments on AWS Step Functions which is a highly-available, scalable workflow orchestrator managed by AWS. Metaflow tracks everything in a central metadata service, making collaboration straightforward.

Here are the main reasons for not using this stack:

  • You want to use another cloud besides AWS.
  • You need a more customizable workflow orchestrator and a compute platform than what the AWS-managed services can provide.

Customizable full stack on AWS, powered by Kubernetes

Click here to deploy this stack

If your engineering team has prior experience with Kubernetes, they might prefer a familiar stack that works with their existing security policies, observability tools, and deployment mechanisms. In this case, this Kubernetes-native stack featuring compute on Kubernetes and deployments on reliable, scalable, open-source Argo Workflows is a good option.

This stack can be easily deployed on EKS on AWS, leveraging S3 as the datastore. Alternatively, some companies run this stack on-premise using Minio as an S3-compatible datastore.

This stack requires more maintenance than the AWS-native stack above, although the basic setup is quite manageable if your organization is already familiar with Kubernetes.

Customizable full stack on Azure, powered by Kubernetes

Click here to deploy this stack

If you need a full-stack DS/ML platform on Azure, this Kubernetes-based stack is a good option. It is the same stack as the one running on EKS on AWS, with the S3-based datastore replaced with Azure Blob Storage.

This stack incurs a typical maintenance overhead of an AKS-based Kubernetes cluster, which shouldn't add much burden if your organization uses AKS already.


If you are unsure about the stacks, just run pip install metaflow to install the local stack and move on to the tutorials. Flows you create will work without changes on any of these stacks.