Skip to main content

Testing Philosophy

Watch this talk for motivation: Autonomous Testing and the Future of Software Development by Will Wilson.

Metaflow Test Suite

The integration test harness for the core Metaflow at test/core, generates and executes synthetic Metaflow flows, exercising all aspects of Metaflow. The test suite is executed using tox as configured in tox.ini. You can run the tests by hand using pytest or as described below.

What happens when you execute python run? The execution involves multiple layers of the Metaflow stack. The stack looks like following, starting from the most fundamental layer all the way to the user interface:

  1. Python interpreter (python2, python3)
  2. Metaflow core (,, datastore, etc.)
  3. Metaflow plugins (@timeout, @catch, etc.)
  4. User-defined graph
  5. User-defined step functions
  6. User interface (, metaflow.client)

We could write unit tests for functions in the layers 2, 3, and 6, which would capture some bugs. However, a much larger superset of bugs is caused by unintended interactions across the layers. For instance, exceptions caught by the @catch tag (3) inside a deeply nested foreach graph (4) might not be returned correctly in the client API (6) when using Python 3 (1).

The integration test harness included in the core directory tries to surface bugs like this by generating test cases automatically using specifications provided by the developer.


The test harness allows you to customize behavior in four ways that correspond to the layers above:

  1. You define the execution environment, including environment variables, the version of the Python interpreter, and the type of datastore used as contexts in contexts.json (layers 1 and 2).
  2. You define the step functions, the decorators used, and the expected results as MetaflowTest templates, stored in the tests directory (layers 3 and 5).
  3. You define various graphs that match the step functions as simple JSON descriptions of the graph structure, stored in the graphs directory (layer 4).
  4. You define various ways to check the results that correspond to the different user interfaces of Metaflow as MetaflowCheck classes, stored in the metaflow_test directory (layer 6). You can customize which checkers get used in which contexts in context.json.

The test harness takes all contexts, graphs, tests, and checkers and generates a test flow for every combination of them, unless you explicitly set constraints on what combinations are allowed. The test flows are then executed, optionally in parallel, and results are collected and summarized.


Contexts are defined in contexts.json. The file should be pretty self-explanatory. Most likely you do not need to edit the file unless you are adding tests for a new command-line argument.

Note that some contexts have disabled: true. These contexts are not executed by default when tests are run by a CI system. You can enable them on the command line for local testing, as shown below.


Take a look at tests/ This test verifies that artifacts defined in the first step are available in all steps downstream. You can use this simple test as a template for new tests.

Your test class should derive from MetaflowTest. The class variable PRIORITY denotes how fundamental the exercised functionality is to Metaflow. The tests are executed in the ascending order of priority, to make sure that foundations are solid before proceeding to more sophisticated cases.

The step functions are decorated with the @steps decorator. Note that in contrast to normal Metaflow flows, these functions can be applied to multiple steps in a graph. A core idea behind this test harness is to decouple graphs from step functions, so various combinations can be tested automatically. Hence, you need to provide step functions that can be applied to various step types.

The @steps decorator takes two arguments. The first argument is an integer that defines the order of precedence between multiple steps functions, in case multiple step function templates match. A typical pattern is to provide a specific function for a specific step type, such as joins and give it a precedence of 0. Then another catch-all can be defined with @steps(2, ['all']). As the result, the special function is applied to joins and the catch-all function for all other steps.

The second argument gives a list of qualifiers specifying which types of steps this function can be applied to. There is a set of built-in qualifiers: all, start, end, join, linear which match to the corresponding step types. In addition to these built-in qualifiers, graphs can specify any custom qualifiers.

By specifying required=True as a keyword argument to @steps, you can require that a certain step function needs to be used in combination with a graph to produce a valid test case. By creating a custom qualifier and setting required=True you can control how tests get matched to graphs.

In general, it is beneficial to write test cases that do not specify overly restrictive qualifiers and required=True. This way you cast a wide net to catch bugs with many generated test cases. However, if the test is slow to execute and/or does not benefit from a large number of matching graphs, it is a good idea to make it more specific.


The test case is not very useful unless it verifies its results. There are two ways to assert that the test behaves as expected.

You can use a function assert_equals(expected, got) inside step functions to confirm that data inside the step functions is valid. Secondly, you can define a method check_results(self, flow, checker) in your test class, which verifies the stored results after the flow has been executed successfully.


checker.assert_artifact(step_name, artifact_name, expected_value)

to assert that steps contain the expected data artifacts.

Take a look at existing test cases in the tests directory to get an idea how this works in practice.


Graphs are simple JSON representations of directed graphs. They list every step in a graph and transitions between them. Every step can have an optional list of custom qualifiers, as described above.

You can take a look at the existing graphs in the graphs directory to get an idea of the syntax.


Currently, the test harness exercises two types of user interfaces: The command-line interface, defined in, and the Python API, defined in

Currently, you can use these checkers to assert values of data artifacts or log output. If you want to add tests for new types of functionality in the CLI and/or the Python API, you should add a new method in the MetaflowCheck base class and corresponding implementations in and If certain functionality is only available in one of the interfaces, you can provide a stub implementation returning True in the other checker class.


The test harness is executed by running By default, it executes all valid combinations of contexts, tests, graphs, and checkers. This mode is suitable for automated tests run by a CI system.

When testing locally, it is recommended to run the test suite as follows:

cd metaflow/test/core
PYTHONPATH=`pwd`/../../ python --debug --contexts dev-local

This uses only the dev_local context, which does not depend on any over-the-network communication like --metadata=service or --datastore=s3. The --debug flag makes the harness fail fast when the first test case fails. The default mode is to run all test cases and summarize all failures in the end.

You can run a single test case as follows:

cd metaflow/test/core
PYTHONPATH=`pwd`/../../ python --debug --contexts dev-local --graphs single-linear-step --tests BasicArtifactTest

This chooses a single context, a single graph, and a single test. If you are developing a new test, this is the fastest way to test the test.

Coverage report

The test harness uses the coverage package in Python to produce a test coverage report. By default, you can find a comprehensive test coverage report in the coverage directory after the test harness has finished.

After you have developed a new feature in Metaflow, use the line-by-line coverage report to confirm that all lines related the new feature are touched by the tests.