Structuring Projects

This page describes how to arrange the files in your projects so that they follow software development best practices, which also makes remote execution easy.

Separating code to modules

Let's start with a simple case. As your projects grow, it is a good idea to structure business logic and modeling code as separate Python modules, i.e. separate files, instead of packing everything in a single flow file.

Code that is logically structured as bite-sized modules is easy to comprehend, debug, and test using standard Python testing tools like unittest. Importantly, modules can be shared across flows, allowing you to create reusable libraries of shared functionality.

To see this in action, create a Python file called teatime.py with the following contents:

from datetime import datetime, timezone

def is_tea_time():
    # True when the current UTC hour is 16 or 17
    return 15 < datetime.now(timezone.utc).hour < 18

Create a separate file, teaflow.py, with the following contents:

from metaflow import FlowSpec, step

class TeaFlow(FlowSpec):

    @step
    def start(self):
        import teatime
        self.tea_time = teatime.is_tea_time()
        self.next(self.end)

    @step
    def end(self):
        print(f'is it tea time? {self.tea_time}')

if __name__ == '__main__':
    TeaFlow()

Note how we import teatime in the start step to use the module. You can run the flow as usual:

python teaflow.py run

The teatime.py module works out of the box. If you have remote execution set up, you can run the code --with batch or --with kubernetes and it works equally well!

Remote execution works because Metaflow automatically packages all .py files in the flow file's directory and its subdirectories. You can see which files are included by executing:

python teaflow.py package list
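
Because teatime.py is a plain Python module, its logic can be tested without running a flow. The following is a minimal sketch using unittest. It assumes a hypothetical variant of is_tea_time that accepts an optional now argument (not part of the module above), so that the tests can supply a fixed time instead of depending on the clock:

```python
import unittest
from datetime import datetime, timezone


def is_tea_time(now=None):
    # Hypothetical variant of teatime.is_tea_time: the optional `now`
    # argument is an assumption added here to make the function testable.
    if now is None:
        now = datetime.now(timezone.utc)
    return 15 < now.hour < 18


class TeaTimeTest(unittest.TestCase):
    def test_afternoon_is_tea_time(self):
        self.assertTrue(is_tea_time(datetime(2024, 1, 1, 16, 30, tzinfo=timezone.utc)))

    def test_morning_is_not_tea_time(self):
        self.assertFalse(is_tea_time(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)))

    def test_bounds_are_exclusive(self):
        # Hours 15 and 18 fall outside the strict inequality.
        self.assertFalse(is_tea_time(datetime(2024, 1, 1, 15, 59, tzinfo=timezone.utc)))
        self.assertFalse(is_tea_time(datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)))


def run_tests():
    # Load and run the test case programmatically; returns a TestResult.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TeaTimeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project, the test class would more likely live in its own file, import teatime, and be run with python -m unittest.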

Separating code to packages

A Python package is a library that consists of multiple modules stored in a common directory. These work out of the box with Metaflow too.

To test a local package, crumpet, create a directory structure with the following files:

crumpetflow.py
crumpet/__init__.py
crumpet/teatime.py
crumpet/raisin.py

Here, __init__.py should be an empty file. It tells Python that crumpet should be treated as a package. You can copy the teatime.py module from the example above.

The raisin.py file should look as follows:

import random

def is_dry():
    return random.random() > 0.5

and crumpetflow.py like this:

from metaflow import FlowSpec, step

class CrumpetFlow(FlowSpec):

    @step
    def start(self):
        from crumpet import teatime, raisin
        self.tea_time = teatime.is_tea_time()
        self.is_dry = raisin.is_dry()
        self.next(self.end)

    @step
    def end(self):
        print(f'is it tea time? {self.tea_time}')
        print(f'are raisins dry? {self.is_dry}')

if __name__ == '__main__':
    CrumpetFlow()

You can execute the flow as usual:

python crumpetflow.py run

As before, you can run the code remotely --with kubernetes or --with batch and it works without you having to worry about packaging and installing the crumpet package manually.

Common packages shared across flows

You could add other flows next to crumpetflow.py in the above directory hierarchy and they would all share the common crumpet package.

As a project grows, it may become desirable to place each flow in a subdirectory of its own, so that each person or team can manage their files independently. All of them may share one or more common packages.

For instance, we can have two flows, crumpetflow and teatimeflow, as independent subdirectories with their own READMEs, as well as a shared crumpet package:

crumpetflow/flow.py
crumpetflow/README.md
teatimeflow/flow.py
teatimeflow/README.md
crumpet/__init__.py
crumpet/teatime.py
crumpet/raisin.py

Unfortunately, this wouldn't work out of the box, since Metaflow packages only the files under the crumpetflow and teatimeflow directories, ignoring crumpet by default.

The solution is to include a symbolic link (symlink) in each flow directory, pointing at the common package that should be included. You can create a symlink as follows, repeating the commands in each flow directory:

cd crumpetflow
ln -s ../crumpet .

With symlinks included the hierarchy looks like this:

crumpetflow/flow.py
crumpetflow/README.md
crumpetflow/crumpet -> ../crumpet
teatimeflow/flow.py
teatimeflow/README.md
teatimeflow/crumpet -> ../crumpet
crumpet/__init__.py
crumpet/teatime.py
crumpet/raisin.py
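
If there are many flow directories, creating the links by hand gets tedious. The sketch below shows one way to script it in Python with pathlib. The helper name link_shared_package is hypothetical, and the code assumes the layout above, where the shared package and the flow directories sit side by side under a common root:

```python
from pathlib import Path


def link_shared_package(root, package, flow_dirs):
    """Create a relative symlink to `package` inside each flow directory."""
    root = Path(root)
    created = []
    for name in flow_dirs:
        link = root / name / package
        if link.is_symlink() or link.exists():
            continue  # leave existing links or directories untouched
        # Relative target, equivalent to running `ln -s ../crumpet .`
        # inside the flow directory.
        link.symlink_to(Path("..") / package, target_is_directory=True)
        created.append(link)
    return created
```

For example, link_shared_package(".", "crumpet", ["crumpetflow", "teatimeflow"]) run from the project root should produce the two links shown in the hierarchy above. Relative targets keep the links valid when the whole project is moved or cloned elsewhere.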

Using common packages from other Git repositories

The above hierarchy works well when everything is stored in a single Git repository. Technically, you could have crumpetflow, teatimeflow, and crumpet as separate repositories as well, but you would need to ensure that symlinks stay valid between repositories. This can be fragile.

If you want to include a package from a separate repository, a better approach is to use the git subtree command, which is an alternative to Git submodules. With git subtree you can nest a repository as a subdirectory of another repository. For instance, the crumpet package could be a repository of its own, included as a subtree in every flow project that wants to use it.

Alternatively, you can publish the package as a private Python package which you can include with @pypi or @conda.

Non-Python dependencies

By default, Metaflow packages .py files in the flow's directory hierarchy. You can also include arbitrary files in the package for remote execution by listing their file suffixes in the --package-suffixes option.

For instance, the example below shows how to include SQL files but you could also include custom binaries or configuration files.

Create a directory sql and store two files there:

sql/populate_table.sql

CREATE TABLE IF NOT EXISTS movie(title, year, score)
INSERT INTO movie VALUES ('Moana', 2016, 7.6), ('Scream', 2022, 6.3)

sql/get_movies.sql

SELECT title, year FROM movie

Here's an example flow that accesses the SQL files. Save it in moviesqlflow.py:

from metaflow import FlowSpec, step

class MovieSQLFlow(FlowSpec):

    @step
    def start(self):
        import sqlite3
        with sqlite3.connect("movies.db") as conn:
            cur = conn.cursor() 
            with open('sql/populate_table.sql') as f:
                for line in f:
                    cur.execute(line)
            with open('sql/get_movies.sql') as f:
                self.movies = cur.execute(f.read()).fetchall()
        self.next(self.end)

    @step
    def end(self):
        print("Found movies:")
        for title, year in self.movies:
            print(title, year)

if __name__ == '__main__':
    MovieSQLFlow()

Note that you need to refer to packaged files through relative paths, like sql/ above, instead of absolute paths like /home/alice/sql/. The directory layout is made available remotely but not in any guaranteed absolute location.

Execute the code as follows:

python moviesqlflow.py --package-suffixes .sql run

Locally, it would work without --package-suffixes but when running remotely --with batch or --with kubernetes, it would complain about missing .sql files unless --package-suffixes is specified.

You can confirm that all dependencies are included properly by executing:

python moviesqlflow.py --package-suffixes .sql package list
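
To build intuition for what --package-suffixes changes, the selection rule described on this page can be approximated in a few lines of Python. This is only a sketch of the rule as described here, not Metaflow's implementation; the function name list_package_candidates is hypothetical, and the real packaging logic may apply additional rules:

```python
import os


def list_package_candidates(root, suffixes=(".py",)):
    """Return sorted relative paths under `root` ending in one of `suffixes`."""
    matches = []
    # followlinks=True descends into symlinked directories; this assumes
    # the links do not form a cycle.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for filename in filenames:
            if filename.endswith(tuple(suffixes)):
                full_path = os.path.join(dirpath, filename)
                matches.append(os.path.relpath(full_path, root))
    return sorted(matches)
```

With the default suffixes, only moviesqlflow.py would match in the example directory. Passing suffixes=(".py", ".sql") would also pick up the two files under sql/, which mirrors the effect of adding --package-suffixes .sql on the command line.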
