Using AWS Batch

Here are some useful tips and tricks related to running Metaflow on AWS Batch. See our engineering resources for additional information about setting up and operating AWS Batch for Metaflow.

What value of @timeout should I set?

Metaflow sets a default timeout of 5 days so that your tasks don't get stuck indefinitely while running on AWS Batch. For more details on how to use @timeout, please read this.
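For illustration, here is a minimal sketch of overriding the default with a step-level @timeout; the two-hour limit and the flow name are arbitrary examples, not recommended values:

from metaflow import FlowSpec, batch, step, timeout

class TimeoutDemoFlow(FlowSpec):

    # Fail this task after two hours instead of waiting for the
    # 5-day default to kick in. The limit here is just an example.
    @timeout(hours=2)
    @batch
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TimeoutDemoFlow()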

How much @resources can I request?

Here are the current defaults for different resource types:

  • cpu: 1
  • memory: 4000 (4GB)

When setting @resources, keep in mind the configuration of your AWS Batch Compute Environment. Your job will be stuck in the RUNNABLE state if AWS is unable to provision the requested resources. Also, as a good practice, don't request more resources than your workflow actually needs. On the other hand, never optimize resources prematurely.
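As a sketch, a step-level @resources request might look like the following; the flow and step names are illustrative, and the request only takes effect when the run actually executes on AWS Batch (for example with --with batch):

from metaflow import FlowSpec, resources, step

class ResourceDemoFlow(FlowSpec):

    # Ask for 4 vCPUs and 10GB of memory (memory is expressed in MB).
    # Make sure your Compute Environment can actually provision this.
    @resources(cpu=4, memory=10000)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ResourceDemoFlow()

You would then launch it on AWS Batch with something like python resource_demo.py run --with batch, assuming the flow is saved as resource_demo.py.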

You can place your AWS Batch task in a specific queue by using the queue argument. By default, all tasks execute on a vanilla Python Docker image corresponding to the version of the Python interpreter used to launch the flow; this can be overridden using the image argument.
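For example, here is a minimal sketch of pinning a single step to a particular queue and image with the @batch decorator; the queue name below is a placeholder for a job queue that exists in your AWS Batch setup:

from metaflow import FlowSpec, batch, step

class QueueDemoFlow(FlowSpec):

    # Route this step to a specific AWS Batch job queue and run it on
    # a custom Docker image instead of the default Python image.
    # "my-batch-queue" is a placeholder queue name.
    @batch(cpu=2, memory=8000, queue="my-batch-queue", image="ubuntu:latest")
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    QueueDemoFlow()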

You can also specify the resource requirements on the command line:

$ python BigSum.py run --with batch:cpu=4,memory=10000,queue=default,image=ubuntu:latest

My job is stuck in RUNNABLE state. What do I do?

Consult this article.

Listing and killing AWS Batch tasks

If you interrupt a Metaflow run, any AWS Batch tasks launched by the run get killed by Metaflow automatically. Even if something went wrong during the final cleanup, the tasks will finish and die eventually, at the latest when they hit the maximum time allowed for an AWS Batch task.

If you want to make sure you have no AWS Batch tasks running, or you want to manage them manually, you can use the batch list and batch kill commands.

You can easily see what AWS Batch tasks were launched by your latest run with

$ python myflow.py batch list

You can kill the tasks started by the latest run with

$ python myflow.py batch kill

If you have started multiple runs, you can make sure there are no orphaned tasks still running with

$ python myflow.py batch list --my-runs

You can kill all tasks started by you with

$ python myflow.py batch kill --my-runs

If multiple runs are running, you can cherry-pick a specific run, e.g. 456, to be killed as follows

$ python myflow.py batch kill --run-id 456

If you are working with another person, you can kill their tasks related to this flow with

$ python myflow.py batch kill --user willsmith

Note that all the above commands only affect the flow defined in your script. You can work on many flows in parallel and be confident that kill only terminates tasks related to the flow it was called with.

Accessing AWS Batch logs

As a convenience feature, you can also see the logs of any past step as follows:

$ python BigSum.py logs 15/end

Disk space

You can request higher disk space on AWS Batch instances by using an unmanaged Compute Environment with a custom AMI.