Using AWS Batch
Here are some useful tips and tricks related to running Metaflow on AWS Batch. See our engineering resources for additional information about setting up and operating AWS Batch for Metaflow.
What value of @timeout should I set?
Metaflow sets a default timeout of 5 days so that your tasks don't get stuck indefinitely
while running on AWS Batch. For more details on how to use @timeout, please read
this.
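For example, to cap a step at two hours instead of relying on the default, you could write something like the following sketch (the flow and step names are made up):

from metaflow import FlowSpec, step, timeout, batch

class SlowFlow(FlowSpec):

    # fail this task if it runs for longer than 2 hours
    @timeout(hours=2)
    @batch(cpu=2, memory=8000)
    @step
    def start(self):
        # ... potentially long-running work goes here ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    SlowFlow()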
How much @resources can I request?
Here are the current defaults for different resource types:
- cpu: 1
- memory: 4000 (4GB)
When setting @resources, keep in mind the configuration of your AWS Batch Compute
Environment. Your job will be stuck in a RUNNABLE state if AWS is unable to provision
the requested resources. Additionally, as a good practice, don't request more resources
than your workflow actually needs. On the other hand, never optimize resources
prematurely.
You can place your AWS Batch task in a specific queue by using the queue argument. By
default, all tasks execute on a vanilla Python Docker
image corresponding to the version of the Python
interpreter used to launch the flow; this can be overridden using the image argument.
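The decorator form of these settings looks like this, as a minimal sketch (the queue name, image, and resource numbers are illustrative):

from metaflow import FlowSpec, step, batch

class MyFlow(FlowSpec):

    # request 4 vCPUs and 10 GB of memory on a specific queue and image
    @batch(cpu=4, memory=10000, queue='default', image='ubuntu:latest')
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    MyFlow()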
You can also specify the resource requirements on the command line:
$ python BigSum.py run --with batch:cpu=4,memory=10000,queue=default,image=ubuntu:latest
Using GPUs and Trainium instances with AWS Batch
To use GPUs in Metaflow tasks that run on AWS Batch, you need to run the flow in a Job Queue that is attached to a Compute Environment with GPU/Trainium instances.
To set this up, you can either modify the allowable instances in a Metaflow AWS deployment template or manually add such a compute environment from the AWS console. The steps are:
- Create a compute environment with GPU-enabled EC2 instances or Trainium instances.
- Attach the compute environment to a new Job Queue, for example one named my-gpu-queue.
- Run a flow with a GPU task in the my-gpu-queue job queue by
  - setting the METAFLOW_BATCH_JOB_QUEUE environment variable (see the example after this list), or
  - setting the METAFLOW_BATCH_JOB_QUEUE value in your Metaflow config, or
  - (most explicit) setting the queue parameter in the @batch decorator.
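For example, the environment variable route could look like this, assuming a flow file called myflow.py:
$ METAFLOW_BATCH_JOB_QUEUE=my-gpu-queue python myflow.py run --with batch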
It is a good practice to separate the job queues that you run GPU tasks on from those that do not require GPUs (or Trainium instances). This makes it easier to track hardware-accelerated workflows, which can be costly, independent of other workflows. Just add a line like
@batch(gpu=1, queue='my-gpu-queue')
in steps that require GPUs.
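Put together, a GPU step might look like the following sketch (the flow name and the step body are placeholders):

from metaflow import FlowSpec, step, batch

class TrainFlow(FlowSpec):

    # run this step on a single GPU in the dedicated GPU queue
    @batch(gpu=1, queue='my-gpu-queue')
    @step
    def start(self):
        # ... GPU-accelerated work goes here ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TrainFlow()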
My job is stuck in RUNNABLE state. What should I do?
Does the Batch job queue you are trying to run the Metaflow task in have a compute environment
with EC2 instances that can satisfy the requested resources? For example, if your job queue is connected to
a single compute environment whose only GPU instance type is p3.2xlarge, and a user requests 2
GPUs, that job will never get scheduled because a p3.2xlarge instance has only 1 GPU.
For more information, see this article.
My job is stuck in STARTING state. What should I do?
Are the resources requested in your Metaflow code/command sufficient? Especially when using custom GPU images, you might need to increase the requested memory to pull the container image into your compute environment.
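For example, you could bump the requested memory on the command line (the flow name and the amount below are just illustrations):
$ python myflow.py run --with batch:memory=16000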
Listing and killing AWS Batch tasks
If you interrupt a Metaflow run, any AWS Batch tasks launched by the run get killed by Metaflow automatically. Even if something went wrong during the final cleanup, the tasks will finish and die eventually, at the latest when they hit the maximum time allowed for an AWS Batch task.
If you want to make sure you have no AWS Batch tasks running, or you want to manage them
manually, you can use the batch list and batch kill commands.
You can easily see what AWS Batch tasks were launched by your latest run with
$ python myflow.py batch list
You can kill the tasks started by the latest run with
$ python myflow.py batch kill
If you have started multiple runs, you can make sure there are no orphaned tasks still running with
$ python myflow.py batch list --my-runs
You can kill all tasks started by your runs with
$ python myflow.py batch kill --my-runs
If you see multiple runs running, you can cherry-pick a specific run, e.g. 456, and kill its tasks as follows
$ python myflow.py batch kill --run-id 456
If you are working with other people, you can kill tasks started by another user, e.g. willsmith, with
$ python myflow.py batch kill --user willsmith
Note that all the above commands affect only the flow defined in your script. You can
work on many flows in parallel and be confident that kill terminates only tasks related to
the flow you called it with.
Accessing AWS Batch logs
As a convenience feature, you can also see the logs of any past step as follows:
$ python bigsum.py logs 15/end
Disk space
You can request higher disk space on AWS Batch instances by using an unmanaged Compute Environment with a custom AMI.
How to configure AWS Batch for distributed computing?
See these instructions if you want to use AWS Batch for distributed computing.