Dealing with Failures
Failures are a natural, expected part of data science workflows. Here are some typical reasons why you can expect your workflow to fail:
- Misbehaving code: no code is perfect. Your code may fail to handle edge cases, or libraries may behave differently than you expected.
- Unanticipated issues with data: data is harder than science. Data is how Metaflow workflows interact with the chaotic, high-entropy outside world. It is practically impossible to anticipate all possible ways the input data can be broken.
- Platform issues: the best infrastructure is invisible. Unfortunately, every now and then platforms that Metaflow relies on, or Metaflow itself, make their existence painfully obvious by failing in creative ways.
Metaflow provides straightforward tools for you to handle all these scenarios. If you are in a hurry, see the quick summary of the tools at the end of this section.
Retrying Tasks with the retry
Decorator
Retrying a failed task is the simplest way to try to handle errors. It is a particularly effective strategy with platform issues which are typically transient in nature.
You can enable retries for a step simply by adding the retry decorator to the step, like here:
library(metaflow)

start <- function(self){
    n <- rbinom(n=1, size=1, prob=0.5)
    if (n == 0){
        stop("Bad Luck!")
    } else {
        print("Lucky you!")
    }
}

end <- function(self){
    print("Phew!")
}

metaflow("RetryFlow") %>%
    step(step="start",
         decorator("retry"),
         r_function=start,
         next_step="end") %>%
    step(step="end",
         r_function=end) %>%
    run()
When you run this flow, you will see that sometimes it succeeds without a hitch, but sometimes the start step raises an exception and needs to be retried. By default, retry retries the step three times. Thanks to retry, this workflow will almost always succeed.
2020-06-19 15:48:16.653 [181/start/1076 (pid 54076)] Task is starting.
2020-06-19 15:48:22.441 [181/start/1076 (pid 54076)] Evaluation error: Bad Luck!.
2020-06-19 15:48:23.408 [181/start/1076 (pid 54076)] Task failed.
2020-06-19 15:48:23.602 [181/start/1076 (pid 54106)] Task is starting (retry).
2020-06-19 15:48:29.434 [181/start/1076 (pid 54106)] Evaluation error: Bad Luck!.
2020-06-19 15:48:30.665 [181/start/1076 (pid 54106)] Task failed.
2020-06-19 15:48:30.902 [181/start/1076 (pid 54139)] Task is starting (retry).
2020-06-19 15:48:34.704 [181/start/1076 (pid 54139)] [1] "Lucky you!"
2020-06-19 15:48:38.545 [181/start/1076 (pid 54139)] Task finished successfully.
It is highly recommended that you use retry every time you run your flow on the cloud. Instead of annotating every step with a retry decorator, you can also automatically add a retry decorator to all steps that do not have one as follows:
- Terminal:
Rscript retryflow.R run --with retry
- RStudio:
# Replace run() in retryflow.R with
# run(with = c("retry"))
# and execute in RStudio
How to Prevent Retries
If retries are such a good idea, why not enable them by default for all steps? First, retries only help with transient errors, like sporadic platform issues. If the input data or your code is broken, retrying will not help at all. Second, not all steps can be retried safely.
Imagine a hypothetical step like this:
withdraw_money_from_account <- function(self){
    library(httr)
    r <- POST('bank.com/account/123/withdraw',
              body=list(amount = 1000))
}
If you run this code with:
- Terminal:
Rscript moneyflow.R run --with retry
- RStudio:
# Replace run() in moneyflow.R with
# run(with = c("retry"))
# and execute in RStudio
you may end up withdrawing up to $4000 instead of the intended $1000. To make sure no one will accidentally retry a step with destructive side effects like this, you should set times=0 on the retry decorator in the step code:
metaflow("MoneyFlow") %>%
...
step(step="withdraw",
decorator("retry", times=0),
r_function=withdraw_money_from_account,
next_step="end") %>%
...
Now the code can be safely rerun, even using --with retry. All other steps will be retried as usual.
Most data science workflows do not have to worry about this. As long as your step only reads and writes Metaflow artifacts and/or performs only read-only operations with external systems (e.g. performs only SELECT queries, no INSERTs etc.), your step is idempotent and can be retried without concern.
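For instance, a step along these lines (compute_stats and values are illustrative names, not part of the flows above) only reads one artifact and writes two others, so retrying it simply recomputes the same results:
# Illustrative sketch of an idempotent step: it reads the artifact
# self$values and writes self$mean_value and self$max_value, so a retry
# just overwrites the same results.
compute_stats <- function(self){
    self$mean_value <- mean(self$values)
    self$max_value <- max(self$values)
}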
Maximizing Safety
By default, retry will retry the step three times before giving up. It waits for 2 minutes between retries on the cloud. This means that if your code fails fast, any transient platform issues need to get resolved in less than 10 minutes or the whole run will fail. 10 minutes is typically more than enough, but sometimes you want both a belt and suspenders.
If you have a sensitive production workflow which should not fail easily, there are three things you can do:
- You can increase the number of retries to times=4, which is currently the maximum number of retries.
- You can make the time between retries arbitrarily long, e.g. times=4, minutes_between_retries=20. This will give the task over an hour to succeed.
- You can use catch, described below, as a way to continue even after all retries have failed.
You can use any combination of these three techniques, or all of them together, to build rock-solid workflows.
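As a sketch of how the first two options can be combined in a step definition, reusing the decorator() syntax from the times=0 example above (train is a hypothetical step name):
# Hypothetical step configuration: up to 4 retries with 20 minutes between
# attempts, giving a transient issue well over an hour to clear.
step(step = "train",
     decorator("retry", times=4, minutes_between_retries=20),
     r_function = train,
     next_step = "end")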
Results of Retries
If the same code is executed multiple times by retry, are there going to be duplicate artifacts? No, Metaflow manages retries so that only artifacts from the last retry are visible. If you use the Client API to inspect results, you don't have to do anything special to deal with retries that may have happened. Each task will have only one set of results. Correspondingly, the logs returned by the task object show the output of the last attempt only.
If you want to know if a task was retried, you can retrieve the retry timestamps from the task metadata:
task <- task_client$new("RetryFlow/181/start/1076")
print(task$metadata_dict$attempt)
Catching Exceptions with the catch Decorator
As mentioned above, retry is an appropriate tool for dealing with transient issues. What about issues that are not transient? Metaflow has another decorator, catch, that catches any exceptions occurring during the step and then continues execution of subsequent steps.
The main upside of catch is that it can make your workflows extremely robust: it will handle all error scenarios from faulty code and faulty data to platform issues. The main downside is that your code needs to be modified so that it can tolerate faulty steps.
Let's consider issues caused by your code versus everything surrounding it separately.
Exceptions Raised by Your Code
By default, Metaflow stops execution of the flow when a step fails. It can't know what to do with failed steps automatically.
You may know that some steps are error-prone. For instance, this can happen with a step inside a foreach loop that iterates over unknown data, such as the results of a query or a parameter matrix. In this case, it may be desirable to let some tasks fail without crashing the whole flow.
Consider this example that is structured like a hyperparameter search:
library(metaflow)

start <- function(self){
    self$params <- c(-1, 2, 3)
}

sanity_check <- function(self){
    print(self$input)
    if (self$input < 0) {
        stop("input cannot be negative")
    }
}

join <- function(self, inputs){
    for (input in inputs){
        if (!is.null(input$compute_failed)){
            print(paste0("Exception happened for param: ", input$input))
            print("Exception message:")
            print(input$compute_failed)
        }
    }
}

metaflow("CatchFlow") %>%
    step(step = "start",
         r_function = start,
         next_step = "sanity_check",
         foreach = "params") %>%
    step(step = "sanity_check",
         decorator("catch", var = "compute_failed", print_exception = FALSE),
         r_function = sanity_check,
         next_step = "join") %>%
    step(step = "join",
         r_function = join,
         next_step = "end",
         join = TRUE) %>%
    step(step = "end") %>%
    run()
As you can guess, the above flow raises an error. Normally, this would crash the whole flow. However, in this example the catch decorator catches the exception, stores it in an instance variable called compute_failed, and lets the execution continue. The next step, join, contains logic to handle the exception.
The var argument is optional. The exception is not stored unless you specify it. You can also specify print_exception=FALSE to prevent the catch decorator from printing out the caught exception on stdout.
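For instance, if you only want to swallow the failure without inspecting it later, a catch decorator without var would be enough. A sketch based on the sanity_check step above:
# Sketch: catch without var lets the flow continue past a failed task
# without storing the exception as an artifact; print_exception=FALSE also
# silences the printout on stdout.
step(step = "sanity_check",
     decorator("catch", print_exception = FALSE),
     r_function = sanity_check,
     next_step = "join")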
Platform Exceptions
You could have dealt with the above error by wrapping the whole step in a tryCatch block. In effect, this is how catch deals with errors raised in the user code. In contrast, platform issues happen outside of your code, so you can't handle them with a tryCatch block. For instance, an AWS Batch container may fail to start, a server the step runs on may fail abruptly, or your code may get killed if it uses too much memory.
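As a rough sketch, handling the earlier error manually could look like the version of sanity_check below: the whole body runs inside tryCatch and the error message is kept in a step artifact (here NA marks success, a simplification of what catch records). This works for errors raised in your code, but not for failures that take down the process itself:
# Manual error handling with tryCatch inside the step: any error message is
# stored in self$compute_failed instead of crashing the task; NA means the
# check passed.
sanity_check <- function(self){
    self$compute_failed <- tryCatch({
        print(self$input)
        if (self$input < 0) {
            stop("input cannot be negative")
        }
        NA  # check passed, nothing to record
    }, error = function(e) conditionMessage(e))
}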
Let's simulate a platform issue like these with the following flow that kills itself without giving R a chance to recover:
library(metaflow)

start <- function(self) {
    library(tools)
    pskill(Sys.getpid())
}

end <- function(self) {
    message("Error is : ", self$start_failed)
}

metaflow("SuicidalFlowR") %>%
    step(
        decorator("catch", var = "start_failed"),
        decorator("retry"),
        step = "start",
        r_function = start,
        next_step = "end"
    ) %>%
    step(
        step = "end",
        r_function = end
    ) %>%
    run()
Note that we use both retry and catch above. retry attempts to run the step three times, hoping that the issue is transient. The hope is futile. The task kills itself every time.
After all retries are exhausted, catch takes over and records an exception in start_failed, notifying that all attempts to run start failed. Now it is up to the subsequent steps, end in this case, to deal with the scenario that start produced no results whatsoever. They can choose an alternative code path using the variable assigned by catch, start_failed in this example.
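For example, the end step could branch on that artifact roughly like this (a sketch; the fallback logic is whatever makes sense for your flow). As in the CatchFlow example above, the variable is assumed to be NULL when no exception was caught:
# Sketch of an alternative code path: use the results from start if it
# succeeded, otherwise fall back gracefully.
end <- function(self) {
    if (is.null(self$start_failed)) {
        message("start succeeded, proceeding normally")
    } else {
        message("start failed, using a fallback. Error: ", self$start_failed)
    }
}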
Summary
Here is a quick summary of failure handling in Metaflow:
- Use retry to deal with transient platform issues. You can do this easily on the command line with the --with retry option.
- Use retry with catch for extra robustness if you have modified your code to deal with faulty steps, which are handled by catch.
- Use catch without retry to handle steps that can't be retried safely. It is a good idea to use times=0 for retry in this case.