Metaflow wants to make debugging failed flows as painless as possible.
Debugging issues is a normal part of the development process. You should be able to develop and debug your Metaflow scripts the same way you develop any Python script locally.
The process of debugging failed flows is similar both for development-time and production-time issues:
1. Identify the step that failed. The failed step is reported on the last line of the error report, where it is easy to spot.
2. Identify the run id of the failed run. On the console output, each line is prefixed with an identifier like 2/start/21426, where 2 is the run id, start is the step name, and 21426 is the task id.
3. Reproduce the failed run with resume, as described below. Confirm that the error message you get locally matches the original error message.
4. Identify the failing logic inside the failed step. You can do this by adding print statements until resume reveals enough information. Alternatively, you can reproduce the faulty logic in a notebook using input data artifacts for the step, as described below in the section about notebooks.
5. Confirm that the fix works with resume. Return to step 4 if the error has not been fixed.
6. When the step works locally, rerun the whole flow from start to end and confirm that the fix works as intended.
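If you need to pull the run id, step name, and task id out of saved console output, a small regular expression is usually enough. The sketch below assumes the prefix format seen in the sample output on this page, `[run/step/task (pid N)]`. The names `PREFIX` and `parse_prefix` are hypothetical helpers, not part of Metaflow.

```python
import re

# Assumed prefix format, e.g. "[3/b/21638 (pid 13720)]".
PREFIX = re.compile(
    r"\[(?P<run_id>[^/\]]+)/(?P<step>[^/\]]+)/(?P<task_id>[^\s\]]+)"
    r" \(pid (?P<pid>\d+)\)\]"
)


def parse_prefix(line):
    """Return (run_id, step_name, task_id) from a console line, or None."""
    match = PREFIX.search(line)
    if match is None:
        return None
    return match.group("run_id"), match.group("step"), match.group("task_id")


line = "2018-01-27 22:59:40.362 [3/b/21638 (pid 13720)] Task failed."
print(parse_prefix(line))  # ('3', 'b', '21638')
```

Lines without a task prefix, such as the final "Workflow failed." line, return None.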
The resume command allows you to resume execution of a past run at a failed step. Resuming makes it easy to quickly reproduce the failure and iterate on the step code until a fix has been found.
Here is how it works. First, save the snippet below in a file called debug.py:
from metaflow import FlowSpec, step


class DebugFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = int('2fail')
        self.next(self.join)

    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        print('total is %d' % sum(input.x for input in inputs))
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    DebugFlow()
Run the script with:
python debug.py run
The run should fail. The output should look like:
...
2018-01-27 22:59:40.313 [3/b/21638 (pid 13720)] File "debug.py", line 17, in b
2018-01-27 22:59:40.313 [3/b/21638 (pid 13720)] self.x = int('2fail')
2018-01-27 22:59:40.314 [3/b/21638 (pid 13720)] ValueError: invalid literal for int() with base 10: '2fail'
2018-01-27 22:59:40.314 [3/b/21638 (pid 13720)]
2018-01-27 22:59:40.361 [3/a/21637 (pid 13719)] Task finished successfully.
2018-01-27 22:59:40.362 [3/b/21638 (pid 13720)] Task failed.
2018-01-27 22:59:40.362 Workflow failed.
Step failure:
Step b (task-id 21638) failed.
This shows that step b of run 3 failed. In your case, the run id could be different.
The resume command runs the flow similarly to run. However, in contrast to run, resuming reuses the results of every successful step instead of actually running them.
Try it with:
python debug.py resume
Metaflow remembers the run number of the last local run, which in this case is 3, so you should see resume reusing the results of the run above. Since we have not changed anything yet, you should see the above error again but with an incremented run number.
You can also resume a specific run using the CLI option --origin-run-id if you don't like the default value selected by Metaflow. To get the same behavior as above, you can also do:
python debug.py resume --origin-run-id 3
If you'd like programmatic access to the --origin-run-id selected for the resume (either implicitly selected by Metaflow as the last run invocation, or explicitly declared by the user via the CLI), you can use the current singleton, which exposes it as current.origin_run_id.
Next, fix the error by replacing int('2fail') with int('2'). Try again after the fix. This time, you should see the flow completing successfully.
Resuming uses the flow and step names to decide what results can be reused. This means that the results of previously successful steps will get reused even if you change their step code. You can add new steps and alter the code of failed steps safely with resume.
By default, resume resumes from the step that failed, like b above. Sometimes fixing the failed step requires re-execution of some steps that precede it.
You can choose the step to resume from by specifying the step name on the command line:
python debug.py resume start
This would resume execution from the step start. If you specify a step that comes after the step that failed, execution resumes from the failed step; you can't skip over steps.
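One way to picture which results get reused is to treat the flow as a graph: the step you resume from and everything downstream of it is executed again, and everything else is reused. The sketch below is a simplified mental model of DebugFlow, not Metaflow's actual implementation. `GRAPH`, `steps_to_rerun`, and `plan` are hypothetical names introduced for illustration.

```python
# Toy model of the DebugFlow graph: step name -> steps it transitions to.
GRAPH = {
    "start": ["a", "b"],
    "a": ["join"],
    "b": ["join"],
    "join": ["end"],
    "end": [],
}


def steps_to_rerun(graph, resume_from):
    """Return the resume step plus every step downstream of it."""
    to_visit = [resume_from]
    rerun = set()
    while to_visit:
        step = to_visit.pop()
        if step not in rerun:
            rerun.add(step)
            to_visit.extend(graph[step])
    return rerun


def plan(graph, resume_from):
    """Label each step as 'rerun' or 'reuse' for a given resume point."""
    rerun = steps_to_rerun(graph, resume_from)
    return {step: ("rerun" if step in rerun else "reuse") for step in graph}


print(plan(GRAPH, "b"))      # start and a are reused; b, join, end are rerun
print(plan(GRAPH, "start"))  # every step is rerun
```

In this model, resuming from b keeps the results of start and a, which is why a fix that depends on changes to an earlier step requires resuming from that earlier step instead.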
If your flow has Parameters, you can't change their values when resuming. Changing parameter values could change the results of any steps, including those that resume skips over, which could result in unexpected behavior in subsequent steps. The resume command reuses the parameter values that you set with run originally.
The above example demonstrates a trivial error. In real life, errors can be much trickier to debug. In the case of machine learning, a flow may fail because of an unexpected distribution of input data, although nothing is wrong with the code per se.
Being able to inspect data produced by every step is a powerful feature of Metaflow which can help in situations like this.
This clip (no audio) demonstrates inspecting values in a flow:
In the above clip, you will see:
In a Jupyter notebook, you can list all the flows and select the latest run of the Episode 1 flow;
Further, you can select the genre_movies step from this flow and inspect its value. As you can see, the value computed at that step is fully available via the Client API, and this works for any completed step, even steps that completed successfully in a failed run.
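The same kind of inspection can be done for the DebugFlow example from earlier on this page. The following is a minimal sketch, assuming Metaflow is installed in the notebook environment and at least one run of the flow has been recorded. `latest_artifact` is a hypothetical helper; `Flow`, `latest_run`, `task`, and `data` come from the Client API.

```python
def latest_artifact(flow_name, step_name, artifact_name):
    """Fetch one artifact from the latest run of a flow via the Client API."""
    # Imported lazily so the helper can be defined even where Metaflow is absent.
    from metaflow import Flow

    run = Flow(flow_name).latest_run  # most recent run, whether it failed or not
    task = run[step_name].task        # a task belonging to the chosen step
    return getattr(task.data, artifact_name)


# Example call (requires a recorded run of DebugFlow):
# latest_artifact('DebugFlow', 'a', 'x')
```

For a failed run of DebugFlow, step a still completed, so its artifact x remains readable. Asking for x from the failed step b would raise an error, since that artifact was never stored.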
For more details about the notebook API, see the Client API.