[pr-preview] PR Preview outage post-mortem

Hi folks,

Here's a post-mortem following the recent outage of PR Preview.

## Current situation ##
The problem is fully resolved and PR Preview is working normally.

To retrigger a build, please edit the body of the pull request on GitHub
(e.g. by adding a line break).

## Duration of outage ##
Multiple days. The outage lasted this long because the alert system did not
treat this failure as an error case.

## Cause of outage ##
The outage was due to GitHub having deprecated an API endpoint well ahead of
the announced schedule[1].

The problem was compounded by the logged error message not containing the
keyword "error". As a result, it wasn't picked up by the email alert system,
and I wasn't made aware of it until Anne filed an issue against the
repository[2].
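
To illustrate the failure mode, here is a minimal sketch (not the actual
PR Preview code) contrasting keyword-based alerting with a check that also
looks at the HTTP status of the upstream call. The `LogRecord` type, both
function names, the log message and the 410 status are all hypothetical and
chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    """Hypothetical structured log entry; the real log format is not shown here."""
    message: str
    http_status: Optional[int] = None  # status of the upstream API call, if any


def keyword_alert(record: LogRecord) -> bool:
    """Alert only when the message mentions 'error'."""
    return "error" in record.message.lower()


def status_aware_alert(record: LogRecord) -> bool:
    """Alert on the keyword, and also on any 4xx/5xx upstream status."""
    failed_call = record.http_status is not None and record.http_status >= 400
    return failed_call or keyword_alert(record)


# Hypothetical entry: a failed API call whose message lacks the keyword.
missed = LogRecord("Request to deprecated endpoint was rejected", http_status=410)
print(keyword_alert(missed), status_aware_alert(missed))
```

Keying the alert on a structured field rather than on the wording of the
message means a rephrased or unexpected message can't silently bypass it.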

## Solution ##
The cause of the outage was quickly identified. The solution was to upgrade
to the new API endpoint. This resolved the issue right away.

A support ticket was filed with GitHub to let them know about this issue.

## Follow-ups ##

- Issues were filed to (1) make sure that error handling would catch such
cases in the future[3], and (2) improve logging so that failed builds could
be re-run once the problem was solved[4].
- The search for funding to pay for work on these issues and others, and to
cover maintenance costs, should be re-prioritized.
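
As a rough sketch of follow-up (2), assuming failed builds were logged as
one JSON record each, the affected pull requests could be collected for a
re-run once the underlying problem is fixed. Nothing here reflects how
PR Preview is actually implemented: `FailedBuild`, `record_failure`,
`builds_to_rerun` and the sample repository name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass
class FailedBuild:
    """Hypothetical record carrying enough information to retry a build."""
    repo: str
    pull_number: int
    reason: str


def record_failure(log: List[str], build: FailedBuild) -> None:
    """Append the failure to the log as a single JSON line."""
    log.append(json.dumps(asdict(build)))


def builds_to_rerun(log: List[str]) -> List[Tuple[str, int]]:
    """Return each failed (repo, pull_number) once, in first-seen order."""
    seen = set()
    pending = []
    for line in log:
        entry = json.loads(line)
        key = (entry["repo"], entry["pull_number"])
        if key not in seen:
            seen.add(key)
            pending.append(key)
    return pending


log: List[str] = []
record_failure(log, FailedBuild("example/spec", 1, "token request rejected"))
record_failure(log, FailedBuild("example/spec", 1, "token request rejected"))
record_failure(log, FailedBuild("example/spec", 2, "token request rejected"))
print(builds_to_rerun(log))
```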

Thanks,

--tobie

---
[1]: https://developer.github.com/changes/2020-04-15-replacing-create-installation-access-token-endpoint/
[2]: https://github.com/tobie/pr-preview/issues/73
[3]: https://github.com/tobie/pr-preview/issues/74
[4]: https://github.com/tobie/pr-preview/issues/75

Received on Saturday, 27 June 2020 13:54:59 UTC