- From: Tobie Langel <tobie@unlockopen.com>
- Date: Sat, 27 Jun 2020 15:54:10 +0200
- To: spec-prod@w3.org
- Message-ID: <CAKgxSaCtLKhXuOBMP4R5wW1KB7FZGz5G0mxs8OCUDX1yqzuO1g@mail.gmail.com>
Hi folks,

Here's a post-mortem following the recent outage of PR Preview.

## Current situation ##

The problem is fully resolved and PR Preview is working normally. To retrigger a build, please edit the body of the pull request on GitHub (e.g. just add a line break).

## Duration of outage ##

Multiple days. The outage lasted this long because the alert system failed to treat this as an error case.

## Cause of outage ##

The outage was due to GitHub having deprecated an API endpoint well in advance of the announced plan[1].

The problem was compounded by the logged error message not containing the keyword "error". As a result, it wasn't picked up by the email alert system, and I wasn't made aware of it until Anne filed an issue against the repository[2].

## Solution ##

The cause of the outage was quickly identified. The solution was to upgrade to the new API endpoint. This resolved the issue right away. A support ticket was filed with GitHub to let them know about this issue.

## Follow-ups ##

- Issues were filed to (1) make sure that error handling would catch such cases in the future[3], and (2) improve logging so that failed builds could be re-run once the problem was solved[4].
- Searching for funding to pay for the development of these issues and others, and to cover maintenance costs, should be re-prioritized.

Thanks,

--tobie

---

[1]: https://developer.github.com/changes/2020-04-15-replacing-create-installation-access-token-endpoint/
[2]: https://github.com/tobie/pr-preview/issues/73
[3]: https://github.com/tobie/pr-preview/issues/74
[4]: https://github.com/tobie/pr-preview/issues/75
Received on Saturday, 27 June 2020 13:54:59 UTC