[pr-preview] PR Preview outage post-mortem

From: Tobie Langel <tobie@unlockopen.com>
Date: Sat, 27 Jun 2020 15:54:10 +0200
Message-ID: <CAKgxSaCtLKhXuOBMP4R5wW1KB7FZGz5G0mxs8OCUDX1yqzuO1g@mail.gmail.com>
To: spec-prod@w3.org
Hi folks,

Here's a post-mortem following the recent outage of PR Preview.

## Current situation ##
The problem is fully resolved and PR Preview is working normally.

To retrigger a build, please edit the body of the pull request on GitHub
(e.g. by adding a line break).

## Duration of outage ##
Multiple days. The outage lasted this long because the alert system did not
treat the failure as an error.

## Cause of outage ##
The outage was due to GitHub having deprecated an API endpoint well ahead
of the announced schedule[1].

The problem was compounded by the logged error message not containing
the keyword "error". As a result, it wasn't picked up by the email alert
system, and I wasn't made aware of it until Anne filed an issue against the

## Solution ##
The cause of the outage was quickly identified. The solution was to upgrade
to the new API endpoint. This resolved the issue right away.

A support ticket was filed with GitHub to let them know about this issue.

## Follow-ups ##

- Issues were filed to (1) make sure that error handling would catch such
cases in the future[3], and (2) improve logging so that failed builds could
be re-run once the problem was solved[4].
- The search for funding to pay for work on these issues and others, and
to cover maintenance costs, should be re-prioritized.
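
As a rough illustration of follow-ups (1) and (2), here is a minimal Python
sketch. It is not PR Preview's actual code: every name in it (BuildResult,
keyword_alert, outcome_alert, FailedBuildLog) is hypothetical, and the sample
message is made up. It only shows the general idea of alerting on the build
outcome instead of on the wording of a log line, and of remembering failed
builds so they can be re-run later.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class BuildResult:
    pr: str        # hypothetical identifier for the pull request
    ok: bool       # True only if the preview was built successfully
    message: str   # whatever text ended up in the log


def keyword_alert(message: str) -> bool:
    """Fragile approach: fires only if the log text contains the word 'error'."""
    return re.search(r"\berror\b", message, re.IGNORECASE) is not None


def outcome_alert(result: BuildResult) -> bool:
    """Fires on any build that did not succeed, whatever the log text says."""
    return not result.ok


@dataclass
class FailedBuildLog:
    """Remembers which pull requests failed so they can be re-run later."""
    failed: List[str] = field(default_factory=list)

    def record(self, result: BuildResult) -> None:
        # Store each failed pull request once, in the order it first failed.
        if not result.ok and result.pr not in self.failed:
            self.failed.append(result.pr)

    def drain(self) -> List[str]:
        """Return the pull requests to rebuild and clear the log."""
        pending, self.failed = self.failed, []
        return pending


# Made-up failure whose message never mentions the word "error".
result = BuildResult(pr="example/spec#1", ok=False,
                     message="410 Gone: endpoint removed")
log = FailedBuildLog()
log.record(result)
```

With this made-up input, keyword_alert(result.message) stays silent while
outcome_alert(result) fires, and log.drain() hands back the pull requests to
rebuild once the underlying problem is fixed.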



[2]: https://github.com/tobie/pr-preview/issues/73
[3]: https://github.com/tobie/pr-preview/issues/74
[4]: https://github.com/tobie/pr-preview/issues/75
Received on Saturday, 27 June 2020 13:54:59 UTC
