w3id.org downtime post-mortem

Last night, we suffered the biggest outage to date of w3id.org. The
problem was rooted in an unscheduled switch-over of DNS associated with
a domain transfer of w3id.org.

Here's a summary of what happened:

1. A bug was filed on the site asking why the domain wasn't pre-paid
years in advance.
2. A decision was made to move domain registrars to a) enable more
organizations to control the DNS record and b) pre-pay the domain for
the maximum amount of time allowable.
3. The domain transfer request was approved, and the transfer took a
few days to complete.
4. A selection was made during the domain transfer process to migrate
the DNS entries over as well, but the migration failed for an unknown
reason.
5. This resulted in the DNS records being wiped out when the domain was
moved. This happened around 2am ET today.
6. The problem was reported at 3am ET as the old DNS entries timed out
and the new ones took their place.
7. w3id.org admins started to take action at 8am ET and the problem was
resolved. Presently, we're waiting for the new entries to propagate
around the world.

What failed:

1. The DNS entries were not migrated. This was either a bug in the new
registrar's process or an oversight by the admin (me) who initiated
the migration.
2. Our monitoring software was checking HTTP instead of HTTPS. By
default, the registrar sets up an HTTP server for the domain, and since
that server was responding with a 200 OK, our monitoring software
thought everything was OK.
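
To illustrate item 2, here is a minimal sketch of a check that would
not have been fooled by a placeholder page answering 200 OK. It is not
the monitoring software we actually run. It assumes the probed URL is
one that is supposed to redirect, and it only counts the response as
healthy if it arrives over HTTPS and redirects to the expected target.
The function names and any URLs passed in are hypothetical.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def is_healthy(status, location, expected_location):
    """Return True only if the response is the redirect we expect.

    A bare 200 OK counts as a failure here, because a registrar's
    placeholder page also answers 200.
    """
    return status in REDIRECT_CODES and location == expected_location


def probe(url, timeout=10):
    """Fetch `url` over HTTPS without following redirects.

    Returns a (status, Location header or None) tuple.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("probe only HTTPS URLs: " + url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

A monitor would call probe() on a known redirect and pass the result
to is_healthy(), alerting whenever it returns False.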

What has been done as a result of the outage:

1. w3id.org has a new registrar and the domain registration has been
extended until the year 2024. It is unlikely that we'll move registrars
again.
2. Our monitoring software has been updated to monitor HTTPS as well.

What still needs to be done:

1. We will monitor the propagation of the new DNS entries over the
next several hours to confirm it proceeds as expected.
2. We need to distribute the login details to the rest of the trustees
associated with w3id.org.
3. We need to add the trustees to the monitoring software warning list.
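
For item 1, a rough sketch of a propagation check, assuming we know
the IPv4 address(es) the new records should point at. It only asks
whichever resolver the local machine is configured to use, so it would
have to be run from several networks to say anything about propagation
"around the world". The function names are hypothetical, and the
address in the usage note is a documentation placeholder, not the real
w3id.org address.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port)).
    return {info[4][0] for info in infos}


def has_propagated(hostname, expected, resolver=resolve_ipv4):
    """True if the resolver returns exactly the expected addresses."""
    try:
        return resolver(hostname) == set(expected)
    except socket.gaierror:
        # The name did not resolve at all, e.g. the records are missing.
        return False
```

Calling has_propagated("w3id.org", ["192.0.2.10"]) every few minutes
and logging the result would show when the local resolver picks up the
new entries.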

All the gory details were tracked here (updates will be posted here as
well):

https://github.com/perma-id/w3id.org/issues/81

-- manu

-- 
Manu Sporny (skype: msporny, twitter: manusporny, G+: +Manu Sporny)
Founder/CEO - Digital Bazaar, Inc.
blog: High-Stakes Credentials and Web Login
http://manu.sporny.org/2014/identity-credentials/

Received on Wednesday, 13 May 2015 15:12:30 UTC