Teleconference System Failure and Mitigation (post mortem)

On 9/18/20 7:56 PM, wrote:

For those of you who were on last week's call, we experienced a hard
lock-up on the W3C CCG teleconferencing system. Here's what we believe
happened:

1. We had sized the server to optimize for costs, which meant
   we only allocated 4GB of RAM.
2. We tested this configuration with around 25 people connected
   simultaneously, all broadcasting video. The server was
   stable at ~60% memory usage. We thought we were good.
3. The call last week had 37 people at its peak. We were good
   for the first 30 minutes or so.
4. The presenter then started broadcasting video. The default
   is to broadcast in the highest definition possible, so the
   server started pushing around 300Mbps in video streams to
   everyone, and it only had enough memory to cache around
   5 seconds of video before running out of memory. We went
   outside of that envelope, the server locked up hard, and
   that was that. We switched over to Zoom and continued the
   meeting with about 5 minutes of downtime.
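
As a back-of-the-envelope check on the figures in item 4, the sketch
below converts the ~300Mbps aggregate rate into megabytes per second and
into the size of a ~5 second buffer. The function names are made up for
illustration, and dividing the aggregate evenly across all 37 participants
is a simplifying assumption, not something we measured.

```python
# Rough arithmetic only; the inputs are the approximate figures quoted above.

def megabytes_per_second(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


def buffer_megabytes(mbps: float, seconds: float) -> float:
    """Memory needed to hold `seconds` worth of a stream running at `mbps`."""
    return megabytes_per_second(mbps) * seconds


def per_participant_mbps(total_mbps: float, participants: int) -> float:
    """Average share of the aggregate rate, assuming an even split (an assumption)."""
    return total_mbps / participants


TOTAL_MBPS = 300.0       # approximate aggregate outbound video rate
CACHE_SECONDS = 5.0      # approximate seconds of video the server could cache
PEAK_PARTICIPANTS = 37   # peak attendance on the call

if __name__ == "__main__":
    print(f"{megabytes_per_second(TOTAL_MBPS):.1f} MB/s aggregate")
    print(f"{buffer_megabytes(TOTAL_MBPS, CACHE_SECONDS):.1f} MB for {CACHE_SECONDS:.0f} s")
    print(f"{per_participant_mbps(TOTAL_MBPS, PEAK_PARTICIPANTS):.2f} Mbps per participant")
```

In other words, 5 seconds of video at that rate is on the order of a
couple hundred megabytes, which is a small margin on a 4GB machine that
was already carrying 37 connections.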

We have done the following in an attempt to prevent this from happening
again:

1. Doubled the amount of RAM on the server to 8GB.
2. Added a 16GB swap volume in case we exceed the RAM
   allocated to the machine.
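
To watch how close the machine gets to the new limits, something like the
following could be run periodically. This is a minimal sketch that assumes
a Linux host exposing /proc/meminfo (this message does not say what OS the
server runs); the 0.8 warning threshold and the function names are
hypothetical.

```python
import os

WARN_FRACTION = 0.8  # hypothetical alert threshold, not a measured value


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def ram_used_fraction(info: dict) -> float:
    """Fraction of RAM in use, based on MemAvailable."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def swap_used_kb(info: dict) -> int:
    """Swap currently in use, in kB (0 if no swap is configured)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    used = ram_used_fraction(info)
    print(f"RAM used: {used:.0%}, swap used: {swap_used_kb(info)} kB")
    if used > WARN_FRACTION or swap_used_kb(info) > 0:
        print("WARNING: memory headroom is getting thin")
```

Any swap usage at all is treated as a warning here, since the swap volume
is meant as a safety net rather than working memory.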

We believe this should address the issue experienced during the last call.

Running in production is the ultimate test of your system design
assumptions. :)

-- manu

Manu Sporny -
Founder/CEO - Digital Bazaar, Inc.
blog: Veres One Decentralized Identifier Blockchain Launches

Received on Saturday, 19 September 2020 13:58:58 UTC