RE: Teleconference System Failure and Mitigation (post mortem) from John, Anil on 2020-09-19 (public-credentials@w3.org from September 2020)

From: John, Anil <anil.john@hq.dhs.gov>
Date: Sat, 19 Sep 2020 14:29:42 +0000
To: Manu Sporny <msporny@digitalbazaar.com>, "public-credentials@w3.org" <public-credentials@w3.org>
Message-ID: <BLAPR09MB728489ACB80C4DD41D4B97FDC53C0@BLAPR09MB7284.namprd09.prod.outlook.com>

>When the presenter started broadcasting video, the default is to broadcast in the highest definition possible
>which means the server started doing around 300Mbps in video streams to everyone and only had enough 
>memory to cache around 5 seconds of video before running out of memory

As said presenter, I was happy to informally red team your defaults, assumptions and expectations :-)

Best Regards,

Anil

Anil John
Technical Director, Silicon Valley Innovation Program 
Science and Technology Directorate 
US Department of Homeland Security 
Washington, DC, USA 

Email Response Time – 24 Hours

-----Original Message-----
From: Manu Sporny <msporny@digitalbazaar.com> 
Sent: Saturday, September 19, 2020 9:59 AM
To: public-credentials@w3.org
Subject: Teleconference System Failure and Mitigation (post mortem)

On 9/18/20 7:56 PM, kimdhamilton@gmail.com wrote:
> TL;DR: NEW AND IMPROVED JITSI, FEATURING MORE RAM.

For those of you who were on last week's call, we experienced a hard
lock-up on the W3C CCG teleconferencing system. Here's what we believe
happened:

1. We had sized the server to optimize for costs, which meant
   we only allocated 4GB of RAM.
2. We tested this configuration with around 25 people connected
   simultaneously, all broadcasting video. The server was
   stable at ~60% memory usage. We thought we were good.
3. The call last week had 37 people at its peak. We were good
   for the first 30 minutes or so.
4. When the presenter started broadcasting video, the default
   is to broadcast in the highest definition possible, which
   means the server started doing around 300Mbps in video
   streams to everyone and only had enough memory to cache
   around 5 seconds of video before running out of memory. We
   went outside of that envelope, the server locked up hard,
   and that was that. We switched over to Zoom and continued
   the meeting with about 5 minutes of downtime.
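
As a rough back-of-envelope check on the figures in item 4, the sketch below converts an aggregate bitrate and a buffering window into bytes. The function names are hypothetical, decimal units (1 MB = 10^6 bytes) are an assumption, and the model ignores everything else the server keeps in memory, so treat the results as order-of-magnitude only.

```python
# Back-of-envelope sizing sketch. Assumes decimal units and that the
# only memory consumer is buffered video, which is a simplification.

MB = 1_000_000
GB = 1_000_000_000


def bytes_for_seconds(seconds: float, bitrate_mbps: float) -> float:
    """Bytes needed to hold `seconds` of traffic at `bitrate_mbps`."""
    return seconds * bitrate_mbps * 1_000_000 / 8


def buffer_seconds(free_bytes: float, bitrate_mbps: float) -> float:
    """Seconds of traffic that fit into `free_bytes` at `bitrate_mbps`."""
    return free_bytes / (bitrate_mbps * 1_000_000 / 8)


# Figures from the post mortem: ~300 Mbps aggregate, ~5 seconds of cache.
headroom = bytes_for_seconds(5, 300)
print(f"{headroom / MB} MB of headroom")  # 187.5 MB of headroom

# Hypothetical upper bound: if all 4 GB of added RAM went to buffering.
print(f"{buffer_seconds(4 * GB, 300):.1f} s")  # 106.7 s
```

In other words, a 5 second window at roughly 300 Mbps corresponds to under 200 MB of free memory, which suggests the server was already close to its limit before the presenter's video started.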

We have done the following in an attempt to prevent this from happening
again:

1. Doubled the amount of RAM on the server to 8GB.
2. Added a 16GB swap volume in case we exceed the RAM
   allocated to the machine.

We believe this should address the issue experienced during the last call.
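
One way to keep an eye on whether the extra RAM and swap are enough is to sample /proc/meminfo during a call. The following is a minimal sketch, assuming a Linux host; the function names and the sample values are made up for illustration and do not describe how the CCG server is actually monitored.

```python
# Minimal memory-pressure sketch for a Linux host. SAMPLE holds made-up
# values in /proc/meminfo format, not measurements from the real server.

SAMPLE = """\
MemTotal:        8000000 kB
MemAvailable:    3200000 kB
SwapTotal:      16000000 kB
SwapFree:       16000000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def ram_in_use(info: dict[str, int]) -> float:
    """Fraction of RAM that is not available to new allocations."""
    return 1.0 - info["MemAvailable"] / info["MemTotal"]


def swap_in_use_kb(info: dict[str, int]) -> int:
    """Swap currently in use, in kB."""
    return info["SwapTotal"] - info["SwapFree"]


info = parse_meminfo(SAMPLE)
print(f"RAM in use: {ram_in_use(info):.0%}")    # RAM in use: 60%
print(f"Swap in use: {swap_in_use_kb(info)} kB")  # Swap in use: 0 kB
```

On a live machine, the same functions could be fed the contents of /proc/meminfo instead of SAMPLE. Swap usage climbing during a call would be an early hint that the RAM allocation is being exceeded again.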

Running in production is the ultimate test of your system design assumptions. :)

-- manu

--
Manu Sporny - https://www.linkedin.com/in/manusporny/

Founder/CEO - Digital Bazaar, Inc.
blog: Veres One Decentralized Identifier Blockchain Launches https://tinyurl.com/veres-one-launches

Received on Saturday, 19 September 2020 14:30:32 UTC