Re: [w3c/ServiceWorker] consider allowing multiple worker thread instances for a single registration (#756)

I've spent the last 2 weeks or so trying to understand the performance impact, if any, of allowing multiple service worker thread instances.

**TL;DR**:  I believe allowing multiple service worker instances could provide some measurable benefit in multi-process browser architectures.

The details:

For this investigation I mainly used Jake's excellent benchmark here:

https://jakearchibald.github.io/service-worker-benchmark/

I tested the following browsers:

* Firefox 51 in our current default configuration of a parent process and a single content process (renderer).  I will refer to this as FF51.
* A modified Firefox 51 that spawns N threads and dispatches events to them in a round-robin fashion. I'll call this "FF51 Pool(N)" where N is the number of threads.  So "FF51 Pool(16)" for 16 threads, etc.
* A modified Firefox 51 that executes the service worker in a separate process.  This reflects the kind of architectural changes we need to make to support multiple content processes (renderers) while only spawning a single service worker instance.  I'll call this "FF51 multi-e10s".
* Chrome 52
* Chrome Canary 54
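The round-robin dispatch in the Pool(N) builds can be sketched roughly like this (a minimal illustration in plain JavaScript; `WorkerPool`, `dispatch`, and `handleEvent` are illustrative names, not the actual Gecko internals):

```javascript
// Minimal sketch of round-robin event dispatch across N worker instances.
// In the real Pool(N) builds these would be service worker threads; here
// they are plain handler objects so the dispatch logic stands alone.
class WorkerPool {
  constructor(workers) {
    this.workers = workers; // array of objects with a handleEvent() method
    this.next = 0;          // index of the worker that gets the next event
  }
  dispatch(event) {
    const worker = this.workers[this.next];
    this.next = (this.next + 1) % this.workers.length; // advance the cursor
    return worker.handleEvent(event);
  }
}

// Example: 4 "workers" that just record which events they saw.
const workers = Array.from({ length: 4 }, (_, i) => ({
  id: i,
  seen: [],
  handleEvent(e) { this.seen.push(e); return this.id; },
}));
const pool = new WorkerPool(workers);
const assigned = [0, 1, 2, 3, 4, 5].map((e) => pool.dispatch(e));
console.log(assigned); // events cycle through workers: [0, 1, 2, 3, 0, 1]
```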

All of my tests were run on a developer-grade Windows 10 desktop.  Future testing should include mobile, but some of these configurations are not easily achievable in Firefox for Android at the moment.

I tried to answer two questions in this investigation:

1. Does a thread pool model provide any performance benefit during load?
2. Does co-locating the service worker thread in the same process as the content window offer any benefit during load?  Here I mean being able to co-locate the SW thread immediately in the same process without any IPC.

In general I only looked at the case where the service worker thread (or threads) was already running.  Clearly cold start is important, but for now I was trying to see if we should consider this at all given the most favorable circumstances.

Let's answer question (1) first.  Does a thread pool help?  Consider these load times for refreshing the benchmark with the cached-fetch script:

FF51 Mean: 311ms
FF51 Median: 302ms

FF51 Pool(2) Mean: 311ms
FF51 Pool(4) Mean: 311ms
FF51 Pool(8) Mean: 310ms
FF51 Pool(16) Mean: 307ms
FF51 Pool(32) Mean: 303ms
FF51 Pool(64) Mean: 306ms

There is perhaps a slight effect here, but it's quite small compared to the overhead of the extra threads.  If you look at the medians, however, you can see that there is no real effect:

FF51 Pool(2) Median: 300ms
FF51 Pool(4) Median: 309ms
FF51 Pool(8) Median: 301ms
FF51 Pool(16) Median: 304ms
FF51 Pool(32) Median: 301ms
FF51 Pool(64) Median: 303ms

It seems the slight improvement in the mean times was simply due to reduced outliers.  Again, this does not seem worth the overhead.
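The mean/median gap is the usual outlier story; a tiny worked example (with made-up numbers, not taken from the benchmark data) shows the pattern:

```javascript
// Illustrative only: five "load times" with one slow outlier. The outlier
// drags the mean upward while leaving the median untouched, which is the
// same shape as the Pool(N) mean-vs-median numbers above.
const samples = [300, 301, 302, 303, 450]; // ms; 450 is the outlier

const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
const median = [...samples].sort((a, b) => a - b)[Math.floor(samples.length / 2)];

console.log(mean);   // 331.2 — pulled up by the single slow run
console.log(median); // 302   — unaffected by the outlier
```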

**A thread pool does not help a well behaved caching service worker on desktop.**

I should note that the thread pool does help with the "blocking worker" case in the benchmark.  This is expected.  The question is whether we should solve that problem with multiple service worker instances or by providing access to SharedWorker or DedicatedWorker for heavy CPU work.

Now let's look at question (2).  Are there architectural advantages in allowing service worker threads to run co-located with controlled windows in separate content or renderer processes?

To investigate this we can make a rough comparison between Firefox and Chrome.  Firefox currently executes service workers entirely in the content (renderer) process.  I believe Chrome has additional IPC messaging involved in dispatching a FetchEvent because it supports multiple renderer processes.

What does the current data show here?

FF51 Mean: 311ms
FF51 Median: 302ms

Chrome 52 Mean: 376ms
Chrome 52 Median: 383ms

Chrome Canary 54 Mean: 353ms
Chrome Canary 54 Median: 361ms

Remember these numbers are for the reload case where the service worker is already running.

FF51 currently runs about 65ms faster than Chrome 52 and about 40ms faster than Chrome Canary 54.

But is this due to the content process model and the IPC messaging?  It's hard to say with such different engines.

To test further, I adapted some work-in-progress patches that move Firefox toward a multiple-content-process architecture for service workers.  This adds IPC messaging to FetchEvent dispatch.

The initial patch adding IPC to the Firefox FetchEvent path regressed our numbers from 311ms to 450ms.  After additional profiling and optimization I was able to reach these numbers:

FF51 multi-e10s Mean: 387ms
FF51 multi-e10s Median: 377ms

This is nearly identical to the current Chrome 52 numbers.

While further optimizations are possible, it seems reasonable to say that requiring IPC to dispatch a FetchEvent adds 50ms to 75ms in current multiple process architectures.  Therefore it does seem desirable to allow service worker thread instances to run co-located in the same process as the controlled window.  It would be nice if the spec allowed this.
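The 50ms-to-75ms estimate comes straight from the deltas between the in-process and cross-process means reported above:

```javascript
// Deltas between the in-process FF51 configuration and the two
// cross-process (IPC-dispatching) configurations, using the mean
// load times reported above.
const ff51 = 311;           // FF51, SW in the content process (mean, ms)
const ff51MultiE10s = 387;  // FF51 multi-e10s, FetchEvent over IPC (mean, ms)
const chrome52 = 376;       // Chrome 52, multi-renderer architecture (mean, ms)

console.log(ff51MultiE10s - ff51); // 76 — IPC cost in the Firefox prototype
console.log(chrome52 - ff51);      // 65 — gap between Chrome 52 and FF51
```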

Data for the measurements I took can be found here:

https://docs.google.com/spreadsheets/d/1zqFA6JwMNmwL67oSQlSV70Rj8fUZ2QcVvDrUNFePrFI/edit?usp=sharing

https://github.com/w3c/ServiceWorker/issues/756#issuecomment-242217294

Received on Wednesday, 24 August 2016 21:41:31 UTC