- From: Ryan Heise <ryan@ryanheise.com>
- Date: Thu, 22 Jul 2010 06:11:34 +1000
Web Workers offer reliability and performance benefits on multi-core CPUs to a certain class of parallelisable tasks: those primarily concerned with raw computation over data access. The lower the ratio of data access to raw computation, the greater the performance benefit, because data must be serialised and copied to reach a child Web Worker, and serialised and copied again before the resulting data can be stored in the parent application.

However, there is also a significant class of parallelisable tasks that are primarily concerned with data access over raw computation. Many of these tasks occur in games programming and include array sorting, texture generation, artificial intelligence and various other kinds of data analysis. Unfortunately, as the ratio of data access to raw computation increases, the utilisation of multi-core CPUs approaches zero. In this class of tasks it is not uncommon for the bandwidth cost of a postMessage call to equal or even exceed the benefit gained from parallel processing. The result is that a data-heavy parallel algorithm built on the Web Workers API may run at the same speed whether one core or eight cores are utilised. Worse, it can reach the point where the more cores your CPU has, the slower your algorithm runs. In such cases, the best performance is achieved by coding for a single core.

To make things more concrete, consider the following scenarios (sketches of the first two follow this list):

1. Implementing a parallel quicksort using Web Workers. The best performance is achievable with an array of plain numbers. If instead we have an array of complex objects to be sorted by keys within those objects, then we quickly lose the ability to exploit the multi-core CPU as the size of the objects increases, because the cost of copying the data outweighs any performance benefit from the parallel execution of the (relatively small amount of) raw computation per array index.

2. Texture generation. A simple loop that generates an array of numbers may do very little processing besides the actual memory access. A simple example is initialising an array of random numbers (i.e. a noise texture) which can later be used by image drawing routines; similar textures are easy to imagine. You might suggest working around this by generating the noise texture in one Web Worker and each other independent texture in its own Web Worker. However, this approach is not scalable: the number of CPU cores may grow to 32 or 64 in the future while the number of textures in the program remains at 3 or 4. You would still like to utilise all 64 cores in that case.

3. Artificial intelligence. Typically an artificial intelligence needs to analyse data and choose the best option based on the data available, something that is often parallelisable. However, it is sometimes the case that each parallel Web Worker must navigate a small portion of a large data set without it being easy to predict in advance which portion that will be. Hence, each Web Worker requires random access to the entire data set.

While Web Workers undoubtedly provide both reliability and performance benefits to a certain class of parallelisable tasks, they are not sufficient on their own to handle the class of data-heavy parallelisable tasks described above.
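To illustrate scenario 1, here is a minimal sketch (the worker script name is hypothetical) of where the two copies occur when sorting an array of objects in a Web Worker. The comparison work per element is tiny, but postMessage must clone every field of every object in both directions:

    // main page: build objects whose payload is dead weight for the sort
    var items = [];
    for (var i = 0; i < 100000; i++) {
      items.push({
        key: Math.random(),               // the only field the sort reads
        payload: new Array(50).join('x')  // copied anyway, in both directions
      });
    }

    var worker = new Worker('sort-worker.js');  // hypothetical script name
    worker.postMessage(items);                  // copy #1: page -> worker
    worker.onmessage = function (e) {
      items = e.data;                           // copy #2: worker -> page
    };

    // sort-worker.js: the actual computation is one comparison per step
    onmessage = function (e) {
      var items = e.data;
      items.sort(function (a, b) { return a.key - b.key; });
      postMessage(items);                       // triggers copy #2
    };

As the payload grows, the two serialise-and-copy passes come to dominate the comparison work, which is exactly the effect described above.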
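Scenario 2 can be sketched just as briefly (again with a hypothetical script name). The loop body is a single memory write, so the copy performed by postMessage on the way back is on the same order as the generation work itself:

    // noise-worker.js: fill a flattened texture with random values
    onmessage = function (e) {
      var size = e.data;
      var noise = new Array(size);
      for (var i = 0; i < size; i++) {
        noise[i] = Math.random();  // almost no work beyond the write
      }
      postMessage(noise);          // the whole texture is copied back
    };

    // main page
    var worker = new Worker('noise-worker.js');
    worker.postMessage(512 * 512);       // a 512x512 texture, flattened
    worker.onmessage = function (e) {
      var noise = e.data;                // second copy lands here
      // hand `noise` to the image drawing routines...
    };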
Note that things might have been different had Javascript been a purely functional language. In that case, much safer and more efficient alternatives to making whole copies of data could be implemented under the hood. But Javascript is not a purely functional language, and so this approach to concurrency comes at a significant cost for this specific class of parallelisable tasks.

Is this an important issue to address? I believe yes. In my view, Javascript is a general purpose, multi-paradigm language. It does not really try to enforce any particular style of programming upon you, and supports many. Limiting Javascript's concurrency offering to Web Workers would seem to be out of line with this. Secondly, just as with Flash, we should expect that a popular use of HTML5 will be games development. As we move into the era of multi-core CPUs, developers will want to exploit multi-core performance in their games. Unfortunately, Web Workers do not offer the desired performance benefits in a significant set of cases. Games programmers would not necessarily care about the reliability benefits of the Web Workers model, or its ability to potentially scale to multi-CPU architectures lacking shared memory. After all, in games we are dealing with desktop computers, local processing, and architectures that almost certainly provide shared memory for performance reasons. We will want to make use of it.

For all of the reasons above, I would like to see something like threads in Javascript. Yes, threads give rise to race conditions and deadlocks, but this seems to be in line with Javascript's apparent philosophy of doing very little static error checking and letting things just happen at runtime (e.g. the absence of a static type system). In other words, this may simply be a case of: yes, Javascript allows runtime errors to happen. Is preventing deadlocks important enough that we should make it impossible for a certain class of algorithms to exploit multi-core CPUs? As a final thought, it is actually possible to detect deadlock conditions at runtime and immediately throw an exception rather than simply hang the web browser. This, in my mind, makes a deadlock as good a user experience as any other sort of runtime error.

Before I sign off, there is one more feature which (correct me if I'm wrong) is lacking from the current specification. There is currently no way for a program to find out how many cores are present in the host system. Without this, there is no way to know how many Web Workers to create for an algorithm that could easily be parallelised across any number of Web Workers / threads. Even a parallel quicksort, say, should not just create a new thread for each recursive invocation, as deep as it goes. For efficiency, thread creation should stop as soon as enough threads have been created to match the number of physical cores; after that point, each core should handle its load by reverting to a single-threaded quicksort. A sketch of this idea follows.
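Here NUM_CORES is simply a hard-coded placeholder, since the specification offers no way to query the real value, and the recursion only illustrates how the budget of workers would be split and exhausted; handing each half to a real Web Worker would of course incur the copies discussed earlier:

    var NUM_CORES = 4;  // placeholder: no API exists to discover this

    function parallelSort(array, workerBudget) {
      if (array.length < 2) return array;
      if (workerBudget < 2) {
        // budget exhausted: revert to an ordinary single-threaded sort
        return array.sort(function (a, b) { return a - b; });
      }
      var pivot = array[0], left = [], right = [];
      for (var i = 1; i < array.length; i++) {
        (array[i] < pivot ? left : right).push(array[i]);
      }
      // each half would go to its own worker, with half the budget
      left = parallelSort(left, workerBudget / 2);
      right = parallelSort(right, workerBudget / 2);
      return left.concat([pivot], right);
    }

    parallelSort([5, 3, 8, 1, 9, 2], NUM_CORES);

-- Ryan Heise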
Received on Wednesday, 21 July 2010 13:11:34 UTC