RE: Workers

> -----Original Message-----
> From: Ian Hickson [mailto:ian@hixie.ch]
> Sent: Wednesday, August 27, 2008 6:18 AM
> To: Justin James
> Cc: public-html@w3.org
> Subject: RE: Workers
> 
> On Sun, 10 Aug 2008, Justin James wrote:
> > >
> > > I was going to add a note, but then I noticed that it actually
> > > already says it twice -- once in the "create a worker" algorithm,
> > > and once in the "run a worker" algorithm. What should I add to make
> > > it clearer?
> >
> > I think that making it step #1 in the enumerated list would do the
> > trick. The last time I looked at it, I realized that the reason I
> > kept missing it is because I was looking at the list to see what was
> > happening, but it is in the paragraph before the list. Since it *is*
> > a step in creating the worker, I think that adding it to the list
> > would be reasonable.
> 
> Done.

Great!

> > > > I agree that different platforms will have different cap/throttle
> > > > levels. But the code authors need to be able to check to see if
> > > > they hit it!
> > >
> > > Why?
> >
> > Because it is *very* common to take an "alternate" route if a thread
> > will not run immediately. Some use cases:
> >
> > * For a critical task, if I've hit the limit, I may choose to *not*
> > create a separate thread, and instead choose to run it in the primary
> > thread:
> >
> > if (Window.WorkerLimitMet) {
> >    eval(GetURL(url));
> > } else {
> >    createWorker(url);
> > }
> 
> I don't really buy that example (you'll hit network limits long before
> CPU limits for I/O tasks), and I can't really think of any realistic
> ones, so I'm not convinced of this use case.

I finally figured out why my use cases aren't making sense to you: most
of the code I write that uses threads is in the "parallel processing"
vein, while this concept is aimed more at "asynchronous processing". So
my experience is informed quite heavily by running, say, 1,000,000
copies of the same block of code against different values from an
array, while yours is informed by, say, fetching an XML file "behind
the scenes", parsing it, and displaying some values from it to the
user.

At this point, I think I should stop trying to come up with use cases,
and withdraw the ones I have already presented, since they try to do
things with Web Workers that the spec does not intend them to do.

Do you think it might make sense to provide guidance to developers in
the spec along the lines of, "Web Workers are good for XYZ; they are
not good for ABC"?

> > * For a time-sensitive, but unimportant task (say, putting up a
> > graphic "please wait" in response to user input that will only be
> > on the screen for a second or so), it is better to just bypass the
> > logic altogether than to wait on it:
> >
> > if (!Window.WorkerLimitMet) {
> >    createWorker(url);
> > }
> 
> You'd never use a worker for UI-related tasks, since the workers can't
> get to the UI. What realistic cases would there be for worker-level
> tasks that are unimportant enough that you could just not do them?

I keep forgetting that the worker does not have UI access! Outside of
that, something like real-time streaming tends to prefer that work be
dropped rather than arrive late and/or out of order (VoIP comes to
mind). But this is another example where my experience is in things
that Web Workers are not intended to handle.

> > * Some applications may very well wish to limit or restrict user
> > input until the queue can accept more work. For example:
> >
> > while (Window.WorkerLimitMet) {
> >    Form1.SubmitButton.Enabled = false;
> >    sleep(100);
> > }
> 
> Users are quite capable of noticing when their computer is under load;
> I don't think it makes sense to artificially limit how much work the
> computer can do like this.

Like Alfonso said, that is exactly when users tend to start pressing lots of
buttons... remember when the answer to everything was to hit ESC? ;)

> > If we can't dictate how many workers may run at once due to platform
> > limits, then developers need to know when they are at those limits.
> 
> We don't provide a way for applications to know when they hit other
> limits, and I don't really see this as special.

Yeah, that's true. If typical Web developers are OK with that, it's not
a big deal. I also recognize that the HTML DOM and its associated
systems have to be less complex than desktop and server systems for a
variety of reasons, and simply saying, "hey, if I can do it in C# in a
Windows Forms application, we need it in Web browser JavaScript" isn't
going to work. :)

> > Doing something onMouseOver() is a good example. If someone is wildly
> > waving their mouse, better to start dropping it than to queue up
> > workers. Think about this kind of code for a moment:
> >
> > onMouseOver = "createWorker(urlToScript)"
> >
> > user starts waving their mouse wildly...
> 
> I can't see _any_ valid reason to _ever_ create a worker from mouse
> movements. What possible use case could that have? Just create one
> worker and queue work up with it.

The worker queue system was still fuzzy to me when I wrote that (I'm
"getting" it, slowly); I didn't realize at the time how easy it is to
queue up events to a single worker, instead of firing a new worker for
each event.
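
A minimal sketch of what I now understand, written against the
Worker()/postMessage() shape the API later settled on (the
createWorker()/MessagePort spelling in this thread differs slightly,
and object messages assume the structured data passing discussed
below); element and expensiveComputation are hypothetical stand-ins:

// Page script: create ONE long-lived worker up front; mouse events
// only enqueue messages, so wild mouse-waving just grows the queue.
var hoverWorker = new Worker('hover-work.js');
hoverWorker.onmessage = function (e) {
   // Results come back on the main thread, where the UI lives.
   showResult(e.data); // hypothetical UI update
};

element.onmouseover = function (e) {
   hoverWorker.postMessage({ x: e.clientX, y: e.clientY });
};

// hover-work.js: the worker drains its queue one message at a time.
onmessage = function (e) {
   postMessage(expensiveComputation(e.data.x, e.data.y));
};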

> > > It could also create a worker, but run it slowly.
> >
> > It *could*, but that would be supremely dumb behavior; each thread
> > takes up space in memory, regardless of whether or not it is running.
> 
> Workers aren't _that_ expensive. If a worker is using 100% CPU on a
> core, you'll run out of cores long before you run out of memory.
> Running workers slowly (sharing cores) seems much more reasonable than
> not running them at all.

Workers as you've written and intended them are not. In some of the
multithreaded/parallel computing projects I've worked on, it has become
an issue (hence the existence of thread pools and other mechanisms that
limit thread creation and avoid CPU saturation). Since this isn't the
kind of work that Web Workers are designed for, I'll agree with you
that the use case I've presented here is not applicable.
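
That said, for anyone who does hit this, the classic answer works
without any WorkerLimitMet-style flag: keep your own pool. A rough
sketch, assuming the new Worker()/terminate() API the spec later
settled on; taskUrls and onResult are hypothetical names:

// A tiny worker pool: never more than POOL_SIZE workers alive at
// once; the rest of the work waits in an ordinary array.
var POOL_SIZE = 4;
var pending = taskUrls.slice(); // task script URLs still to run
var active = 0;

function runNext() {
   if (active >= POOL_SIZE || pending.length === 0) return;
   active++;
   var worker = new Worker(pending.shift());
   worker.onmessage = function (e) {
      onResult(e.data);
      worker.terminate(); // free the slot...
      active--;
      runNext();          // ...and immediately refill it
   };
}

for (var i = 0; i < POOL_SIZE; i++) runNext();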

> > > I don't know how we would even go about testing such requirements.
> >
> > That's why I suggest we define what a throttling mechanism is
> > allowed to do and what it is not allowed to do, and provide a
> > mechanism for detecting throttling and an overload of createWorker()
> > that accepts a timeout value. There is a reason why implementations
> > of various "thread pool" type objects provide this functionality,
> > and it isn't for the sake of needing extra documentation. :)
> 
> This may be something we'll have to add in future, but for now I really
> don't see this as something critical enough for the first version.

Agreed.

> > > > For example:
> > > >
> > > > for (i = 0; i < 1000000; i++) {
> > > >    arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i]);
> > > > }
> > > >
> > > > Yes, I know that it is an extreme example (not really, if you
> > > > want to do something to the individual pixels of an image in
> > > > parallel...), but it illustrates the problem well.
> > >
> > > How is this different to, say:
> > >
> > >    for (i = 0; i < 1000000; i++) {
> > >      arrayOfDocuments[i] =
> > >        document.implementation.createDocument(null, null, null);
> > >    }
> > >
> > > ...?
> >
> > It's the same at a technical level, but quite different from a
> > programmer's viewpoint. A programmer writing what you wrote expects
> > to create 1,000,000 objects, knows it before the code even runs, and
> > can make the decision to do it based on that information up front. A
> > programmer writing what I wrote does not know in advance how many
> > objects they are creating (they know that eventually 1,000,000
> > objects will have been created, but have no idea how many will be in
> > scope at any given time), and depending on the UA, it may or may not
> > run well. So it's a matter of perception, not a technical one.
> 
> I don't buy that. If you are firing 1000000 workers back to back, you
> don't expect them to complete quickly enough that you only have 10 or
> so active at a time. The whole point of workers is you use them for
> long computation; if they could return so quickly, then using workers
> is just adding unnecessary overhead.

That's true of any kind of multithreaded/parallel processing work; the
problem is that it is almost impossible to know at design time what
that balance will look like.

> > I'm stating that the spec needs to explicitly state that this is
> > *undefined* and up to the UA.
> 
> It already does:
> 
> # User agents may impose implementation-specific limits on otherwise
> # unconstrained inputs, e.g. to prevent denial of service attacks, to
> # guard against running out of memory, or to work around
> # platform-specific limitations.
>  -- http://www.whatwg.org/specs/web-workers/current-work/#conformance

Ah, OK, thanks! I missed that!

> > > This seems unlikely. All use cases I can think of for running many
> > > scripts will all be running the same one (or few) scripts, not many
> > > many different ones.
> >
> > Since as far as I can tell, the only way to pass parameters to these
> > scripts is via the URL itself, I think that you are missing out.
> 
> You can pass parameters using postMessage().

That passes data into the worker's own queue. But what if you need to
pass data to the server when you fetch the URL? For example, the code
at that URL might be dynamically generated by the server. I've done
things like this before, and in certain situations it makes a lot of
sense. In those situations, you want to be able to POST to the server
to retrieve the script used to create the Web Worker.
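
A sketch of the shape I mean. Nothing in the current draft supports
it; Blob and URL.createObjectURL below are later additions that
eventually made exactly this possible, so treat this as illustration
only:

// POST the parameters to the server, get the generated worker source
// back, and spin a worker up from that source.
var xhr = new XMLHttpRequest();
xhr.open('POST', '/scripts/generate-worker', false); // sync for brevity
xhr.setRequestHeader('Content-Type',
   'application/x-www-form-urlencoded');
xhr.send('colorCode=' + encodeURIComponent('#ff0000'));

var blob = new Blob([xhr.responseText], { type: 'text/javascript' });
var worker = new Worker(URL.createObjectURL(blob));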

> > Let's say you want to do some image processing, so you're going
> > through the pixels of an image:
> >
> > var sBaseURL =
> >    'http://www.domain.com/scripts/pixelprocess.aspx?colorCode=';
> >
> > for (x = 0; x < image.width; x++) {
> >    for (y = 0; y < image.height; y++) {
> >       messagePorts[x][y] = createWorker(sBaseURL +
> >          image.pixels[x][y].color);
> >    }
> > }
>
> Good lord, don't do that.
> 
> Just shard the image into a few pieces and postMessage() the data from
> each shard to a worker. Creating one worker per pixel of an image is
> completely ridiculous.

Sometimes it isn't. :) Indeed, if the algorithm is non-deterministic in
terms of the horsepower needed per iteration, one thread per pixel is
the most efficient way of doing it.

But, that being said, I highly doubt that anyone will be using Web
Workers like this (as you've said) anyway, so if a rare edge case is
possibly less efficient... oh well. :)
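
For the common case, though, your sharding suggestion sketches out
simply enough. Here getImagePixels and handleShardResult are
hypothetical stand-ins, and shipping arrays in messages again assumes
structured data passing:

// Split the pixel data into a few shards and hand each shard to one
// worker, rather than creating one worker per pixel.
var NUM_WORKERS = 4;
var pixels = getImagePixels(image); // e.g. a flat array of colors
var shardSize = Math.ceil(pixels.length / NUM_WORKERS);

for (var i = 0; i < NUM_WORKERS; i++) {
   var worker = new Worker('pixelprocess.js');
   worker.onmessage = handleShardResult; // reassemble on main thread
   worker.postMessage({
      offset: i * shardSize,
      data: pixels.slice(i * shardSize, (i + 1) * shardSize)
   });
}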

> > > This again is just a limitation of IE's implementation. (Though one
> > > has to wonder, why would you generate a URL of more than 32KB?
> > > Wouldn't it make more sense to generate the part that changes, and
> > > then fetch the rest as part of an importScripts() call?)
> >
> > You wouldn't want to generate a *URL* of more than 32 KB, but you
> > quite often have a *script* of more than 32 KB!
> 
> You wouldn't have 32KB of script that changes each time. You'd just
> have a small bit of code changing each time, and the rest could be
> imported, and not part of the URL.

For most cases, yes.
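
Concretely, the generated piece can be a few lines while the bulk
stays static and cacheable. A sketch, with pixelprocess-lib.js and
processPixels() as hypothetical names:

// generated-stub.js (produced per-request by the server): only the
// parameter line actually varies, so the URL stays tiny.
var colorCode = '#ff0000';

// importScripts() synchronously pulls the static 32 KB+ of shared
// code into the worker's scope.
importScripts('pixelprocess-lib.js');

// Entry point defined by the imported library.
processPixels(colorCode);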

> > I'm finding that an absolutely huge hole in this implementation is in
> > passing initial parameters. The only way I can see to pass parameters
> > in is with the message port system. The example in the current draft
> > involves a whopping TEN (10), yes, TEN (10) lines of code in order to
> > extract TWO (2) parameters as initial input. That is simply
> > unacceptable.
> 
> This will be solved when we allow structured data passing later.

Excellent! At this point, this is the big problem with the existing
draft; the rest of the stuff is fairly irrelevant. We really do not
need a robust parallel processing framework in the HTML DOM, against my
personal desires. :)
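
In the meantime, a single JSON init message collapses those ten lines
to a couple. A sketch, assuming a JSON library (e.g. json2.js) on both
sides and a hypothetical start() entry point:

// Page side: all initial parameters travel in one message.
var worker = new Worker('task.js');
worker.postMessage(JSON.stringify({ width: 640, height: 480 }));

// task.js: both parameters recovered at once.
onmessage = function (e) {
   var params = JSON.parse(e.data);
   start(params.width, params.height);
};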

J.Ja
