RE: Workers

> -----Original Message-----
> From: Ian Hickson [mailto:ian@hixie.ch]
> Sent: Friday, August 08, 2008 4:29 AM
> To: Justin James; Andrew Fedoniouk
> Cc: public-html@w3.org
> Subject: RE: Workers
> 
> On Mon, 21 Jul 2008, Justin James wrote:
> > >
> > > It's exactly how the Web works today:
> > >
> > >    <script src="url"></script>
> >
> > QUESTION: How well would a "javascript: URL" work with your approach?
> If
> > the answer is "Great!" (I suspect so), then ignore the next
> paragraph.
> 
> If you mean as the URL for a createWorker() call, it would "work", in
> that
> it would take the URL, evaluate it in the original context, get a
> string
> out of it, and then pass that string to the new thread to be
> interpreted
> as a JavaScript program.

That sounds like the behavior that I would expect and desire.
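
If that is indeed the behavior, here is a minimal sketch of the sort of call
I have in mind (createWorker() as in the current draft; the javascript: URL
simply evaluates to the string that becomes the worker's program text):

// Sketch only: the javascript: URL is evaluated in the creating context,
// and the resulting string is handed to the new thread as its program.
var port = createWorker("javascript:'var result = 6 * 7; /* worker code */'");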

> > Not the *entire* world, which is my point. There are still tons of
> > people in-lining code into the HTML. Heck, look at what ASP.Net spews
> > out. Even using data: URLs, their code suddenly gets very, very ugly
> > (*uglier* if you ask me) very quickly.
> 
> If we find this to be the case, then we can definitely revisit this,
> but I
> am somewhat skeptical at this point that authors will want to inline
> their
> thread code. As noted before, this has been experimentally implemented
> in
> browser extensions before, and none of the authors using that
> experimental
> implementation liked the idea of having to provide script text, they
> all
> wanted to provide a URL instead.

It really is a "different strokes for different folks" scenario. While I
agree that inline code is nearly always less preferable than code in a
separate file, some people, for any number of reasons, still inline code on
purpose. I say, give 'em the rope to hang themselves; it's not up to us to
force their stuff into a particular architecture. :)

> > > The fetching of the resource happens asynchronously.
> >
> > Like Andrew Fedoniouk, I completely missed the part of the spec
> (right
> > about the numbered list) that specified the "separate and parallel
> > execution" of those steps. I think that, at the very least, the draft
> > should be updated to make this very clear. Would it be possible to
> make
> > the first step in the list, "create a separate thread, and execute
> the
> > rest of the steps in that thread", or somehow otherwise make this
> really
> > obvious? Without including the creation of the thread in the steps,
> it
> > is easy to overlook it at part of the process.
> 
> I was going to add a note, but then I noticed that it actually already
> says it twice -- once in the "create a worker" algorithm, and once in
> the
> "run a worker" algorithm. What should I add to make it clearer?

I think that making it step #1 in the enumerated list would do the trick.
The last time I looked at it, I realized that the reason I kept missing it
is that I was looking at the list to see what was happening, but it is in
the paragraph before the list. Since it *is* a step in creating the worker,
I think that adding it to the list would be reasonable.

> > I agree that different platforms will have different cap/throttle
> levels.
> > But the code authors need to be able to check to see if they hit it!
> 
> Why?

Because it is *very* common to take an "alternate" route if a thread will
not run immediately. Some use cases:

* For a critical task, if I've hit the limit, I may choose to *not* create a
separate thread, and instead run it in the primary thread:
if (Window.WorkerLimitMet) {
   // Throttled: run the script on the primary thread instead
   // (GetURL() here stands for a synchronous fetch of the script text)
   eval(GetURL(url));
} else {
   createWorker(url);
}

* For a time-sensitive but unimportant task (say, putting up a graphic
"please wait" in response to user input that will only be on the screen for
a second or so), it is better to just bypass the logic altogether than to
wait on it:
if (!Window.WorkerLimitMet) {
   // Only bother if a worker can start promptly; otherwise skip the task
   createWorker(url);
}

* Some applications may very well wish to limit or restrict user input until
the queue can accept more work. For example (pseudocode; script has no real
sleep()):
while (Window.WorkerLimitMet) {
   Form1.SubmitButton.Enabled = false;
   sleep(100);
}
Form1.SubmitButton.Enabled = true;

Those are three really good scenarios that I can think of in a few minutes.
If we can't dictate how many workers may run at once due to platform limits,
then developers need to know when they are at those limits.

> > Some code may want to treat hitting the throttle as an error
> condition,
> > other may want to simply ignore it.
> 
> It seems like it would be bad for code to ever stop working just
> because
> the user has throttled its CPU usage.

Not "not working", simply decided to not do something. It's the same reason
why VoIP and other streaming media systems use UDP instead of TCP. If a
packet is dropped, better to ignore it an continue than to attempt to
recover from it. In some applications, waiting to do something is worse than
not doing it at all, especially for time critical tasks. Doing something
onMouseOver() is a good example. If someone is wildly waving their mouse,
better to start dropping it than to queue up workers. Think about this kind
of code for a moment:

onMouseOver = "createWorker(urlToScript)"

The user starts waving their mouse wildly...

The browser launches, say, 10 threads and throttles, so the 11th MouseOver
event blocks. Meanwhile, the user keeps waving the mouse, which queues up
hundreds more MouseOver events in the event queue. Result? A hung browser.
Letting the developer detect the throttle condition alleviates this kind of
problem.
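
A sketch of the kind of guard I have in mind (Window.WorkerLimitMet is, of
course, a property I am proposing, not something in the current draft):

element.onmouseover = function () {
   // If the UA is already throttling, drop this event rather than queuing
   // yet another worker behind the backlog
   if (Window.WorkerLimitMet) {
      return;
   }
   createWorker(urlToScript);
};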

> > Also, the spec needs to clearly enunciate *precisely* the way a
> throttle
> > is implemented, so at the very least, all browsers handle it the
> same.
> > Does hitting throttle cause the call to create the worker to block?
> Or
> > is the worker created, but execution is delayed, which would allow
> the
> > calling code to continue (at the expense of the memory used by a
> > separate object)? Or does it throw an error (such as "cannot create
> > thread")?
> 
> It could also create a worker, but run it slowly.

It *could*, but that would be supremely dumb behavior; each thread takes up
space in memory, regardless of whether or not it is running.

> There's no way to test what a user agent is doing here. We can't know
> if
> it's JS implementation just happens to block on calls to createWorker()
> sometimes because that's where it happens to run a very expensive GC
> process, for instance. It's also very difficult to specify what should
> happen in the case of hardware limitations. Historically, we've left
> that
> mostly undefined, and, unlike most other things left undefined, this
> one
> hasn't been a huge problem.
> 
> I don't know how we would even go about testing such requirements.

That's why I suggest we define what a throttling mechanism is allowed to do
and what it is not allowed to do, and provide both a mechanism for detecting
throttling and an overload of createWorker() that accepts a timeout value.
There is a reason why implementations of various "thread pool" type objects
provide this functionality, and it isn't for the sake of extra
documentation. :)
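
To be concrete, the additions I am proposing amount to something like this
(the names are mine and purely illustrative):

// Proposed, not in the draft:
//    Window.WorkerLimitMet      : true while the UA is throttling worker creation
//    createWorker(url, timeout) : overload; returns null if the worker could
//                                 not be created within 'timeout' milliseconds
var port = createWorker(url, 1000);
if (port == null) {
   // Throttled past our tolerance; fall back, retry later, or skip the task
}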

> > For example:
> >
> > for (i = 0; i <= 1000000; i++) {
> > arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i]);
> > }
> >
> > Yes, I know that it is an extreme example (not really, if you want to
> do
> > something to the individual pixels of an image in parallel...), but
> it
> > illustrates the problem well.
> 
> How is this different to, say:
> 
>    for (i = 0; i <= 1000000; i++) {
>      arrayOfDocuments[i] = document.implementation.createdocument(null,
> null, null);
>    }
> 
> ...?

It's the same at a technical level, but quite different from a programmer's
viewpoint. A programmer writing what you wrote has the expectation that they
are creating 1,000,000 objects, knows it before the code even runs, and can
make the decision to do it based on that information up front. A programmer
writing what I wrote does not know in advance how many objects they are
creating (they know that eventually 1,000,000 objects will have been
created, but have no idea how many will be in scope at any given time), and
depending on the UA, it may or may not run well. So it's a matter of
perception, not a technical one.

> > On the other hand, without our draft explicitly stating how a browser
> > should perform a throttle, the developer has no clue how to write
> code
> > and take the possibility of throttling into account.
> 
> The best way is probably to pretend it never happens, and let the
> browsers deal with it in a graceful manner.

In my time developing applications that use multithreading, it has been
*critical* to understand, at the time of writing the code, if and how thread
creation blocks. That million-worker example above is a great illustration.
If you write it assuming that the UA will block when a throttle is reached,
you're in great shape. But then a UA comes along that *doesn't* block on
thread creation, it just blocks the workers from running. Now you've got a
million objects created on the stack, which will (from personal experience)
create a fairly miserable user experience. If you know up front that no
throttling will occur, then you need to cobble together a fairly complicated
and potentially unreliable system: a globally visible object (like a
semaphore) that lets you figure out how many threads *are* running, a spin
lock waiting on that semaphore so that threads can elect themselves to start
execution, and a guess, made in code, at how many simultaneous threads are
acceptable. It's a total disaster, especially since in this particular
threading model (more like a very weak process model) there are no shared
objects. Therefore, it is *imperative* that UAs throttle. And a throttling
UA really needs to expose some of the details of the throttling to code, so
that the code can adapt and adjust to run-time realities.
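
To illustrate how ugly that gets, here is roughly what authors would be
reduced to (a sketch only; note that the "limit" is pure guesswork because
the page has no way to ask the UA, and it assumes every worker remembers to
post a message back when it finishes):

var iRunningWorkers = 0;
var iGuessedLimit = 4; // pure guesswork; the page cannot ask the UA

function tryCreateWorker(url) {
   if (iRunningWorkers >= iGuessedLimit) {
      // Can't truly spin-wait without hanging the page, so re-try on a timer
      setTimeout(function () { tryCreateWorker(url); }, 100);
      return;
   }
   iRunningWorkers++;
   var port = createWorker(url);
   port.onmessage = function (event) {
      // Relies on the worker posting a "done" message when it finishes
      iRunningWorkers--;
   };
}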

> > I propose the following changes to make this situation (and other
> > pitfalls of parallel execution) far less dangerous to authors:
> >
> > * Overloading the createWorker() method is to accept a time span (in
> > ticks or milliseconds) as a timeout value.
> 
> That seems very dangerous, what if you're running on a machine where
> everything is just slower, and the timeout would always be hit?

The code developer picks a timeout appropriate for their application. If you
know that the work you are doing can keep a CPU busy for a while, pick a
high timeout number. Alternatively, a code author can write code that uses
"experience" to determine an appropriate timeout value. For example, each
time the timeout is hit and an iteration is "dropped" (or whatever the code
wants to do when it hits the timeout), it could calculate what percentage of
its iterations have hit the timeout, and if that percentage is high, slowly
start to increase the timeout. For example:

var iThrottledCreations = 0;
var iTimeout = 500; // 0.5 seconds
var fAcceptableThresholdPercentage = 0.25; // Allow up to 25% of workers to throttle before increasing the timeout
var fIncreaseAmount = 0.1; // Increase the timeout by 10% each time we bump it up

for (var iCounter = 0; iCounter <= 1000000; iCounter++) {
   messagePorts[iCounter] = createWorker(urls[iCounter], iTimeout);

   if (messagePorts[iCounter] == null) {
      // Worker creation timed out
      iThrottledCreations++;

      if (iThrottledCreations / (iCounter + 1) > fAcceptableThresholdPercentage) {
         iTimeout += Math.ceil(iTimeout * fIncreaseAmount);
      }
   }
}

See how nifty this little piece of code is? It keeps the timeout as low as
needed to maintain a decent user experience *on that platform* and rapidly
responds to conditions. It basically says, "it's OK to throttle within
reason, but past a certain point, I prefer to block" in a fairly simple
manner.

> > * Defining user agent "throttling" as *always* blocking when calling
> the
> > createWorker() method.
> 
> What if the user agent knows that it could handle the load effectively
> by
> farming off the thread to another computer altogether?

Um, how many UAs support grid computing? And if a UA could do some sort of
distributed computing, then it wouldn't throttle; it would just have the
remote resource run the worker.

> Also, how would we test that this requirement was being met?

Not sure I understand the question. Which requirement?

> > * Make it clear in the draft that just because the environment has
> > returned from createWorker() does *not* mean that execution of the
> > worker logic has started. Language such as, "creation of a worker
> object
> > does not guarantee that the worker is executing, it only guarantees
> that
> > the worker has been queued for execution" should be extremely
> helpful.
> > This is very, VERY important!
> 
> Again, how could we tell the difference between the thread having
> started
> but going really really slowly, and the thread being paused altogether
> and
> not having started?

We can't. But this is something that developers need to know is completely
undefined (the order of thread execution and completion). Either that, or we
need to define it explicitly. All I am asking for is a disclaimer added to
the existing text (assuming that the order of worker execution is
non-deterministic, which is your intent, I believe) saying, "hey buddy,
don't expect the workers to run in the order that they are created, and
don't expect that the creation of a worker implies that it will begin to run
immediately." Either that, or we define that behavior. By the way, if we are
going to say that creation order will not affect the order of execution,
then some people are going to clamor for a scheduling/priority system (see
the discussion on this very topic in the new Parallel Extensions Library
for .Net: http://blogs.msdn.com/pfxteam/archive/2008/08/01/8800195.aspx).
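
For example, nothing stops an author from writing something like this, which
silently assumes both immediate start and in-order execution (illustrative
only; the draft guarantees neither):

createWorker(urlThatWritesRecordOnServer);   // assumed to run first...
createWorker(urlThatReadsRecordFromServer);  // ...and this one second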

> > * Make it clear in the draft that queued workers do not necessarily
> > begin execution in the same order that they were created.
> 
> Again, how could we test this or prove it either way?
> 
> We can't require things we can't test, because we have to prove that
> every
> requirements has been met to finish our specification, as part of the
> W3C
> process. (That's one reason it's going to take so long.)

You are reading me completely backwards, I suspect. I agree that we *can't*
test or require any kind of behavior like this, which means that the spec
needs to make it *explicit* that these things are not defined. Otherwise,
developers will fall into three groups, with disastrous results (for the
first 2 groups):

* Those who make an assumption about these things
* Those who test in one or two UAs, and then attribute that behavior to
*all* UAs
* Those who waste a bunch of time testing in all UAs

> > * Make it clear in the draft that the environment does not have to
> > devote equal resources to each worker. Therefore, even if Worker A
> > started execution before Worker B, and that they should both take the
> > same amount of time to execute, Worker B could finish long before
> Worker
> > A. Just because. :)
> 
> Again, I don't see how we could test this. We already allow user agents
> to
> pretty much do anything to handle hardware limitations.

Yes. Again, you've read me backwards. I'm stating that the spec needs to
explicitly state that this is *undefined* and up to the UA.

> > What the first two changes accomplish, is the ability of the previous
> > example to be modified like so:
> >
> > for (i = 0; i <= 1000000; i++) {
> > //Assuming timeout is measured in milliseconds
> > arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i], 1000);
> > if (arrayOfMessagePorts[i] == null) {
> > //Throttling is occurring, try waiting a while and re-creating the
> worker!
> > //Or just exit, if it is not a big deal.
> > }
> > }
> 
> Why would that be better? Surely it would be better or the user agent
> to
> do all the throttling itself instead.

Sometimes you don't want to wait past a certain time for a thread to be
created, and in some situations you may very well want to do something (like
letting the user know to be patient) and then re-try queuing it. Imagine,
for example:

for (var iCounter = 0; iCounter < 10; iCounter++) {
   messagePorts[iCounter] = createWorker(urls[iCounter], 1000);

   if (messagePorts[iCounter] == null) {
      form.status1.text = 'Backlog! Not all items will be processed.';
      break;
   }
}

> > This allows applications to graceful recover, slow down, or whatever
> is
> > needed in the case of a throttling scenario, without jamming up the
> > whole system.
> 
> It seems like we should leave that up to the browser.

Not at all. The UA's responsibility is to throttle as appropriate to the
platform and conditions. The application, on the other hand, needs to be
able to respond to that throttling in a manner that is consistent with that
application's needs. The following use cases for workers all would need to
react differently to a throttle condition:

* Something in which all items must be processed and the user has no
expectations on when it will be completed

* Something in which the processing of many items must be performed as
closely to real-time as possible; throttling beyond a certain amount renders
the work irrelevant, so it is better to "drop iterations" under throttle
conditions than to further compound the overload by demanding that they all
be processed

* Something in which all items must be processed, but under throttling
circumstances the application will want to do something exceptional, such as
notifying the user that throttling is occurring, trying an alternative
resource, and so on (see the sketch below).
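
A sketch of that third case (again using my proposed timeout overload, with
my own illustrative names):

var port = createWorker(primaryUrl, 1000);

if (port == null) {
   // Throttled: tell the user and fall back to a lighter-weight resource
   form.status1.text = 'Busy; results may be delayed.';
   port = createWorker(fallbackUrl, 5000);
}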

> On Tue, 22 Jul 2008, Justin James wrote:
> >
> > I can see why this makes sense, but I find it hard to believe that
> > browser vendors cannot build a thread system into their JavaScript
> > interpreters that does this sensibly, without needing to use a
> separate
> > document. They can download files in a separate thread, I am not sure
> > why they can't have a multithreaded JavaScript system too.
> 
> I shall let the browser vendors speak to this.

I'm interested in hearing their view on it.

> > * In some cases, particularly workers created during onLoad, there
> will
> > be such severe contention for the download queue that it will take
> > longer to run the script from the remote URL than a self-stored
> script.
> 
> This seems unlikely. All use cases I can think of for running many
> scripts
> will all be running the same one (or few) scripts, not many many
> different
> ones.

Since, as far as I can tell, the only way to pass parameters to these scripts
is via the URL itself, I think that you are missing something. Let's say you
want to do some image processing, so you're going through the pixels of an
image:

var sBaseURL = 'http://www.domain.com/scripts/pixelprocess.aspx?colorCode=';

for (var x = 0; x < image.width; x++) {
   for (var y = 0; y < image.height; y++) {
      messagePorts[x][y] = createWorker(sBaseURL + image.pixels[x][y].color);
   }
}

Now, in this scenario, it could easily be calling 1,000,000 URLs, all of
them unique, or at least a great many of them.

Otherwise, you need to provide a parameter-passing system. And even that,
with the current message-passing system, is dicey and complicated.
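
If the URL really is the only parameter channel, then at a minimum authors
will end up writing helpers like this just to get values in safely (standard
JavaScript, nothing proposed here; the parameter names are illustrative):

function workerUrl(baseUrl, params) {
   // Build a query string from a set of named parameters
   var parts = [];
   for (var name in params) {
      parts.push(encodeURIComponent(name) + '=' + encodeURIComponent(params[name]));
   }
   return baseUrl + '?' + parts.join('&');
}

createWorker(workerUrl('http://www.domain.com/scripts/pixelprocess.aspx',
                       { colorCode: pixel.color, x: x, y: y }));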

There is a reason why all of the worker systems that Microsoft introduced in
.Net 2.0 got so little uptake (other than providing "Cancel Button"
functionality...), and why the Parallel Extensions Library blows them away:
with the older worker systems it is difficult for many people to wrap their
heads around the concepts of parameters and return values, while the
Parallel Extensions Library makes these things much more like a standard
method call.

> > * What happens in regard to caching? Let's say, for example, that a
> > mouse over operation spawns a worker, and the user is madly waving
> their
> > mouse about. Does the browser re-download the script each time? If
> so,
> > you are going to see a TON of lag. Or does it cache the script? Can
> the
> > developer request that the item be cached or override the cache? If
> so,
> > what mechanism is available for that?
> 
> HTTP defines how caching works already.

HTTP does *not* define how caching works for non-HTTP URLs, though. :)

> > * What happens if, for whatever reason, the script contents should be
> > generated by the server, and GET is inadequate? Is there a method to
> do
> > a full POST to get the script from a server?
> 
> No. What would the use case be for doing a post submission as part of
> obtaining a script? That seems quite odd.

Not if you want to post a file to the server, or to provide a large amount
of information for the server to use in generating the script, information
that you may not want to trust to the undefined limits of a GET URL.

> > * What is considered a "failure" to download the script? Will the
> > download system follow redirects, for example?
> 
> Again, HTTP defines this.

Fair enough on this one.

> On Tue, 22 Jul 2008, Justin James wrote:
> >
> > I think that it is critical that you read this white paper from
> > Microsoft regarding data: URLs in IE 8:
> >
> >
> http://code.msdn.microsoft.com/Release/ProjectReleases.aspx?ProjectName
> =ie8w
> > hitepapers&ReleaseId=575
> >
> > Some highlights:
> >
> > * Data URLs are not allowed to contain scripts
> 
> This is a limitation of IE's implementation, and not a fundamental
> limitation of data: URLs.

Yes, but it renders data URLs nearly useless in this context.

> > * There is a 32 KB cap on data: URLs
> 
> This again is just a limitation of IE's implementation. (Though one has
> to
> wonder, why would you generate a URL of more than 32KB? Wouldn't it
> make
> more sense to generate the part that changes, and then fetch the rest
> as
> part of an importScripts() call?)

You wouldn't want to generate an *URL* of more than 32 KB, but you quite
often have a *script* of more than 32 KB!

> > * Something we both missed in the data: URL RFC is that the only
> > characters allowed in a data: URL are characters allowed in an URL;
> this
> > makes putting useful code in a data: URL impossible.
> 
> Characters can be escaped using %xx escapes, so this isn't a problem.

I don't think *anyone* wants to write JavaScript (or any other language) and
then escape it. Remember, for these items we are talking about a scenario
where the data: URL under consideration is actually JavaScript, as per your
suggestion that using a data: URL would allow inline scripts to be run via
the worker mechanism.
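
To see why, take even a trivial worker body and escape it the way the RFC
requires; the result is nothing anyone wants to write or maintain by hand
(standard escaping, shown purely for illustration):

var source = "var total = 0; for (var i = 0; i < 10; i++) { total += i; }";

// What actually has to go into the data: URL:
var url = "data:text/javascript," + encodeURIComponent(source);
// -> "data:text/javascript,var%20total%20%3D%200%3B%20for%20(var%20i%20..."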

I'm finding that an absolutely huge hole in this implementation is the
passing of initial parameters. The only way I can see to pass parameters in
is with the message port system. The example in the current draft involves a
whopping TEN (10), yes, TEN (10) lines of code in order to extract TWO (2)
parameters as initial input. That is simply unacceptable.

At this point, I think the following items are needed to make this a truly
flexible, useful, and usable system:

* Add text to the spec that indicates that the order of thread execution is
possibly non-deterministic, depending upon the UA's implementation

* Add text making it clear that a UA may create a worker long before
actually executing it, depending upon the UA

* Find a really good way to pass parameters to the worker's initial runtime
context; my suggestion is to allow an array of objects to be passed in by
value as a "seed" environment to a variable local to the worker object (see
the sketch after this list)

* Specify that an explicit throttling system must block the worker's
creation, as opposed to allowing the worker to be created but not actually
run; a UA may set the threshold for throttling as high as it likes,
including "unlimited" (which effectively allows unlimited workers to be
created, without starting many of them); note that this does not preclude a
"soft throttle" that some UAs may choose to implement (although doing so
would be very foolish)

* Define a system for UAs to indicate whether or not throttling is occurring
(very optional)

* Define an overload for createWorker that accepts a timeout (for the
creation of the worker)

* The draft as it currently stands is really difficult to follow, especially
in some critical areas. Section 2.5, step 8 is a very good example - it
involves jumping to a whopping FIVE different areas in order to understand
it (some of which, in turn, have further back references); the average
person cannot do back references like that and really make sense of it

* Passing strings for messages is creating some clumsy casting issues, as
your examples make clear: "end = 1*event.message;" stands out. Any chance of
passing an object and not a string? Or would that cause reference/value
problems?
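
To make the "seed environment" suggestion above concrete, here is the sort
of call I have in mind (entirely my own invention, including the variable
name workerArguments; none of this is in the draft):

// Proposed: a third argument, copied *by value* into a variable local to
// the worker before the worker's code runs
createWorker('http://www.domain.com/scripts/pixelprocess.js', 1000,
             [ { colorCode: pixel.color, x: x, y: y } ]);

// Inside the worker, something like:
//    var colorCode = workerArguments[0].colorCode;
// No ten-line message-port handshake just to receive two initial parameters.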

Hope this helps!

J.Ja
