RE: Workers from Justin James on 2008-07-21 (public-html@w3.org from July 2008)

From: Justin James <j_james@mindspring.com>
Date: Mon, 21 Jul 2008 11:50:24 -0400
To: "'Ian Hickson'" <ian@hixie.ch>
Cc: <public-html@w3.org>
Message-ID: <00a001c8eb49$7d849500$788dbf00$@com>
> -----Original Message-----
> From: public-html-request@w3.org [mailto:public-html-request@w3.org] On
> Behalf Of Ian Hickson
> Sent: Sunday, July 20, 2008 2:29 AM
> To: Justin James
> Cc: public-html@w3.org
> Subject: RE: Workers
> On Sun, 20 Jul 2008, Justin James wrote:
> > >
> > > How would you communicate with such a mechanism?
> >
> > I suppose it could take a second argument for a thread-safe messaging
> > object.
> 
> That's basically what MessagePorts are, and basically how
> createWorker()
> works, except that it creates the ports for you as a convenience:

Yup, I know.

> > >    var port = createWorker(url);
> >
> > Yes, I am sure that if I saw the world from the eyes of the Gears
> team,
> > that might seem like the best way to do it. But I'm from a more
> > traditional background, and frankly, the idea of passing an URL to a
> > script seems incredibly backwards and clumsy. Offhand, I cannot
> recall
> > ever seeing a system of any sort where you basically say, "execute
> the
> > code located by this reference".
> 
> It's exactly how the Web works today:
> 
>    <script src="url"></script>

QUESTION: How well would a "javascript: URL" work with your approach? If the
answer is "Great!" (I suspect so), then ignore the next paragraph.

Not the *entire* world, which is my point. There are still tons of people
in-lining code into the HTML. Heck, look at what ASP.Net spews out. Even
using data: URLs, their code suddenly gets very, very ugly (*uglier* if you
ask me) very quickly.
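
To make the ugliness concrete, here is a minimal sketch of what an author
with in-line code would have to do before handing it to createWorker(). The
worker body and the toDataUrl() helper are made up for illustration; only
encodeURIComponent() is a standard function.

```javascript
// Hypothetical in-line worker source; the body is only a placeholder.
var workerSource = 'var total = 0;\nfor (var i = 0; i < 10; i++) { total += i; }';

// Illustrative helper (not part of any draft): wrap script text in a data: URL.
// encodeURIComponent() percent-escapes whitespace, semicolons, braces and
// most other punctuation, which is what makes the result hard to read.
function toDataUrl(source) {
  return 'data:text/javascript,' + encodeURIComponent(source);
}

var url = toDataUrl(workerSource);
// The URL would then be passed along, e.g. createWorker(url) per the draft.
```

Even for two lines of script, the resulting URL is mostly percent escapes.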

> > I want to see a *function* for executing the work in a thread (even
> if
> > it is a method of the Window object), not a "WindowWorker object"
> with a
> > hidden/invisible "Execute" method.
> 
> The WindowWorker object isn't how you execute code, it's just the
> global
> object. Whatever mechanism we use, we have to have a global object.

Yeah, I later figured this out. This draft is *really* difficult to follow,
and reading the thread with Andrew, I am not the only one who is having a
hard time reading it. I can't really put my finger on it, but it feels more
like the summary of a conversation between a group of people who already
intimately know the subject and just need to have it on paper than an actual
spec. I know, that's why it's a draft and not the final form. :)

> As far as I can tell, data: URLs of megabytes in length work fine in
> all
> major shipping browsers that support data: URLs. Can you give an
> example
> of a major browser that supports data: URLs but doesn't support long
> enough data: URLs to handle the script you want to handle? (And why
> would
> you have that script in text form instead of accessible from a URL?)

I got burned so many times in the mid-90's by browser URL length problems,
that I have not tried to exceed 255 characters in an URL since then.

What you are saying, though, is this: code that works correctly in one 100%
spec-compliant browser may not work correctly in a different 100%
spec-compliant browser. And that is not an acceptable situation.

The more I write and revise my responses to your message, the more I realize
that probably 99% of my objections are caused by the lack of a proper
specification around data: URLs. I have submitted a spec proposal via the
bug tracker to add it to the HTML spec. :)

> I respect your opinion, but practical experience from actual Web
> authors
> writing code with experimental Workers implementations have more
> weight. :-)

I never discounted the Gears team's experience. I'm just saying that the
world is a lot bigger than their experience. There is a TON of existing
practical experience showing the value of self-storing/modifying/generating
code out there. The fact that no one is actually doing it with this
experimental worker implementation is probably related to the fact that
maybe 30 (or fewer!) people on the planet are working with experimental
worker implementations, and that those use cases are less common,
particularly on the Web, than what the Gears team is doing.

Regarding your responses to my example use cases (no need to go through them
individually)... your responses are all valid. While writing a response to
each one, I realized that you will probably never see eye-to-eye with me on
it, because we hail from different backgrounds. For me, doing things in a
dynamic/functional language way is fairly intuitive and natural. It is clear
that you approach these things much more from the angle of a static
language. There is nothing inherently better or worse about either
viewpoint, either.

At this point, like I said, the only thing I disagree with about creation
from a URL is the fact that the data: URL spec is so problematic.

> > One final note on the existing draft: I also find it problematic that
> > the locating and procurement of the script located with the URL
> > parameter does *not* occur in a separate thread. Considering that in
> a
> > many cases, the HTTP connect/transmit/close cycle takes a huge
> portion
> > (if not the majority) of the total execution time, I think you lose
> most
> > of the intended benefit by having that be in the main thread.
> 
> I think you're misreading the spec. The fetching of the resource
> happens
> asynchronously.

Like Andrew Fedoniouk, I completely missed the part of the spec (right above
the numbered list) that specified the "separate and parallel execution" of
those steps. I think that, at the very least, the draft should be updated to
make this very clear. Would it be possible to make the first step in the
list, "create a separate thread, and execute the rest of the steps in that
thread", or somehow otherwise make this really obvious? Without including
the creation of the thread in the steps, it is easy to overlook it as part
of the process.

> Browsers are allowed to throttle the code as much as they like. We
> can't
> really do anything else since user agents run on such varied hardware
> that
> there's no way to really guarantee particular performance
> characteristics
> anyway.

I agree that different platforms will have different cap/throttle levels.
But the code authors need to be able to check to see if they hit it! Some
code may want to treat hitting the throttle as an error condition, others may
want to simply ignore it. Also, the spec needs to clearly enunciate
*precisely* the way a throttle is implemented, so at the very least, all
browsers handle it the same. Does hitting the throttle cause the call to create
the worker to block? Or is the worker created, but execution is delayed,
which would allow the calling code to continue (at the expense of the memory
used by a separate object)? Or does it throw an error (such as "cannot
create thread")? For example:

for (var i = 0; i < 1000000; i++) {
    arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i]);
}

Yes, I know that it is an extreme example (not really, if you want to do
something to the individual pixels of an image in parallel...), but it
illustrates the problem well. If the createWorker(URL) method does not
block, you can easily trash the RAM situation by creating a million thread
objects like this. From my experience doing just this in other languages, I
can tell you that it gets ugly. On the other hand, without our draft
explicitly stating how a browser should perform a throttle, the developer
has no clue how to write code and take the possibility of throttling into
account.

I propose the following changes to make this situation (and other pitfalls
of parallel execution) far less dangerous to authors:

* Overloading the createWorker() method to accept a time span (in ticks
or milliseconds) as a timeout value.

* Defining user agent "throttling" as *always* blocking when calling the
createWorker() method.

* Make it clear in the draft that just because the environment has returned
from createWorker() does *not* mean that execution of the worker logic has
started. Language such as, "creation of a worker object does not guarantee
that the worker is executing, it only guarantees that the worker has been
queued for execution" should be extremely helpful. This is very, VERY
important!

* Make it clear in the draft that queued workers do not necessarily begin
execution in the same order that they were created.

* Make it clear in the draft that the environment does not have to devote
equal resources to each worker. Therefore, even if Worker A started
execution before Worker B, and they should both take the same amount of
time to execute, Worker B could finish long before Worker A. Just because.
:)
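
For authors, the last three points boil down to this: never infer anything
from creation order. A minimal sketch of the defensive style I have in mind
follows. The message objects are simulated, and the workerIndex and result
fields are my own assumptions, not anything in the draft.

```javascript
// Simulated completion messages, arriving in an order that has nothing to
// do with the order in which the (hypothetical) workers were created.
var completions = [
  { workerIndex: 2, result: 'c' },
  { workerIndex: 0, result: 'a' },
  { workerIndex: 1, result: 'b' }
];

var results = [];
var pending = completions.length;

// File each result under the index carried in the message itself, and
// count down so we know when everything has reported back.
function onWorkerDone(message) {
  results[message.workerIndex] = message.result;
  pending--;
}

for (var i = 0; i < completions.length; i++) {
  onWorkerDone(completions[i]);
}
```

Once pending reaches zero, results is in creation order no matter which
worker finished first.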

The first two changes allow the createWorker() loop above to be modified
like so:

for (var i = 0; i < 1000000; i++) {
    // Assuming the timeout is measured in milliseconds
    arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i], 1000);
    if (arrayOfMessagePorts[i] == null) {
        // Throttling is occurring; try waiting a while and re-creating the worker!
        // Or just exit, if it is not a big deal.
    }
}

This allows applications to gracefully recover, slow down, or do whatever is
needed in a throttling scenario, without jamming up the whole system.
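
Here is a rough, runnable sketch of that recovery pattern. Everything in it
is an assumption on my part: the two-argument createWorker(url, timeoutMs)
is only my proposal, not the draft, and since no browser implements it, a
stub stands in for the user agent and returns null once a made-up capacity
is reached.

```javascript
// Hypothetical stand-in for the PROPOSED createWorker(url, timeoutMs).
// The stub pretends the user agent has room for only `capacity` workers
// and returns null (a simulated timeout) once that limit is reached.
function makeThrottledCreateWorker(capacity) {
  var active = 0;
  return function createWorker(url, timeoutMs) {
    if (active >= capacity) {
      return null; // throttled: gave up after timeoutMs
    }
    active++;
    return { url: url }; // placeholder for a MessagePort
  };
}

// Create one worker per URL. Stop at the first throttled call and report
// which URLs still need a worker, so the caller can retry them later.
function createWorkers(createWorker, urls, timeoutMs) {
  var ports = [];
  for (var i = 0; i < urls.length; i++) {
    var port = createWorker(urls[i], timeoutMs);
    if (port === null) {
      return { ports: ports, remaining: urls.slice(i) };
    }
    ports.push(port);
  }
  return { ports: ports, remaining: [] };
}

var outcome = createWorkers(makeThrottledCreateWorker(2),
                            ['a.js', 'b.js', 'c.js'], 1000);
```

The caller gets back both the ports that were created and the list of URLs
left over, and can decide whether to wait and retry or simply give up.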

Hope this all makes sense, and helps!

J.Ja
Received on Monday, 21 July 2008 15:51:20 UTC