RE: Workers from Ian Hickson on 2008-08-08 (public-html@w3.org from August 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 8 Aug 2008 08:28:51 +0000 (UTC)
To: Justin James <j_james@mindspring.com>, Andrew Fedoniouk <news@terrainformatica.com>
Cc: public-html@w3.org
Message-ID: <Pine.LNX.4.62.0808080800570.5140@hixie.dreamhostps.com>

On Mon, 21 Jul 2008, Justin James wrote:
> > 
> > It's exactly how the Web works today:
> > 
> >    <script src="url"></script>
> 
> QUESTION: How well would a "javascript: URL" work with your approach? If 
> the answer is "Great!" (I suspect so), then ignore the next paragraph.

If you mean as the URL for a createWorker() call, it would "work", in that 
it would take the URL, evaluate it in the original context, get a string 
out of it, and then pass that string to the new thread to be interpreted 
as a JavaScript program.
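
A rough sketch of that sequence follows. It is illustrative only: 
resolveWorkerSource is a hypothetical helper, not part of the draft, and 
the Function constructor merely stands in for "evaluate it in the original 
context". It also treats the URL body as a single expression, which is a 
simplification.

```javascript
// Hypothetical helper (an assumption for illustration, not a spec API):
// turn a javascript: URL into the program text handed to the new thread.
function resolveWorkerSource(url) {
  const prefix = "javascript:";
  if (!url.toLowerCase().startsWith(prefix)) {
    throw new Error("not a javascript: URL");
  }
  // Step 1: take the URL body and undo any %xx escapes.
  const expression = decodeURIComponent(url.slice(prefix.length));
  // Step 2: evaluate it in the original context (Function is a stand-in).
  const result = new Function("return (" + expression + ");")();
  // Step 3: the result must be a string; that string is the worker program.
  if (typeof result !== "string") {
    throw new Error("javascript: URL did not evaluate to a string");
  }
  return result;
}

// The expression is evaluated first; only its string result reaches the thread.
const source = resolveWorkerSource("javascript:'postMessage(' + (6 * 7) + ');'");
```

Here the new thread would be given the text "postMessage(42);", not the 
expression that produced it.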


> Not the *entire* world, which is my point. There are still tons of 
> people in-lining code into the HTML. Heck, look at what ASP.Net spews 
> out. Even using data: URLs, their code suddenly gets very, very ugly 
> (*uglier* if you ask me) very quickly.

If we find this to be the case, then we can definitely revisit this, but I 
am somewhat skeptical at this point that authors will want to inline their 
thread code. As noted before, this has been experimentally implemented in 
browser extensions before, and none of the authors using that experimental 
implementation liked the idea of having to provide script text; they all 
wanted to provide a URL instead.


> At this point, like I said, the only thing I disagree with the creation 
> from URL is the fact that the data: URL spec is so problematic.

Ok.


> > The fetching of the resource happens asynchronously.
> 
> Like Andrew Fedoniouk, I completely missed the part of the spec (right 
> above the numbered list) that specified the "separate and parallel 
> execution" of those steps. I think that, at the very least, the draft 
> should be updated to make this very clear. Would it be possible to make 
> the first step in the list, "create a separate thread, and execute the 
> rest of the steps in that thread", or somehow otherwise make this really 
> obvious? Without including the creation of the thread in the steps, it 
> is easy to overlook it as part of the process.

I was going to add a note, but then I noticed that it actually already 
says it twice -- once in the "create a worker" algorithm, and once in the 
"run a worker" algorithm. What should I add to make it clearer?


> I agree that different platforms will have different cap/throttle levels.
> But the code authors need to be able to check to see if they hit it!

Why?


> Some code may want to treat hitting the throttle as an error condition, 
> others may want to simply ignore it.

It seems like it would be bad for code to ever stop working just because 
the user has throttled its CPU usage.


> Also, the spec needs to clearly enunciate *precisely* the way a throttle 
> is implemented, so at the very least, all browsers handle it the same. 
> Does hitting throttle cause the call to create the worker to block? Or 
> is the worker created, but execution is delayed, which would allow the 
> calling code to continue (at the expense of the memory used by a 
> separate object)? Or does it throw an error (such as "cannot create 
> thread")?

It could also create a worker, but run it slowly.

There's no way to test what a user agent is doing here. We can't know if 
its JS implementation just happens to block on calls to createWorker() 
sometimes because that's where it happens to run a very expensive GC 
process, for instance. It's also very difficult to specify what should 
happen in the case of hardware limitations. Historically, we've left that 
mostly undefined, and, unlike most other things left undefined, this one 
hasn't been a huge problem.

I don't know how we would even go about testing such requirements.


> For example:
> 
> for (i = 0; i <= 1000000; i++) {
> arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i]);
> }
> 
> Yes, I know that it is an extreme example (not really, if you want to do 
> something to the individual pixels of an image in parallel...), but it 
> illustrates the problem well.

How is this different to, say:

   for (i = 0; i <= 1000000; i++) {
     arrayOfDocuments[i] = document.implementation.createDocument(null, null, null);
   }

...?


> On the other hand, without our draft explicitly stating how a browser 
> should perform a throttle, the developer has no clue how to write code 
> and take the possibility of throttling into account.

The best way is probably to pretend it never happens, and let the 
browsers deal with it in a graceful manner.
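
One way to read "pretend it never happens" is to write page code that 
depends only on the replies it receives, not on when or in what order the 
workers ran. A minimal sketch, in which makeCollector and the simulated 
replies are hypothetical and stand in for real message handlers:

```javascript
// Hypothetical helper (an assumption for illustration): gather one reply per
// worker without assuming when, or in what order, each worker actually ran.
function makeCollector(count, onDone) {
  const results = new Array(count);
  const seen = new Set();
  return function deliver(index, value) {
    if (seen.has(index)) return; // ignore duplicate replies
    seen.add(index);
    results[index] = value;      // store by index, not by arrival order
    if (seen.size === count) onDone(results);
  };
}

// Simulated replies arriving out of order, as a throttled user agent might
// produce them. In a page, each port's message handler would call deliver().
let finished = null;
const deliver = makeCollector(3, (results) => { finished = results; });
deliver(2, "c");
deliver(0, "a");
deliver(1, "b");
```

The results end up in index order whichever worker replied first, so the 
page behaves the same whether or not the user agent throttled anything.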


> I propose the following changes to make this situation (and other 
> pitfalls of parallel execution) far less dangerous to authors:
> 
> * Overloading the createWorker() method to accept a time span (in 
> ticks or milliseconds) as a timeout value.

That seems very dangerous. What if you're running on a machine where 
everything is just slower, and the timeout would always be hit?


> * Defining user agent "throttling" as *always* blocking when calling the 
> createWorker() method.

What if the user agent knows that it could handle the load effectively by 
farming off the thread to another computer altogether?

Also, how would we test that this requirement was being met?


> * Make it clear in the draft that just because the environment has 
> returned from createWorker() does *not* mean that execution of the 
> worker logic has started. Language such as, "creation of a worker object 
> does not guarantee that the worker is executing, it only guarantees that 
> the worker has been queued for execution" should be extremely helpful. 
> This is very, VERY important!

Again, how could we tell the difference between the thread having started 
but going really really slowly, and the thread being paused altogether and 
not having started?


> * Make it clear in the draft that queued workers do not necessarily 
> begin execution in the same order that they were created.

Again, how could we test this or prove it either way?

We can't require things we can't test, because we have to prove that every 
requirement has been met to finish our specification, as part of the W3C 
process. (That's one reason it's going to take so long.)


> * Make it clear in the draft that the environment does not have to 
> devote equal resources to each worker. Therefore, even if Worker A 
> started execution before Worker B, and that they should both take the 
> same amount of time to execute, Worker B could finish long before Worker 
> A. Just because. :)

Again, I don't see how we could test this. We already allow user agents to 
pretty much do anything to handle hardware limitations.


> What the first two changes accomplish, is the ability of the previous 
> example to be modified like so:
> 
> for (i = 0; i <= 1000000; i++) {
> //Assuming timeout is measured in milliseconds
> arrayOfMessagePorts[i] = createWorker(arrayOfURLs[i], 1000);
> if (arrayOfMessagePorts[i] == null) {
> //Throttling is occurring, try waiting a while and re-creating the worker!
> //Or just exit, if it is not a big deal.
> }
> }

Why would that be better? Surely it would be better for the user agent to 
do all the throttling itself instead.


> This allows applications to gracefully recover, slow down, or whatever is 
> needed in the case of a throttling scenario, without jamming up the 
> whole system.

It seems like we should leave that up to the browser.


On Tue, 22 Jul 2008, Justin James wrote:
> 
> I can see why this makes sense, but I find it hard to believe that 
> browser vendors cannot build a thread system into their JavaScript 
> interpreters that does this sensibly, without needing to use a separate 
> document. They can download files in a separate thread, I am not sure 
> why they can't have a multithreaded JavaScript system too.

I shall let the browser vendors speak to this.


> * In some cases, particularly workers created during onLoad, there will 
> be such severe contention for the download queue that it will take 
> longer to run the script from the remote URL than a self-stored script.

This seems unlikely. All use cases I can think of for running many scripts 
will all be running the same one (or few) scripts, not many many different 
ones.


> * What happens in regard to caching? Let's say, for example, that a 
> mouse over operation spawns a worker, and the user is madly waving their 
> mouse about. Does the browser re-download the script each time? If so, 
> you are going to see a TON of lag. Or does it cache the script? Can the 
> developer request that the item be cached or override the cache? If so, 
> what mechanism is available for that?

HTTP defines how caching works already.


> * What happens if, for whatever reason, the script contents should be 
> generated by the server, and GET is inadequate? Is there a method to do 
> a full POST to get the script from a server?

No. What would the use case be for doing a post submission as part of 
obtaining a script? That seems quite odd.


> * What is considered a "failure" to download the script? Will the 
> download system follow redirects, for example?

Again, HTTP defines this.


On Tue, 22 Jul 2008, Justin James wrote:
> 
> I think that it is critical that you read this white paper from 
> Microsoft regarding data: URLs in IE 8:
> 
> http://code.msdn.microsoft.com/Release/ProjectReleases.aspx?ProjectName=ie8whitepapers&ReleaseId=575
> 
> Some highlights:
> 
> * Data URLs are not allowed to contain scripts

This is a limitation of IE's implementation, and not a fundamental 
limitation of data: URLs.


> * There is a 32 KB cap on data: URLs

This again is just a limitation of IE's implementation. (Though one has to 
wonder, why would you generate a URL of more than 32KB? Wouldn't it make 
more sense to generate the part that changes, and then fetch the rest as 
part of an importScripts() call?)
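
A minimal sketch of that split, assuming a hypothetical buildBootstrap 
helper, a made-up settings object, and an example library URL. Only the 
small generated part would need to travel in the URL; the rest is fetched 
by importScripts().

```javascript
// Hypothetical helper (an assumption for illustration): build only the small,
// per-request part of a worker script. The bulk of the code lives at
// libraryUrl and is pulled in by importScripts() once the worker runs.
function buildBootstrap(settings, libraryUrl) {
  return "var settings = " + JSON.stringify(settings) + ";\n" +
         "importScripts(" + JSON.stringify(libraryUrl) + ");\n";
}

// The settings object and URL are invented example values.
const bootstrap = buildBootstrap(
  { tile: 7, mode: "blur" },
  "http://example.com/filters.js"
);

// The generated part stays far below a 32 KB cap.
const withinCap = bootstrap.length < 32 * 1024;
```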


> * Something we both missed in the data: URL RFC is that the only 
> characters allowed in a data: URL are characters allowed in an URL; this 
> makes putting useful code in a data: URL impossible.

Characters can be escaped using %xx escapes, so this isn't a problem.
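
For instance, a short sketch using the standard encodeURIComponent() 
function. The helper name scriptToDataUrl and the text/javascript media 
type are assumptions made for illustration.

```javascript
// Build a data: URL from script text. The helper name and media type are
// assumptions; encodeURIComponent() does the %xx escaping.
function scriptToDataUrl(source) {
  return "data:text/javascript," + encodeURIComponent(source);
}

// Script text containing characters that are not allowed raw in a URL.
const script = 'postMessage("a < b & c");';
const url = scriptToDataUrl(script);

// Decoding the part after the comma recovers the original script text.
const roundTrip = decodeURIComponent(url.slice(url.indexOf(",") + 1));
```

The quotes, spaces, "<" and "&" in the script all become %xx sequences in 
the URL, and decoding gives back the original text unchanged.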

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 8 August 2008 08:29:32 UTC