RE: Resource Timing - What's included from Nic Jansma on 2011-03-23 (public-web-perf@w3.org from March 2011)

From: Nic Jansma <Nic.Jansma@microsoft.com>
Date: Wed, 23 Mar 2011 18:48:57 +0000
To: Kyle Simpson <getify@gmail.com>, "public-web-perf@w3.org" <public-web-perf@w3.org>
Message-ID: <F677C405AAD11B45963EEAE5202813BD19CB9323@TK5EX14MBXW651.wingroup.windeploy.ntde>
>> 1) Resources that are already in the browser's disk cache, for 
>> example, from loading the page yesterday: *would* be included in the RT arrays.
>> Examples id="4" and id="5" below show this.
>
> Agreed, these definitely need to be included. Will there be some flag that indicates where it came from ("cache", "network", etc)?
> I think there definitely should be.

The problem with explicitly exposing "cache" vs. "network" is that it reveals, with certainty, private information about the page's visitor.

For example, you could construct an "attack" page that embeds 10,000 images from well-known bank sites.  Then, as the page is loading, you could iterate over the RT array and know exactly where the user has been, which banks they use, etc.

Arguably you can do this today by dynamically constructing IMG elements on the page and noting (via new Date()) how long it takes for the load event to fire.  If this is < 10ms, for example, you have a good guess that the image was in the cache.
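
As a rough sketch of why this is only an educated guess (the function names and the default 10ms threshold are illustrative assumptions taken from the example above, not part of any spec), the timing approach boils down to something like:

```javascript
// Hypothetical helper: turn a measured load duration into a cache *guess*.
// The threshold is a heuristic; a fast network or a slow disk can fool it.
function guessServedFromCache(durationMs, thresholdMs = 10) {
  return durationMs < thresholdMs;
}

// Browser-only sketch: time an IMG load using new Date(), as described above.
// It is only defined here, not called, so the snippet also loads outside a browser.
function timeImageLoad(url, onDone) {
  const start = new Date().getTime();
  const img = new Image();
  img.onload = img.onerror = function () {
    const durationMs = new Date().getTime() - start;
    onDone({
      url: url,
      durationMs: durationMs,
      likelyCached: guessServedFromCache(durationMs)
    });
  };
  img.src = url;
}
```

The result is a probability call based on one number, which is the distinction being drawn against a definitive "from cache" flag.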

The constraint we would like to stick to with RT is that we neither expose additional user-private information, nor higher-precision user-private information.  If we expose a "from cache" flag, we are exposing higher-precision information (a definitive answer vs. an educated guess) that the user had the resource in their cache.

With the RT array, we include the same timestamps you can get with new Date(), but nothing more.

> Actually, I think this assumption is not entirely correct. I have a JavaScript loader called LABjs, and in some browsers it operates in a "cache preloading" hacky method,
> where it makes a request for a script using a method that is guaranteed to download but *not* execute it (either by using a fake mime-type, or by using an <object>
> or Image container). Then, when appropriate, a second proper script element request is made for the same URL resource, making the assumption of course that
> the previous request successfully cached it. This second request being from a proper container/type, of course it then executes.
>
> But, the point is, in that scenario, in almost all browsers, I see both requests logged (IE9, Firefox, Chrome, etc). That's because the browser will still have to pull
> that second request from the browser cache.

I think we're on the same page -- we both want RT to expose the "observed" behavior of browsers.

My example below was a simplification of the issue, meant to point out one optimization that I believe all modern browsers implement.  For *static* elements within the page (e.g. <IMG />), current browsers re-use prior duplicate resource URLs instead of downloading them twice.  From my simple HTML example, only one resource request for 1.jpg would occur.  Current browsers don't re-check the cacheability of resources within the *same page* for *static* resources.

Your LABjs framework (cool!) shows that the observed behavior of browsers changes when interacting with dynamic resources.  For example, it looks like you're setting the .src link on a script via JavaScript.  XHR is a similar example.  In this case, modern browsers will have to re-validate the resource, though they still may pull it from the cache.  These "duplicate" resource URLs would both be captured in RT.

Here's the example I looked at: http://labjs.com/test_suite/test-LABjs-preloading.php?which=1 

In your LABjs framework, I would agree that the "duplicate" requests for the same resource later in the page -- because they were both dynamically inserted -- should be included in the RT array.

In summary, if the browser is initiating additional download requests, then by all means, I think they should be included in the RT array.  If it needs to check the cacheability of resources before re-using them (XHRs, setting src=), then yes, they should be included in the RT array.
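
To make that rule concrete, here is a minimal sketch of it as a predicate. The request descriptor and its field names (`downloadInitiated`, `cacheabilityChecked`) are hypothetical, invented only for illustration; nothing like them is proposed for the RT interface itself.

```javascript
// Hypothetical model of the inclusion rule summarized above.
//   downloadInitiated:   the browser initiated a download request
//                        (whether or not it completed, and whatever the status code)
//   cacheabilityChecked: the browser had to check the cache before re-using
//                        the resource (e.g. XHR, setting .src from script)
function includedInResourceTiming(req) {
  return Boolean(req.downloadInitiated || req.cacheabilityChecked);
}

// Illustrative cases from this thread:
const firstStaticImg   = { downloadInitiated: true,  cacheabilityChecked: false };
const duplicateStatic  = { downloadInitiated: false, cacheabilityChecked: false };
const labjsSecondFetch = { downloadInitiated: false, cacheabilityChecked: true  };

console.log(includedInResourceTiming(firstStaticImg));   // true
console.log(includedInResourceTiming(duplicateStatic));  // false
console.log(includedInResourceTiming(labjsSecondFetch)); // true
```

Under this model a static duplicate within the same page is the only case left out, because the browser neither requests it again nor re-checks its cacheability.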

Note: I captured this incorrectly in the four XHR cases in my original email.  All four XHRs should be included, not three.  The second cacheable resource, while it doesn't go to the network, is validated in the browser's cache as a "downloadable" resource (and serviced from there).

>> I would agree with you that the HTTP status code of the resource 
>> should not exclude it from the RT array.  404/500/etc should all be 
>> included.  If the browser "initiates" a request, whether or not it was 
>> completed, we should include it in the array.
>
> Yes, and furthermore, I think (mostly for data filtering purposes) having the status code actually in the data structure would be important. For instance, if a
> tool analyzing this data wants to filter out all 404's, etc.
> Same goes for, as mentioned above, having an indicator of where the resource came from ("network", "cache", "cache-revalidated", "cache-duplicate", etc).

There may be some privacy/security concerns with exposing the HTTP response code, especially for cross-origin resources.  For example, the presence of a 301 response from a login page on another domain could indicate that the user is already logged in.

Including the HTTP response code for same-origin resources sounds like a good idea, but I'm afraid that for cross-origin resources it would fall into the same bucket as zeroing out the network latency timestamps.
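
One way to picture that bucket is the sketch below. It assumes an entry shape with hypothetical field names (`status`, `connectStart`, `requestStart`, `responseStart`) chosen for illustration only; which fields would actually be withheld is still an open question.

```javascript
// Hypothetical sketch: withhold the status code and zero out the network
// latency timestamps for cross-origin entries; same-origin entries pass through.
function sanitizeEntry(entry, pageOrigin) {
  const sameOrigin = new URL(entry.url).origin === pageOrigin;
  if (sameOrigin) {
    return Object.assign({}, entry);
  }
  return Object.assign({}, entry, {
    status: null,      // not exposed cross-origin
    connectStart: 0,   // network latency timestamps zeroed
    requestStart: 0,
    responseStart: 0
  });
}

const crossOriginEntry = {
  url: "https://bank.example/login",
  status: 301,
  startTime: 5,
  connectStart: 6,
  requestStart: 8,
  responseStart: 20,
  responseEnd: 25
};

const sanitized = sanitizeEntry(crossOriginEntry, "https://site.example");
console.log(sanitized.status, sanitized.connectStart, sanitized.responseEnd);
```

The overall start and end times are left intact in this sketch, since a page can already observe them with new Date().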

- Nic
Received on Wednesday, 23 March 2011 18:49:34 UTC