[whatwg] AppCache-related e-mails from Ian Hickson on 2011-08-02 (public-whatwg-archive@w3.org from August 2011)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 2 Aug 2011 22:43:30 +0000 (UTC)
Message-ID: <Pine.LNX.4.64.1108012234530.1701@ps20323.dreamhostps.com>
On the subject of diagnostics for appcache:

On Wed, 8 Jun 2011, Patrick Mueller wrote:
> On Wed, Jun 8, 2011 at 15:21, Ian Hickson <ian at hixie.ch> wrote:
> > On Tue, 1 Feb 2011, Patrick Mueller wrote:
> > >
> > > I just tested Chrome beta this morning and saw nothing interesting 
> > > in appcache error events, however progress events have now grown 
> > > "loaded" and "total" properties (think those were the names, and I 
> > > think they're new-ish).  That's nice, as I can provide a progress 
> > > meter during cache load/reload.  I wouldn't mind having the URL of 
> > > the resource being loaded (that was just loaded?) as well as those 
> > > numbers.  And for errors it would be nice to know, in the case of an 
> > > error caused by a cache manifest entry 404'ing (or otherwise 
> > > unavailable), what URL it was. HTTP error code, if appropriate, etc.
> >
> > In theory, we don't want to expose this information because it can be 
> > used to introspect intranets.
> 
> I never considered that "introspect intranets" angle.  I guess the 
> thought is that a rogue site could send a manifest with pointers to 
> files inside someone's intranet, and then get someone inside that 
> intranet to load that manifest, at which point JavaScript could have 
> access to which URLs returned 200's, etc.  Nasty.

Right.


> Is this just an issue if the manifest or originating document's origin 
> is different than a file listed in the manifest itself?  Perhaps errors 
> on these entries would have less diagnostic data available for them - perhaps 
> no diagnostic data.  That would kind of fit with other cross-origin 
> access capabilities.

That might work.
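
A minimal sketch of the restriction suggested above, assuming a hypothetical
error_details() helper that decides what an appcache error event could
expose. Nothing like this exists in the spec; the function names and the
returned fields are illustrative only.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) triple used for origin comparison."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def error_details(manifest_url, failed_url, status):
    """Hypothetical policy: expose the URL and status only when the failed
    entry is same-origin with the manifest; otherwise expose nothing."""
    if origin(manifest_url) == origin(failed_url):
        return {"url": failed_url, "status": status}
    return {}
```

Under this policy a manifest that points at intranet URLs learns nothing
about which of them answered.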


> > What kind of information would be most useful? Should it be in the 
> > same format from every browser or should it be detailed and freeform?
> 
> Start with URL, because we know a URL was involved.  Then allow for an 
> optional vendor-specific freeform message.

Vendor-specific messages end up being parsed by scripts, and shortly after 
that we end up having to hard-code those messages as the spec.

So I'd rather not add a freeform message!

What is the URL for? Can you describe the way this information would be 
used in a user interface or however it would be used?

I'm just trying to make sure we address the actual problems that need 
addressing.


Regarding TLS and cross-origin requests:

On Thu, 16 Jun 2011, Michael Nordman wrote:
> > On Tue, 8 Feb 2011, Michael Nordman wrote:
> > >
> > > Just had an offline discussion about this and I think the answer can 
> > > be much simpler than what's been proposed so far.  All we have to do 
> > > for cross-origin HTTPS resources is respect the cache-control 
> > > no-store header.
> > >
> > > Let me explain the rationale... first let's back up to the 
> > > motivation for the restrictions on HTTPS. They're there to defeat 
> > > attacks that involve physical access to the client system, so the 
> > > attacker cannot look at the cross-origin HTTPS data stored in the 
> > > appcache on disk. But the regular disk cache stores HTTPS data 
> > > provided the cache-control header doesn't say no-store, so excluding 
> > > this data from appcaching does nothing to defeat that attack.
> > >
> > > Maybe the spec changes to make are...
> > >
> > > 1) Examine the cache-control header for all cross-origin resources 
> > > (not just HTTPS), and only allow them if they don't contain the 
> > > "no-store" directive.
> > >
> > > 2) Remove the special-case restriction that is currently in place 
> > > only for HTTPS cross-origin resources.
> >
> > On Wed, 30 Mar 2011, Michael Nordman wrote:
> > >
> > > Fyi: This change has been made in chrome.
> > > * respect "no-store" headers for cross-origin resources (only for 
> > > HTTPS)
> > > * allow HTTPS cross-origin resources to be listed in manifest hosted 
> > > on HTTPS
> >
> > This seems reasonable. Done.
> 
> I had proposed respecting the "no-store" directive only for cross-origin 
> resources. The current draft is examining the "no-store" directive for 
> all resources without regard for their origin. The intent behind the 
> proposed change was to allow authors to continue to override the 
> "no-store" header for resources in their origin, and to disallow that 
> override only for cross-origin resources. The proposed change is less 
> likely to break existing apps, and I think there are valid use cases for 
> the existing behavior where "no-store" can be overridden by explicit 
> inclusion in an appcache.

I guess we can restrict no-store to cross-origin HTTPS resources, but it 
seems far easier to explain that no-store in general is honoured. 
Otherwise you end up with these weird situations where some resources can 
be cached and some can't, and the only reason one can or can't be stored 
is where the manifest is, but only if it has no-store, etc... It gets 
rather confusing.

Also, what use cases are there for specifying no-store that don't apply 
across all resources?
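
To make the two candidate rules concrete, here is a rough sketch, assuming
a simplified Cache-Control parser and a simplified same-origin check
(scheme plus host:port string). The function names are made up for
illustration; neither is taken from the spec or from any browser.

```python
from urllib.parse import urlsplit

def has_no_store(cache_control):
    """True if the Cache-Control header value contains the no-store directive."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    return "no-store" in directives

def same_origin(a, b):
    """Simplified origin comparison: scheme and host:port must match."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc.lower()) == (pb.scheme, pb.netloc.lower())

def may_cache_general(cache_control):
    """Rule A: no-store is honoured for every resource."""
    return not has_no_store(cache_control)

def may_cache_cross_origin_only(manifest_url, resource_url, cache_control):
    """Rule B: no-store is honoured only for cross-origin resources."""
    if same_origin(manifest_url, resource_url):
        return True
    return not has_no_store(cache_control)
```

Rule A needs only the header; rule B also needs to know where the manifest
lives, which is the extra condition that makes it harder to explain.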



On the topic of appcache being used to cache everything but the main page:

On Wed, 29 Jun 2011, Felix Halim wrote:
> On Thu, Jun 9, 2011 at 3:21 AM, Ian Hickson <ian at hixie.ch> wrote:
> > If you're not loading the main page from the cache, what does this 
> > gain you that regular HTTP caching doesn't?
> 
> Suppose the content of the main page changes very often (like a news site).
> In this case, you don't want to cache the main page since the users
> want to see the latest main page, not the cached ones when they open
> the main page later.
> However, should the network connectivity be down, the user should be
> presented with the cached main page.

This suggests you _do_ want the page cached -- you just want the browser 
to not use it by default.

This has numerous problems:

 - What if the cache is out of date compared to the main page? For 
   example, if a site changes its stylesheet and what classes it uses, the 
   main page will no longer match the styles that the user has cached. The 
   user's cache could be months old.

 - How do you determine what is a network error and what is not? As 
   written, the appcache mechanism neatly avoids having to define this, 
   instead using a whole bunch of signals such as redirects (captive 
   portals), 500s (site down), no network connectivity, etc, as indicators 
   that the cache shouldn't be updated, but this is all done 
   asynchronously so the user doesn't have to wait to see it.

 - It doesn't provide any performance improvement over HTTP caching. In 
   fact the only improvement over today's implementations is that the UA 
   will show the page if the network is down, but there's no reason that 
   the browser shouldn't just do this anyway. In fact, many browsers over 
   the years _have_ done this.


> The news content is fetched dynamically through XHR and stored in 
> localStorage. However, this complicates the news site (a major redesign 
> of the website is necessary).

A redesign is a given when moving to appcache. The feature wasn't designed 
to be retrofitted onto existing sites; it was designed so that new sites 
could be written to take advantage of it.


> The current HTTP Caching still checks whether the resources are 
> modified, but in app cache, we can explicitly say that they are not 
> modified unless we change the manifest hash.

It doesn't have to. You can set an expiry date to avoid this.
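
As an illustration of setting an expiry date, a server could emit headers
along these lines. This is a sketch only: the helper name and the choice
of sending both Cache-Control and Expires are assumptions, not something
the thread prescribes.

```python
import time
from email.utils import formatdate

def expiry_headers(max_age_seconds, now=None):
    """Build response headers that let a cache reuse the resource without
    revalidating it until max_age_seconds have passed."""
    now = time.time() if now is None else now
    return {
        "Cache-Control": f"max-age={max_age_seconds}",
        # HTTP dates are always expressed in GMT.
        "Expires": formatdate(now + max_age_seconds, usegmt=True),
    }
```

Until the expiry time is reached, a conforming cache can serve the
resource without contacting the server at all.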


> So, in this case, HTML5 App Cache can help make regular online websites 
> far faster, as well as provide offline access should the network be down 
> (or the server is down).

I disagree. I don't think appcache adds anything here that HTTP can't do.


> This would make the online news site feel online when it's online and 
> offline when it's offline. I don't think HTTP Cache can serve the 
> content if the network / server is down.

Why not?


> If the main page is always cached, then the next time the user visits 
> the main page, it will (almost) always see the STALE content of the main 
> page.
>
> Then a split second later, the main page refreshes with the most 
> up-to-date version, which is very annoying to the users.

Appcache isn't intended to refresh the page once the cache is refreshed -- 
the normal use case is to just keep the user one version behind, 
essentially. It's not intended for caching data, only app logic.


> HTTP Caching requires server modifications on altering the headers and 
> is a non option for users that have no control on the server side.

Given how cheap it is to get hosting nowadays where the author can have not 
only complete control over the headers but root-level control on the 
machine, I really don't think this is a valid concern anymore.


> Also, many servers are mostly mis-configured on how to send the correct 
> headers

If we can't rely on correct configuration, then appcache isn't going to 
work. It relies on specific MIME types to work right.


> and some proxies may alter them on its way to the client.

(Do you have any data to support this?)

It seems to me that if you are assuming the proxy is hurting 
performance-improving HTTP headers, it's not safe to assume that it won't 
also hurt performance-improving HTML attributes.


> In fact, we can do even better than that by not fetching the MANIFEST 
> itself by including an (optional) manifest's HASH inside the HTML like:
> 
> <html useManifest="my.manifest" manifestHash="asdfasdfasd">
> 
> If not specified, then the my.manifest will always be checked for 
> modifications.

This would only work for this "only use the cache as an HTTP 
cache augmentation" feature, since normally the main file isn't fetched 
down so there's no hash to compare.

Checking the cache manifest has changed is very cheap, anyway. It's just 
one HTTP round-trip, and it isn't in the critical path, performance-wise, 
so it doesn't need to be quick.


> I think it means that we should be able to selectively update any file 
> in the manifest, rather than blindly updating everything if the 
> manifest's hash changes.

You don't blindly update everything. Normal HTTP rules apply.


> The ability to selectively update the cached files is very appealing. If 
> your resources are 5 MB, and you know you only want to update on a small 
> file of 1KB...
> 
> I believe the way the current App Cache updates everything if the 
> manifest file changes is just too inefficient. You can say it can be no 
> worse than HTTP Caching, but it can be made far better!

Getting this kind of thing right is very difficult. I'm not at all 
convinced it's worth it. Data showing what bandwidth or CPU savings this 
would involve in typical cases would be quite helpful in determining 
whether it's worth it or not.


> >> The application cache is very powerful. But it is very disappointing, 
> >> that it is only useful for static pages. With a little improvement to 
> >> the Offline Web applications chapter, and of course to the browsers, 
> >> it would be possible to cache any Content Manager or dynamic page. 
> >> And that would let the appcache become one of the most powerful 
> >> things in the world.
> >
> > HTTP caches already do most of this.
> 
> It's far harder to setup HTTP Cache properly, than a simple manifest file.

We don't fix problems with one technology by making another technology 
redundant with it.


> Even if we set up HTTP Cache properly, it may still not work properly if 
> there are proxies. HTTP Cache is very fragile and not reliable.

That's a problem to raise with the HTTP working group.


> This "Dynamic Data" inside the main page is THE MAIN reason many people 
> DON'T WANT the App Cache to CACHE the main page!
> 
> Of course you can then say you should separate the "dynamic" from the 
> "static" and store the "dynamic" in the localStorage / indexedDB... 
> However, this is NOT what the current majority of websites like forums, 
> blogs, news sites were designed!

Sure. But the number of existing sites is dwarfed by the number of future 
sites. Appcache isn't designed for retrofitting, it's designed to be used 
with new sites that have it in mind.

If you want something you can retrofit onto existing sites, HTTP already 
provides all the tools you need, as far as I can tell.


> >> The current App Cache design updates the cache to the latest version 
> >> in the background when the user visit the page for the second time 
> >> and then it needs to refresh the page to actually update the display. 
> >> This is annoying since the user will first see stale data, then a few 
> >> second later, it's updated with a giant refresh (including all the 
> >> static resources).
> >
> > You shouldn't store data in the appcache, only logic, otherwise yes, 
> > the user will always be one version behind.
> >
> > Note that there is no giant refresh unless the page makes it so.
> 
> The page or the user MUST do giant refresh, otherwise the user do not 
> see the latest main page!

This is only true for people misusing appcache with legacy sites that put 
data in the main page instead of having static logic pages with separate 
dynamic data.


> >> That is another reason why we need pageStorage: to separate the 
> >> dynamic and the static resources.
> >
> > Don't we already have enough ways to store data?
> 
> pageStorage Quota is different from localStorage.

Sure, there are many different ways to store data, and they're all 
different from each other. I'm just saying we already have enough 
mechanisms, we don't need more.


> localStorage Quota is per domain, while pageStorage is per page. one 
> page may have entirely different unrelated dynamic data than another 
> page on the same domain.
>
> Their quota should be separated, otherwise the localStorage domain quota 
> will be too small if there are many pages in that domain.
> 
> This can give the browsers options to give quota based on PAGE rather 
> than based on DOMAIN.
>
> Which I think is more reasonable if each PAGE is unique even though they 
> are in the same DOMAIN.

Quota on a per-page basis doesn't work because an attacker would just use 
many different pages, which are trivial to construct. At least different 
origins require a modicum of effort to create.
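
The arithmetic behind this is simple. A sketch, assuming the 5MB figure
mentioned elsewhere in this thread and hypothetical function names:

```python
MB = 1024 * 1024

def max_usage_per_origin_quota(quota_bytes, page_count):
    """With a per-origin quota, creating more pages does not raise the cap."""
    return quota_bytes

def max_usage_per_page_quota(quota_bytes, page_count):
    """With a per-page quota, the cap grows with every page the site creates."""
    return quota_bytes * page_count
```

With 5MB per page, a site that generates 1000 distinct page URLs could
claim about 5000MB, whereas a per-origin quota keeps it at 5MB.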


On Thu, 30 Jun 2011, Bjartur Thorlacius wrote:
>
> Ask HTTP implementors to store a potentially stale fallback copy for 
> offline use when an authoritative copy is unavailable. Even HTTP caches 
> are allowed to return stale responses as long as they warn their clients 
> (so they can warn their clients or fetch an authoritative copy via 
> another route).
>
> Browsers should keep copies of the most used entries for offline use. 
> It's probably a matter of minor tweaking, considering that mainstream 
> browsers support offline modes already.
> 
> From http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.1.5: 
> In some cases, the operator of a cache MAY choose to configure it to 
> return stale responses even when not requested by clients. This decision 
> ought not be made lightly, but may be necessary for reasons of 
> availability or performance, especially when the cache is poorly 
> connected to the origin server. Whenever a cache returns a stale 
> response, it MUST mark it as such (using a Warning header) enabling the 
> client software to alert the user that there might be a potential 
> problem.

Indeed.
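
For illustration, the behaviour described in the quoted RFC 2616 text
could look roughly like this in a cache. It is a sketch under stated
assumptions: the cache entry layout, the serve() name, and the "-"
placeholder for the warn-agent are all made up here; only the 110 warn
code and its "Response is stale" text come from RFC 2616.

```python
def serve(cached, fetch):
    """Return (body, extra_headers).

    cached: None, or a dict with 'body' and a boolean 'fresh'.
    fetch:  callable returning a fresh body, raising OSError on network failure.
    """
    if cached is not None and cached["fresh"]:
        return cached["body"], {}
    try:
        return fetch(), {}
    except OSError:
        if cached is None:
            raise  # nothing to fall back to
        # Stale fallback: mark it so the client can alert the user.
        return cached["body"], {"Warning": '110 - "Response is stale"'}
```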


On Thu, 30 Jun 2011, timeless wrote:
>
> It's possible to build a main page so that it can update its content 
> using a subresource. You can use iframes, javascript (including json), 
> xmlhttprequests, or other things to do this.
> 
> Nothing requires you to have a monolithic main page which is incapable 
> of dynamically updating itself. ... If I visit your page on May 1st and 
> sit there for two months, does your page really just want to continue to 
> show me the same content when I glance at it on July 1st? It can show 
> other content if it wants to, and in order to save bandwidth costs, it 
> should avoid resending the framework which shouldn't be changing. Once 
> your page works well for this case, it should work well for app-cache.

Indeed.


On Fri, 1 Jul 2011, Felix Halim wrote:
> 
> Those are another option besides using localStorage. Again, those things 
> requires restructuring your website. I'm looking for a solution that 
> doesn't require modifying anything except adding a manifest.

I think you're better off just biting the bullet and doing the redesign. 
It'll make your site easier to maintain anyway, and will help you make a 
more modern, more fluid site, which really requires that the data be 
separate from the logic anyway. For example, you also need to separate 
your logic from your data to make good use of pushState(), another big 
performance improvement.


> As I said before, separating dynamic from the static will work, however, 
> if we don't have "pageStorage", even if we have a clean dynamic separation, 
> it will quickly run out of space if we use "localStorage" since the 
> localStorage quota is per domain.

Nothing stops the localStorage quota from being equal to the sum of the 
pageStorage quotas.


> Let's see an example:
> 
> I have a dynamic page with this url:
> 
> http://bla/page?id=10
> 
> The content inside is changing very frequently, lets say every hour. Of 
> course, I want the browser to cache the latest version. So, it seemed 
> that AppCache is a perfect fit...
> 
> I then add the manifest to enable the App Cache, and what do I get?
> 
> Every time I open that URL every hour, I ALWAYS see the STALE version 
> (the 1 hour late version). Then a few seconds (or minutes) later 
> (depending on when the AppCache gets updated), I refresh, then I get the 
> latest content. Annoying, right?

Yes, that would be annoying. Don't do that, it's not the way to write 
pages these days. :-)

It also fails to handle timeless' scenario (to which the above was a 
reply): if you go to http://bla/page?id=10, and wait two hours, then the 
content is two hours old. Modern sites dynamically update the content so 
that it is always fresh, even without appcache.


> Now, let see the alternative: I build a framework to separate the 
> dynamic from the static. I have to make it so that only ONE MAIN PAGE 
> get cached by the app cache. So, my URL can NO LONGER BE:
> 
> http://bla/page?id=10
> 
> But it has to change to:
> 
> http://bla/page#!id=10

It's not clear to me what the id=10 means here, but with fallbacks and 
pushState() you can certainly still do the ?id=10 thing.


> Why do I have to do this? it's because if I DON'T, then each page will 
> be stored on different App Cache, and the "stale by one" still occurs! 
> That is,
> 
> http://bla/page?id=10
> 
> and
> 
> http://bla/page?id=11
> 
> Will be on DIFFERENT AppCache!

No, they'll have the same cache if they have the same manifest.
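
A small sketch of why that is, assuming both pages carry the same relative
manifest attribute ("my.manifest" is a made-up name) and that pages are
grouped by the absolute manifest URL the attribute resolves to:

```python
from collections import defaultdict
from urllib.parse import urljoin

def group_by_manifest(pages):
    """pages maps each page URL to its manifest attribute value.
    Returns a mapping from absolute manifest URL to the pages that use it."""
    groups = defaultdict(list)
    for page_url, manifest_attr in pages.items():
        groups[urljoin(page_url, manifest_attr)].append(page_url)
    return dict(groups)
```

Because the query string plays no part in resolving a relative URL,
page?id=10 and page?id=11 resolve to the same manifest URL and so land in
the same group.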


> Note that even though the dynamic content is "dynamic" it doesn't mean 
> that:
> 
> http://bla/page?id=10
> 
> has "shared" data with
> 
> http://bla/page?id=11
> 
> It can be totally different unrelated dynamic content.

I don't really follow. It's the same site, no?

Maybe a more concrete example would make this clearer.


On Fri, 1 Jul 2011, Michael Nordman wrote:
>
> A common request that maybe we can agree upon is the ability to list the
> manifests that are cached and to delete them via script. Something like...
>   // returns appcache manifest for the origin
>   String[] window.applicationCache.getManifests();
>   void window.applicationCache.deleteManifest(manifestUrl);

This is trivial to do already; just return 404s for all the manifests you 
no longer want to keep around.


> 0. [DONE] A means of not invoking the fallback resource for some error
> responses that would generally result in the fallback resource being
> returned. An additional response header would suit their needs...
> something like...
> x-chromium-appcache-fallback-override: disallow-fallback
> If a response header is present with that value, the fallback response would
> not be returned.
> http://code.google.com/p/chromium/issues/detail?id=82066

What's the use case? When would you ever want to show the user an error 
yet really desire to indicate that it's an error and not a 200 OK response?


> 1. [UNDER CONFUSING DISCUSSION] Allow a syntax to associate a page with 
> an application cache, but does not add that page to the cache. A common 
> feature request also mentioned on the whatwg list, but it's not getting 
> any engagement from other browser vendors or the spec writer (which is 
> kind of frustrating). The premise is to allow pages vended from a server 
> to take advantage of the resources in an application cache when loading 
> subresources. A perfectly reasonable request, <http useManifest='x'>.

This feature request isn't reasonable, it makes no sense. HTTP caching 
already entirely handles this case.


> 2. Introduce a new manifest file section to INTERCEPT requests into a 
> prefix matched url namespace and satisfy them with a cached resource. 
> The resulting page would be free to interpret the location url and act 
> accordingly based on the path and query elements beyond the prefix 
> matched url string. This section would be similar to the FALLBACK 
> section in that prefix matching is involved, but different in that 
> instead of being used only in the case of a network/server error, the 
> cached INTERCEPT resource would be used immediately w/o first going to 
> the server.
>   INTERCEPT:
>   urlprefix redirect newlocationurl
>   urlprefix return cachedresourceurl
> 
> Here's where the INTERCEPT namespace could fit into the changes to the
> network model.
>    if (url is EXPLICITLY_CACHED)  // exact match
>      return cached_response;
>    if (url is in NETWORK namespace) // prefix match
>      return network_response_as_usual;
>    // this is the new section
>    if (url is in INTERCEPT namespace) // prefix match
>      return handle_intercepted_request_accordingly
>    if (url is in FALLBACK namespace) // prefix match
>      return network_response_but_fallback_where_needed;
>    if (ONLINE_WILDCARD)
>      return network_response;
>    otherwise
>      return synthesized_error_response;

What's the use case here?
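
For reference, the lookup order in the quoted proposal can be written out
as runnable code. This is only a sketch of the proposal as quoted: the
INTERCEPT section does not exist in the spec, and the dictionary layout
and the returned labels are assumptions made for this example.

```python
def resolve(url, manifest):
    """Decide how a request is handled, checking sections in the proposed order."""
    if url in manifest["cache"]:                              # exact match
        return "cached"
    if any(url.startswith(p) for p in manifest["network"]):   # prefix match
        return "network"
    for prefix, target in manifest["intercept"].items():      # proposed section
        if url.startswith(prefix):
            return f"intercept:{target}"
    for prefix, target in manifest["fallback"].items():       # prefix match
        if url.startswith(prefix):
            return f"network-or-fallback:{target}"
    if manifest["online_wildcard"]:
        return "network"
    return "error"
```

The only difference from the existing model is the third check, which
answers from the cache without going to the server first.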


> 3. Allow INTERCEPT cached resources to be "executable". Instead of 
> simply returning the cached resource or redirect in response to the 
> request, load it into a background worker context (if not already 
> loaded) and invoke a function in that context to asynchronously compute 
> response headers and body based on the request headers (including 
> cookie) and body. The background worker would have access to various 
> local storage facilities (fileSystem, indexed/sqlDBs) as well as the 
> ability to make network requests via XHR.
>   INTERCEPT:
>   urlprefix execute cachedexecutableresourceurl

What's the use case?


> 4. Create a syntax to allow FALLBACK resources to be similarly 
> executable in a background worker context.

What's the use case for this? How is it different from the last two?


> 5. Some kind of auto-update policy where the appcache is refreshed w/o 
> the app running.

That's already possible. The UA is allowed to refresh the appcache 
whenever desired by the user (explicitly or implicitly).


On Sun, 3 Jul 2011, Felix Halim wrote:
> 
> Remember that I also want those URL to be available even if the user is 
> offline. HTTP Cache is not that powerful, AppCache is.

There is no reason HTTP caching can't be this powerful.


> I do want to use shared cache for shared resources and "page cache" for 
> non-shared resources (unique to that page). However, the non-shared 
> resources will become too large to fit in 5MB quota. Remember I have 
> different non-shared content for id=10, id=11, ..., id=100000, I don't 
> think that will fit in localStorage.

The localStorage is not limited to 5MB. That's just a suggested initial 
quota per-origin (and would apply to all the pageStorages of an origin 
too, for the same reason). Nothing stops a user from granting more quota 
if they use the site a lot.


On Thu, 7 Jul 2011, Felix Halim wrote:
> 
> This is a real example. I build a site like:
> 
> http://uhunt.felix-halim.net/id/339
> 
> That one is mine, and there are other ids like:
> 
> http://uhunt.felix-halim.net/id/32900
> http://uhunt.felix-halim.net/id/1133
> 
> And thousands of other IDs.
> Usually people look into few dozens IDs and not all thousands of them.
> 
> Each ID has a large-unique-frequently-changing data attached to them
> (about 400KB).

Forget appcache; the simplest way to speed up this site is to do all the 
calculations on the server and serve up just what gets displayed! A PNG of 
the entire page would be less than 400KB!

The actual data seems to be about 150KB uncompressed in text form. It's 
not clear whether all that data has to be transmitted, nor whether it 
might not be possible to store and transmit it in a binary form or 
compressed (or both).


> Obviously, if I do a clean separation, and store the static framework in 
> AppCache, and the frequently changing data in localStorage, I can only 
> cache 10 ids data or so.

Store the data in a database (WebIndexedDB) as numeric data instead of as 
text, and you would not need to take as much space.
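
A rough illustration of the size difference, assuming the data is a list
of non-negative integers below 2^32. The sample values are invented, and
what a real browser database uses on disk will differ; this only compares
a JSON text encoding with fixed-width 32-bit packing.

```python
import json
import struct

def text_size(values):
    """Bytes needed to store the values as JSON text."""
    return len(json.dumps(values).encode("utf-8"))

def binary_size(values):
    """Bytes needed to store the values as little-endian unsigned 32-bit integers."""
    return len(struct.pack(f"<{len(values)}I", *values))
```

For a thousand seven-digit identifiers, the packed form takes 4 bytes per
value, while the text form spends 7 bytes on digits plus separators.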


> What I want is a 5MB "pageStorage" quota per page id.

The quota is there to prevent a site from taking up a lot of disk space 
without user consent. We wouldn't ever be able to make it per-page. That 
would defeat the point of having a quota at all.


On Fri, 10 Jun 2011, Alexey Proskuryakov wrote:
> 
> Appcache API has everything to provide progress UI to the user, but with 
> every good progress bar, there goes a Cancel button.
> 
> I suggest adding an abort() method to ApplicationCache interface.

Done.


On Mon, 13 Jun 2011, Michael Nordman wrote:
>
> Let's say there's a page in the cache to be used as a fallback resource, 
> refers to the manifest by relative url...
> 
> <html manifest='x'>
> 
> Depending on the url that invokes the fallback resource, 'x' will be 
> resolved to different absolute urls. When it doesn't match the actual 
> manifest url, the fallback resource will get tagged as FOREIGN and will 
> no longer be used to satisfy main resource loads.
> 
> I'm not sure if this is a bug in chrome or a bug in the appcache spec 
> just yet. I'm pretty certain that Safari will have the same behavior as 
> chrome in this respect (the same bug). The value of the manifest 
> attribute is interpreted as relative to the location of the loaded 
> document in chrome and all webkit based browsers and that value is used 
> to detect foreign'ness.
> 
> The workaround/solution for this is to NOT put a manifest attribute in 
> the <html> tag of the fallback resource (or to put either an absolute 
> url or host relative url as the manifest attribute value).

Or just make sure you always use relative URLs, even in the manifest.

I don't really understand the problem here. Can you elaborate further?
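
For what it's worth, the resolution behaviour described in the quoted
message can be reproduced with ordinary relative URL resolution. A sketch,
using example.com URLs and function names that are purely illustrative:

```python
from urllib.parse import urljoin

def resolve_manifest(document_url, manifest_attr):
    """Resolve the manifest attribute against the URL the document was loaded as."""
    return urljoin(document_url, manifest_attr)

def is_foreign(document_url, manifest_attr, actual_manifest_url):
    """True when the resolved attribute does not name the cache's own manifest."""
    return resolve_manifest(document_url, manifest_attr) != actual_manifest_url
```

With manifest='x', a fallback page shown for http://example.com/a/page.html
resolves to http://example.com/a/x, but the same page shown for
http://example.com/b/c/other.html resolves to http://example.com/b/c/x. A
host-relative value such as '/x' resolves to http://example.com/x in both
cases.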


On Fri, 1 Jul 2011, Michael Nordman wrote:
>
> Cross-origin resources listed in the CACHE section aren't retrieved with 
> the 'Origin' header

This is incorrect. They are fetched with the origin of the manifest. What 
makes you say no Origin header is included?

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 2 August 2011 15:43:30 UTC