Fixing appcache: a proposal to get us started from Jonas Sicking on 2013-03-26 (public-webapps@w3.org from January to March 2013)

From: Jonas Sicking <jonas@sicking.cc>
Date: Tue, 26 Mar 2013 00:02:58 -0700
To: Webapps WG <public-webapps@w3.org>
Message-ID: <CA+c2ei86S1dELQ4uvLaAwUaF_MMXZt9fMmgNa0QsuChWg3cZZg@mail.gmail.com>
Hi WebApps!

Apologies in advance for a long email. This is a complex subject and I
wanted to present a coherent proposal. Please don't be shy about
starting separate threads when providing feedback.

There has been a lot of debate about "fixing appcache". Last year
Mozilla got a few people together, mostly with the goal of
understanding what the actual problems were. The notes from that
meeting are available at [1].

Those discussions, and a few followup ones, have made it clear that
there are a few big-ticket items that we need to fix:

* The fact that master entries are automatically added to the cache
works very poorly for a lot of developers.
* Once a website is cached the user will only see the new version on
second load, even if the user is online. This is good for performance
but is a behavior many websites aren't willing to live with.
* You have to tweak a comment of the manifest in order to trigger an
update-check of the cached resources.
* We need an "escape hatch" for people running into missing features
in the appcache. I.e. a way for websites to use script to complement
the set of behaviors supported by the appcache spec.
* The fact that FALLBACK combined the "hit network first, fall back to
a cached resource" and "allow a cached resource to handle requests to
a whole URL space" behaviors is problematic since many times you want
one and not the other.
* People want to use appcache not just to make offline apps possible,
but also make online apps fast.
* There isn't enough ability to control the appcache through javascript.

There are certainly other things that people have mentioned, but the
above have been recurring themes. Feel free to comment here if you
have other issues with the current appcache, but it might be worth
doing that as separate threads.

I believe that some of these problems stem from a relatively small set
of design problems:

The appcache appears to be aimed at applications that are too simple.
It works fine if the website you want to cache consists of a small set
of static resources and otherwise only uses features like IndexedDB or
localStorage to manage dynamic data. But once an application uses
server-side processing to dynamically generate resources based on
query parameters or other parts of the URL, the appcache requires that
you change the way that your application works.

Another design aspect that appears to be causing problems is that
appcache is optimized too heavily for minimizing the amount of typing
that the author has to do. It attempts to help the author too much,
for example by automatically adding master entries that link to an
appcache but aren't enumerated in it, or by automatically adding the
"handler" URLs from the FALLBACK section to the set of URLs to be
automatically downloaded.

The result is that the appcache contains too much "magic". In theory
an author can just type very little and the appcache will
automatically do the right thing. However this magic is making it too
hard for authors to understand what's going on. The result is that
people don't use the appcache even if they might have needed to type
very little to get it working. Implementations certainly haven't helped
here either, by not exposing the behind-the-scenes logic through
debugging and developer tools.

The fact that the appcache is aimed at simple applications is
generally a good thing given that it's the first version. However the
desire to make applications available offline, as well as make them
faster when the user is online, has been so great that people have
wanted to use the appcache to solve a larger set of types of
applications. So to some extent the appcache has been a victim of its
own success.

The other week a few of us at Mozilla got together to discuss how to
"fix appcache". I.e. how to come up with a solution for the above
mentioned problems. We came to a few conclusions.

We still want to try to keep a declarative solution. While the current
appcache appears to only work really well for very simple
applications, we hope that it is possible to find a declarative format
which is simple enough to be understandable and practical, while still
supporting a large number of applications.

However, we do think that there needs to be a script-driven
"fallback". A declarative solution can't ever cover everything that
people want to do. For example some websites will want to use complex
algorithms for determining when a given resource is out-of-date, in
order to avoid redownloading a new version of a script, when the
existing version is "good enough". Others are using more complex URL
schemes, which means that prefix matching doesn't work. Such a
script-based solution also has the benefit that it can feed into future
versions of a declarative format.

This leaves the question of which exact feature set we should put in
the declarative solution. This is a very tricky question and likely
something we'll have to iterate a lot on after getting feedback from
authors. One thing to keep in mind here is that the most complex
websites out there are likely ones that we'll never be able to capture
fully with a declarative format. I'm actually still not 100% convinced
that we can find a feature set which is appropriate to put in a
declarative solution. But I'm really hoping we can, so I definitely
want to give it a shot.

So, with that as background, let me present the proposal that we
currently have. The feature set that we aimed for is:

* A set of URLs to be downloaded and cached.
* Control over online behavior. I.e. what to do if resources are
cached but the user is online.
* Choose if an update-check is done by only checking the manifest or
by checking all the resources linked to by the manifest. And if update
checking is done by checking for updates of the manifest, use an
explicit version indicator rather than simply a comment.
* A way to map URL spaces to be handled by specific resources.
* Ability to specify last-modification-date or etags for individual
resources so that we don't need to do if-modified-since or
if-none-match requests for those.
* Sub-manifests. I.e. a way to add another manifest to a cache which
causes all the resources from that manifest to also get cached.
* A way to invalidate an appcache automatically if a user-identifying
"login-cookie" changes.
* Javascript API to allow control over an appcache.
* A way to "plug in" a webworker to handle network requests in order
to support more advanced usecases.

Another "feature" that we are proposing is to drop the current
manifest format and instead use a JSON based one. The most simple
reason for this is that we noticed that the information we need to
express quickly became complex enough that using a format with simple
parsing rules was beneficial.

A format based on extending the current appcache format would be no
problem for a UA to parse. However the complexity that we need to
express resulted in something that's too hard for a human to manually
write, or for a human to understand when looking at somebody else's
manifest in order to learn.

The simple parsing rules for JSON seemed like a better fit. It also
provides more of an opportunity to extend the format in the future.
JSON also has advantages when it comes to creating APIs exposed to
webpages for interacting with appcaches. More about this below.

That said, we are definitely open to exploring expanding the current
manifest format to support the same feature set. Proposals welcome.

So, a very simple manifest would look something like:

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css"]
}

If the user navigates to index.html, the following happens:

If the user is online and we haven't checked for updates for the
appcache in the last 5 minutes (300 seconds), we simply ignore the
cache and load index.html and any resources it links to from the
network. We'd simultaneously kick off an update check for the
appcache.

If the user is offline, or if we checked for updates for the appcache
within the last 5 minutes, we use the index.html from the appcache
without hitting the network first. If index.html uses index.js or
index.css, those will be immediately loaded from the cache. If any
other resources are used those will be loaded from the network.
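
As a rough illustration of the two cases above, here is a minimal
sketch. The function name and its parameters are made up for this
example and are not part of the proposal. It also assumes that a cache
which has never been update-checked counts as stale, and that a check
made exactly 300 seconds ago counts as expired.

```javascript
// Hypothetical helper, not part of the proposal: decides whether a
// navigation is served from the appcache without hitting the network.
// `lastUpdateCheckMs` is the time of the last update check in
// milliseconds since the epoch, or null if no check has happened yet.
function serveFromAppCache(manifest, isOnline, lastUpdateCheckMs, nowMs) {
  if (!isOnline) {
    return true; // offline: always use the cached copy
  }
  if (lastUpdateCheckMs === null) {
    return false; // assumption: never checked means "treat as stale"
  }
  const ageSeconds = (nowMs - lastUpdateCheckMs) / 1000;
  // "expiration" is expressed in seconds in the manifest.
  return ageSeconds < manifest.expiration;
}

const simpleManifest = {
  expiration: 300,
  cache: ["index.html", "index.js", "index.css"],
};
const nowMs = Date.now();
// Online, checked one minute ago: served from the cache.
console.log(serveFromAppCache(simpleManifest, true, nowMs - 60 * 1000, nowMs));
// Online, checked ten minutes ago: loaded from the network.
console.log(serveFromAppCache(simpleManifest, true, nowMs - 600 * 1000, nowMs));
```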

Whenever we check for updates for an appcache with the above manifest
we do an if-modified-since/if-none-match for the manifest. We then do
an update check for any resource requested by the manifest. I.e. even
if the manifest hasn't changed we still do an update check for each
resource linked to by the manifest. If any resources were added since
the previous manifest those are obviously simply downloaded. If any
resources were removed from the manifest those are discarded. As an
optimization the UA can start doing update checks on the same set of
URLs that the previous version of the manifest contained.
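
The handling of added, removed and kept resources described above
could be sketched roughly as follows. The function name and the shape
of the returned object are invented for this example.

```javascript
// Hypothetical helper: given the URL lists of the previous and the new
// manifest, work out what the UA has to do for each URL.
function planUpdate(oldUrls, newUrls) {
  const oldSet = new Set(oldUrls);
  const newSet = new Set(newUrls);
  return {
    // Added since the previous manifest: simply downloaded.
    download: newUrls.filter((url) => !oldSet.has(url)),
    // Removed from the manifest: discarded.
    discard: oldUrls.filter((url) => !newSet.has(url)),
    // Present in both: gets an update check.
    revalidate: newUrls.filter((url) => oldSet.has(url)),
  };
}

const plan = planUpdate(
  ["index.html", "index.js", "old.css"],
  ["index.html", "index.js", "index.css"]
);
console.log(plan.download);   // [ 'index.css' ]
console.log(plan.discard);    // [ 'old.css' ]
console.log(plan.revalidate); // [ 'index.html', 'index.js' ]
```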

In order to avoid having to do update checks for all resources, the
manifest can opt in to only checking the manifest and if it hasn't
changed it is assumed that no resources have either. This would look
like:

{
  "version": "5.1",
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css"]
}

For this manifest, when we want to check for updates for the cache we
first do an if-none-match/if-modified-since check for the cache object
itself. If we get back a new resource, *and* that resource contains a
new value for the version property, then we do update checks for all
resources as well as download any new ones.
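
A minimal sketch of that decision, combined with the behavior for
manifests that have no "version" property. The function name is
hypothetical, and passing null for "the server answered 304 Not
Modified" is a convention chosen only for this example. The sketch
follows the rule literally, so a changed manifest with an unchanged
version triggers no resource checks.

```javascript
// Hypothetical helper: should the UA do update checks for the
// individual resources after it has revalidated the manifest?
// `fetchedManifest` is null when the manifest itself was not modified.
function shouldCheckResources(cachedManifest, fetchedManifest) {
  const current = fetchedManifest === null ? cachedManifest : fetchedManifest;
  if (!("version" in current)) {
    return true; // not opted in: every resource is always checked
  }
  if (fetchedManifest === null) {
    return false; // manifest unchanged, so resources are assumed unchanged
  }
  // New manifest: only a new version value triggers resource checks.
  return fetchedManifest.version !== cachedManifest.version;
}

const cachedV51 = { version: "5.1", expiration: 300, cache: ["index.html"] };
console.log(shouldCheckResources(cachedV51, null));                      // false
console.log(shouldCheckResources(cachedV51, { ...cachedV51 }));          // false
console.log(shouldCheckResources(cachedV51, { ...cachedV51, version: "5.2" })); // true
```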

Potentially we should make opting in to this behavior more explicit
than simply having a version number. I.e. we could add a
"revalidateonlyonversionchange" property (with a better name) which is
what triggers the version check, rather than simply the presence of
the "version" property.

In order to further cut down on the number of network requests, we'd
also enable providing last-modified dates or etags directly in the
manifest:

{
  "expiration": 300,
  "cache": [{ "url": "index.html", "etag": "84ba9f" },
            { "url": "index.js",
              "last-modified": "Wed, 1 May 2013 04:58:08 GMT" },
            "index.css"]
}

In other words, each entry in the "cache" array can either be a
string, in which case it's interpreted as a URL to cache, or an object
with a "url" property, in which case the value of the "url" property
is interpreted as the URL to cache, and other properties are treated
as metadata about that URL.

If the etag or the last-modified matches what is already cached, then
no if-modified-since/if-none-match request is made. This works both in
the scenario where a "version" property is set but has changed since
the last update check, as well as when no "version" property exists.
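
To make the two entry forms and the "skip the request" rule concrete,
here is a small sketch. Both function names, and the shape of the
`cached` record holding the validators of the stored copy, are
assumptions made for this example.

```javascript
// Hypothetical helper: turn a "cache" entry into a uniform object.
function normalizeEntry(entry) {
  return typeof entry === "string" ? { url: entry } : entry;
}

// Hypothetical helper: does the UA need to go to the network for this
// entry? `cached` holds the etag / last-modified of the stored copy,
// or is undefined if the URL is not cached yet.
function needsNetworkRequest(entry, cached) {
  const e = normalizeEntry(entry);
  if (cached === undefined) {
    return true; // nothing stored yet: plain download
  }
  if (e.etag !== undefined && e.etag === cached.etag) {
    return false; // manifest etag matches the cached copy
  }
  if (e["last-modified"] !== undefined &&
      e["last-modified"] === cached["last-modified"]) {
    return false; // manifest date matches the cached copy
  }
  return true; // no usable metadata: conditional request needed
}

console.log(needsNetworkRequest({ url: "index.html", etag: "84ba9f" },
                                { etag: "84ba9f" })); // false
console.log(needsNetworkRequest("index.css", { etag: "1c2d3e" })); // true
```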


Additionally we need to support the scenario where a single server
script handles a whole URL space. For example in a bug-tracker, each
bug has a URL like "http://example.com/show_bug?id=1234". When such a
website is cached using appcache, we are unlikely to want to download the full
page for each such URL. This would result in the template for the
show_bug page being downloaded for each bug that is cached. A better
solution is to download a single page which contains the template and
then download just the bug data for the bugs that you want to make
available offline.

To support this we introduce a "urlmap" property:

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css",
            "show_bug_handler.html"],
  "urlmap": [
    {
      "url": "show_bug?id=*",
      "page": "show_bug_handler.html"
    }
  ]
}

The '*' above isn't interpreted as a regular expression. Instead any
URL which ends with exactly a '*' is interpreted as a prefix. So
requests for URLs that start with "show_bug?id=" are handled by the
rule above. When a request to such a URL is made, we immediately
return the cached "show_bug_handler.html" resource. However the
location object will still reflect the URL that was originally
requested. This allows "show_bug_handler.html" to get the id of the
bug that was requested and either fetch the bug data from the network,
or fetch data that is cached locally, for example in IndexedDB.

It might even be useful to allow specifying something like

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css",
            "show_bug_handler.html"],
  "urlmap": [
    {
      "url": ["show_bug?id=*", "bug_summary.html", "bug_query.cgi"],
      "page": "show_bug_handler.html"
    }
  ]
}

Here "show_bug_handler.html" is used to handle any URL which starts
with "show_bug?id=", as well as the URLs "bug_summary.html" and
"bug_query.cgi".

Multiple such rules could be defined using something like

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css",
            "show_bug_handler.html", "forum_handler.html"],
  "urlmap": [
    {
      "url": "show_bug?id=*",
      "page": "show_bug_handler.html"
    },
    {
      "url": ["forum?thread=*", "forum_overview.html"],
      "page": "forum_handler.html"
    }
  ]
}
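
The "urlmap" matching described above (a trailing '*' means prefix
match, anything else is an exact match, and "url" may be a string or an
array) could be sketched like this. The function names are made up,
URLs are assumed to be already resolved relative to the manifest, and
"first matching rule wins" is an assumption, since the ordering between
rules isn't specified above.

```javascript
// Hypothetical helper: a pattern ending in '*' is a prefix, otherwise
// the URL has to match exactly. '*' is not a regular expression.
function matchesPattern(pattern, url) {
  if (pattern.endsWith("*")) {
    return url.startsWith(pattern.slice(0, -1));
  }
  return url === pattern;
}

// Hypothetical helper: returns the handler page for a URL, or null if
// no "urlmap" rule applies.
function findHandlerPage(manifest, url) {
  for (const rule of manifest.urlmap || []) {
    const patterns = Array.isArray(rule.url) ? rule.url : [rule.url];
    if (patterns.some((pattern) => matchesPattern(pattern, url))) {
      return rule.page;
    }
  }
  return null;
}

const mappedManifest = {
  urlmap: [
    { url: "show_bug?id=*", page: "show_bug_handler.html" },
    { url: ["forum?thread=*", "forum_overview.html"],
      page: "forum_handler.html" },
  ],
};
console.log(findHandlerPage(mappedManifest, "show_bug?id=1234"));
// show_bug_handler.html
console.log(findHandlerPage(mappedManifest, "forum_overview.html"));
// forum_handler.html
console.log(findHandlerPage(mappedManifest, "index.html"));
// null
```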

Another feature that has been problematic for developers is handling
of websites that allow users to log in and serve user-specific
content. Specifically, the issue is that resources stored in the
appcache might have been downloaded while the user was logged in and
are thus specific to that user. If the user logs out and another user
uses the same device, that user might navigate to an appcached URL and
thus see the content of the first user.

If the UA had awareness of when a user is logged in or not, the UA
could automatically handle this. So for websites that use
HTTP-authentication, the UA could choose to not serve an appcache if
it was created while a different authentication header was used.

However logins are today almost exclusively handled by simply setting
a user-identifying cookie, which to the UA has no meaning. To fix this
we propose to add a manifest property which tells the UA not to use a
particular appcache if the value of a specific cookie has changed. So
something like:

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css"],
  "cookie-vary": "uid"
}

This would mean that even if the user is offline and navigates to
index.html, if the value of the "uid" cookie is different from when
the appcache was last updated, the appcache would not be returned. A
UA could even use the value of the "uid" cookie as an additional key
in its appcache registry and thus support keeping appcaches for
different users on the same device.
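
A sketch of the check, plus the optional per-user registry key. The
function names and the plain-object cookie maps are assumptions made
for this example; how a UA actually stores cookie values alongside an
appcache is not specified here.

```javascript
// Hypothetical helper: may this appcache be used, given the cookies
// seen at the last update and the cookies present now?
function appCacheUsable(manifest, cookiesAtLastUpdate, currentCookies) {
  const name = manifest["cookie-vary"];
  if (name === undefined) {
    return true; // manifest doesn't vary on any cookie
  }
  return cookiesAtLastUpdate[name] === currentCookies[name];
}

// Hypothetical helper: registry key that keeps one appcache per value
// of the varying cookie, so several users can share a device.
function registryKey(manifestUrl, manifest, cookies) {
  const name = manifest["cookie-vary"];
  if (name === undefined) {
    return manifestUrl;
  }
  const value = cookies[name] === undefined ? "" : cookies[name];
  return manifestUrl + "#" + name + "=" + value;
}

const varyManifest = { expiration: 300, cache: ["index.html"],
                       "cookie-vary": "uid" };
console.log(appCacheUsable(varyManifest, { uid: "alice" }, { uid: "alice" })); // true
console.log(appCacheUsable(varyManifest, { uid: "alice" }, { uid: "bob" }));   // false
```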

The final manifest feature that we'd like to propose is the ability to
hook up a webworker as handler for network requests.

{
  "expiration": 300,
  "cache": ["index.html", "index.js", "index.css",
            "show_bug_handler.html", "forum_handler.html"],
  "network-controller": "httpworker.js",
  "urlmap": [
    {
      "url": "show_bug?id=*",
      "page": "show_bug_handler.html"
    }
  ]
}

The idea here is that the script in "httpworker.js" is started in a
shared-worker-like worker. When a request is made to a URL which isn't
cached or mapped, an event is fired in this worker. The worker then
has an API available to read the details of the request and send
whatever it wants as a response. This means that it could download a
response through the network, or it could load a file from IndexedDB
and use that as the response.
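
Since the worker API itself is not defined here, the following sketch
only covers the dispatch decision: which requests stay with the
appcache and which would be handed to the worker. The function name
and the returned labels are invented for this example, and checking
cached URLs before "urlmap" rules is an assumption.

```javascript
// Hypothetical sketch of where a request ends up. `cachedUrls` is a
// Set of URLs that the appcache currently holds.
function routeRequest(manifest, cachedUrls, url) {
  if (cachedUrls.has(url)) {
    return { via: "cache", resource: url };
  }
  for (const rule of manifest.urlmap || []) {
    const patterns = Array.isArray(rule.url) ? rule.url : [rule.url];
    const hit = patterns.some((p) =>
      p.endsWith("*") ? url.startsWith(p.slice(0, -1)) : url === p);
    if (hit) {
      return { via: "urlmap", resource: rule.page };
    }
  }
  // Neither cached nor mapped: this is where the worker gets its event.
  if (manifest["network-controller"]) {
    return { via: "worker", resource: manifest["network-controller"] };
  }
  return { via: "network", resource: url };
}

const workerManifest = {
  "network-controller": "httpworker.js",
  urlmap: [{ url: "show_bug?id=*", page: "show_bug_handler.html" }],
};
const held = new Set(["index.html", "show_bug_handler.html"]);
console.log(routeRequest(workerManifest, held, "index.html").via);       // cache
console.log(routeRequest(workerManifest, held, "show_bug?id=42").via);   // urlmap
console.log(routeRequest(workerManifest, held, "api/bugs.json").via);    // worker
```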

The details of the API in this worker are something we haven't looked
at yet. It's a very big task in and of itself. Fortunately, Alex
Russell and a few others have worked on a proposal for exactly this at
[2]. The intent is for these two proposals to be aligned such that
they work well together. They are already very complementary in their
feature sets, so this should not be a problem. However this is
something that we've just started looking at, and since both proposals
are still under heavy development, I didn't want to wait until they
are both aligned before publishing what we have so far.

These are the basics of the manifest part of the proposal. In addition
to this we need a Javascript API to allow more fine-grained control
over the behavior of the cache.

Please, please, please don't dismiss the API due to crappy names. I
know many of the properties have names that are too long. I decided to
stick to long names for now so as to make the properties more
self-documenting, since no documentation exists. Suggestions for
shorter names are encouraged.

First we need a way to get at AppCache objects:

partial interface Navigator {
  Future<AppCache> installAppCache(DOMString url);
  Future<AppCache> getAppCache(DOMString url);
  Future<boolean> removeAppCache(DOMString url);
  Future<DOMString[]> getAppCacheList();
};

partial interface Document {
  readonly attribute AppCache? appCache;
  readonly attribute boolean appCacheUpdateAvailable;
  attribute EventHandler onappcacheupdateavailable;
};

The API on the Navigator object allows installing, removing and getting
AppCache objects for the current website. AppCache objects are keyed
on the url of the manifest.

One of the design goals here has been to always have a URL associated
with an appcache manifest. This allows the UA to automatically update
the appcache to the latest version even if the user is not on the
website. This is something that we're interested in doing for Firefox
in cases when we notice that the user is using a particular website a
lot.

The appCache property on the Document object only returns a non-null
value if the current document was loaded through an appcache. However
a document can always use navigator.getAppCache in order to get the
AppCache object that this document would have used if that AppCache
object had been up-to-date.

The "appCacheUpdateAvailable" property indicates if a later version of
the appcache used by this document has been detected *and* downloaded.
I.e. it indicates that new resources would be used if the page was
reloaded.

The Future interface is currently being defined at [3]. It's intended
to be a standardized promise. I've invented a bit of syntax here to
indicate what type the Future produces when it's successful. So
"Future<AppCache>" means that the function returns a Future object,
which when resolved provides an AppCache object.

The actual AppCache object has the following API:

interface AppCache : EventTarget {
  Object manifest;

  // Managing sub-manifests
  Object getSubManifest(url); // returns null if manifest not added
  Future<void> addSubManifest(url);
  Future<boolean> removeSubManifest(url); // throws if you remove main manifest?

  Future<void> cacheURL(DOMString url, optional CacheURLOptions options);
  // Throws if trying to remove something from manifest?
  Future<boolean> removeCachedURL(url);

  Future<boolean> isCached(url);

  Future<AppCacheError[]> getErrorLog();

  readonly attribute Date installed;
  readonly attribute InstallStateEnum installState;
  readonly attribute boolean downloadAvailable;
  readonly attribute boolean downloading;
  void download();
  void cancelDownload();

  attribute EventHandler ondownloadavailable;
  attribute EventHandler ondownloading;
  attribute EventHandler ondownloadsuccess;
  attribute EventHandler ondownloaderror;

  readonly attribute Date? lastUpdateCheck;
  Future<boolean> checkForUpdate();

};

dictionary CacheURLOptions {
  DOMString etag;
  Date lastModified;
};

enum InstallStateEnum {
  "pending",
  "installed",
  "updating"
};

interface AppCacheError {
  DOMString url;
  DOMString httpStatus;
  DOMString httpStatusText;
  Date date;
  // ??? additional information needed
};


The "manifest" property gives access to the full contents of the
manifest. The object returned from this property is the result of
passing the manifest contents to JSON.parse() and then deeply freezing
the returned object.
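
"Deeply freezing" here means freezing the parsed object and everything
reachable from it. A minimal sketch of that step follows; the
`deepFreeze` helper is written for this example (JavaScript only has
the shallow Object.freeze built in).

```javascript
// Recursively freeze a value produced by JSON.parse. Such values are
// trees of plain objects and arrays, so there are no cycles to guard
// against.
function deepFreeze(value) {
  if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value)) {
      deepFreeze(value[key]);
    }
    Object.freeze(value);
  }
  return value;
}

const manifestText =
  '{"expiration": 300, "cache": ["index.html", {"url": "index.js", "etag": "84ba9f"}]}';
const frozenManifest = deepFreeze(JSON.parse(manifestText));

console.log(Object.isFrozen(frozenManifest));          // true
console.log(Object.isFrozen(frozenManifest.cache));    // true
console.log(Object.isFrozen(frozenManifest.cache[1])); // true
```

With this, a page can read the manifest but cannot change the object
it gets back, at any nesting level.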

The next section of properties is for managing sub-manifests. This
enables a website to have separate manifests for separate parts of the
website and then dynamically add or remove the parts that should be
made available offline (or made available faster when the user is
online).

For example a website like wikipedia could keep one manifest for each
wikipedia article. This manifest would request that the article-page
itself, as well as any images needed by it, was cached. Based on
application-level logic wikipedia could then call addSubManifest
whenever another article should be cached. For example each article
could show some UI allowing the user to make the current article
available offline. When clicked, the page would use getAppCache to
grab the top-level AppCache object and call addSubManifest to add the
manifest for the current article.

Whenever the top-level manifest is then updated, so would all articles
that had been linked to it.

The addSubManifest and removeSubManifest functions allow adding and
removing sub-manifests. The getSubManifest function returns the
manifest as a "JSON object" of an added sub-manifest, or null if the
manifest url hasn't yet been added to this AppCache object.
Potentially we should also enable enumerating sub-manifests in the
appcache manifest.

The cacheURL and removeCachedURL functions allow adding and removing
individual resources. RemoveCachedURL is not allowed to remove any
resources enumerated by the manifest or any sub-manifests since this
would create ambiguous situations when that resource is later updated
or if it's added by another submanifest. Instead it's only allowed to
remove resources added through cacheURL.

The use-cases for cacheURL and removeCachedURL are approximately the
same as for sub-manifests, but they are intended to be used when
individual resources need to be dynamically cached, rather than a whole
sub-section of a website.

The isCached function checks if a URL has been cached either through
the main manifest, a submanifest or through cacheURL.
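
The bookkeeping behind these rules could look roughly like the
following. This is a synchronous toy model with an invented class
name; the proposed methods return Futures, and whether
removeCachedURL throws or merely fails for manifest resources is still
an open question, so throwing here is an assumption.

```javascript
// Hypothetical model: remembers where each cached URL came from.
class CachedUrlSet {
  constructor(manifestUrls) {
    // URLs enumerated by the main manifest or any sub-manifest.
    this.fromManifests = new Set(manifestUrls);
    // URLs added dynamically through cacheURL.
    this.fromCacheURL = new Set();
  }

  cacheURL(url) {
    this.fromCacheURL.add(url);
  }

  // Only resources added through cacheURL may be removed.
  removeCachedURL(url) {
    if (this.fromManifests.has(url)) {
      throw new Error("cannot remove a resource enumerated by a manifest");
    }
    return this.fromCacheURL.delete(url); // true if something was removed
  }

  isCached(url) {
    return this.fromManifests.has(url) || this.fromCacheURL.has(url);
  }
}

const urls = new CachedUrlSet(["index.html", "index.js"]);
urls.cacheURL("article-42.html");
console.log(urls.isCached("article-42.html"));        // true
console.log(urls.removeCachedURL("article-42.html")); // true
console.log(urls.isCached("article-42.html"));        // false
```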

The getErrorLog function retrieves a log of errors that have occurred
since the last time getErrorLog was called. The idea is that whenever
the implementation runs into a problem as it's trying to update an
appcache it logs an error in an internal log which is kept
per-AppCache. Since this can happen even if the website isn't
currently open, we can't simply fire an event which contains the error
information. Instead an entry is added to the log. The website can
then grab and flush the log using getErrorLog, most likely to upload
it to the server for human processing and bug tracking.
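
The "grab and flush" behavior amounts to draining a per-AppCache
buffer. A small synchronous sketch, with an invented class name and
method names; the entry fields mirror the AppCacheError interface
above.

```javascript
// Hypothetical per-AppCache error log.
class AppCacheErrorLog {
  constructor() {
    this.entries = [];
  }

  // Called by the implementation whenever an update runs into a
  // problem, whether or not the website is currently open.
  record(url, httpStatus, httpStatusText, date) {
    this.entries.push({ url, httpStatus, httpStatusText, date });
  }

  // What getErrorLog would resolve with: everything logged since the
  // previous call. The log is emptied as a side effect.
  drain() {
    const drained = this.entries;
    this.entries = [];
    return drained;
  }
}

const log = new AppCacheErrorLog();
log.record("index.js", "404", "Not Found", new Date());
console.log(log.drain().length); // 1
console.log(log.drain().length); // 0
```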

So getErrorLog should asynchronously return an array of AppCacheError
objects. I'm not quite sure about what exactly to include in these
objects so suggestions welcome. We likely have to severely limit the
type of information we can expose for cross-origin requests though
unfortunately.

The last group of properties is for managing updates of the cache.
The idea with this part of the API is to basically treat an appcached
website as an "app". This enables a website to detect if an update is
available as well as to download that update and detect when the
download has finished.

This part of the API is admittedly quite complex. Suggestions for how
to simplify it are welcome. The goals with this set of properties are:
* Enable the website to build UI which tells a user when an update is
available for a particular part of the website, as well as when that
update is downloaded.
* Allow the UA to update an appcache based on its own heuristics or UI
and allow a currently-open website to update its UI whenever this
happens.
* Enable a website to update AppCache objects when visiting other
pages on the website.

For websites that are happy to let the UA handle updates, this API
doesn't need to be used at all.

The "installed" property returns the Date when the AppCache object was
first created. This happens before all the cached resources for the
AppCache were successfully downloaded.

The "installState" is initially "pending" when an AppCache object is
created. Once all the resources enumerated in the manifest have been
downloaded it changes state to "installed". Once an update is detected
and starts being downloaded, the value changes to "updating" until
the update has been fully downloaded.
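
Written out as a transition table, the state changes look like this.
The event names "downloadstart" and "downloadsuccess" are labels
chosen for this sketch, and leaving the state unchanged for any other
event (for example a failed download) is an assumption.

```javascript
// The three installState values and the transitions between them.
const INSTALL_STATE_TRANSITIONS = {
  pending: { downloadsuccess: "installed" },  // initial download done
  installed: { downloadstart: "updating" },   // update detected, started
  updating: { downloadsuccess: "installed" }, // update fully downloaded
};

// Hypothetical helper: the state after `event` happens in `state`.
function nextInstallState(state, event) {
  const transitions = INSTALL_STATE_TRANSITIONS[state] || {};
  return transitions[event] || state;
}

let state = "pending";
state = nextInstallState(state, "downloadsuccess");
console.log(state); // installed
state = nextInstallState(state, "downloadstart");
console.log(state); // updating
state = nextInstallState(state, "downloadsuccess");
console.log(state); // installed
```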

The "downloadAvailable" attribute returns true if an update has been
detected but is not yet fully downloaded.

"downloading" returns true if the appcache is currently in the process
of downloading an update for this appcache. It's also true if the
appcache is doing the initial download of the appcached resources.

The "download" function allows a website to download an update for an
appcache object even if the user isn't currently using that appcache.

The "cancelDownload" function allows such an update to be cancelled,
whether the website itself triggered the download or the UA triggered
it, either because the user browsed to a page that uses that appcache
or because the UA thinks the user is using the cache often enough
that it's keeping it up-to-date.

The "lastUpdateCheck" attribute and "checkForUpdate" function let a
page see when the last update check happened and trigger one manually.
This is useful for security-conscious pages that want to make sure
that an update is rolled out immediately when it's available. Right
now some websites hand-roll this functionality by using an XHR object
to check in with the server.


This is basically the whole thing. Below are some additional
implementation requirements as well as a pile of open questions (some
of which are basically a brain-dump so please ask if they are
incomprehensible).


Implementation requirements:
If update fails, don't throw away resources but rather re-attempt to
download the missing ones at next update time.

Can cache URLs on any server, but never captures cross-origin URLs.
The URLs can be cross-origin from the manifest, but not cross-origin
from the HTML page that links to the manifest. I.e. the origin of an
AppCache is determined by the origin of the HTML page that created it.
HTML-pages can't be cross-origin from that.
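
The origin rule for captured pages boils down to a same-origin
comparison against the page that created the AppCache. A minimal
sketch, using the standard URL class; the function name and the
example URLs are made up.

```javascript
// Hypothetical helper: may a navigation to `candidateUrl` be captured
// by an AppCache that was created by the page at `creatorPageUrl`?
// Both arguments must be absolute URLs.
function sameOriginAsCache(creatorPageUrl, candidateUrl) {
  return new URL(candidateUrl).origin === new URL(creatorPageUrl).origin;
}

console.log(sameOriginAsCache("http://example.com/index.html",
                              "http://example.com/show_bug?id=1234")); // true
console.log(sameOriginAsCache("http://example.com/index.html",
                              "http://cdn.example.net/index.html"));   // false
```

Note that this only covers which pages an AppCache may capture.
Resources fetched and stored by the cache can still live on any
server.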

The expiration attribute for the manifest overrides the caching
headers for the manifest URL.

Allow cross-origin manifest using CORS (or other opt-in)

Don't use heuristics for estimating expiration dates for URLs cached
in the appcache. Explicit headers are honored (unless overridden by
the appcache manifest), but heuristics based on last-modification
dates or similar are not allowed.


Outstanding questions:
* What should happen if an appcache caches index.html, but index.html
links to another cache? If it doesn't link to any cache we should
probably treat that as if it linked to the cache, but what to do if it
explicitly links to another cache?
* What should happen if a version property exists and contains the
same value, but resources have been added or removed. Or have had
their etags/last-modified changed.
* Should we add support for "optional urls"? I.e. ones which are OK if
they fail to download? If so, do we need to specify handling for a
failed download (use old version vs. use 404)?
* Rather than using the map feature to handle cache-busting URLs,
should we introduce a list of URLs for which to "ignore query
parameters"?
* Is a "capture" feature needed? I.e. a list of URLs which if the user
navigates to, the appcache should be used. This would have to be a
subset of the set of URLs that the appcache has cached or mapped.
* The map-to-worker feature could allow setting the worker property
to a worker-url in order to support multiple httpworkers. Needed?
* How do we solve moving appcache manifests? Can that even be done
while also supporting the browser updating the appcache automatically
without the user visiting the website?
* This is not solving Microsoft's use case of having multiple apps on
the same URL.
* This doesn't contain a "check network, otherwise use cached
resource" ability. Could be added through additional "map" rules if we
need it.
* Do we need some feature to avoid getting broken sites if we're
downloading an appcache just as the app is being updated? How about a
notion of "this resource is compatible with manifest etag W/rev-5."
either in the representation or its HTTP headers.
* We could add the ability to say "force revalidate" on the urls in cache.
* Do we need the captive-portal-detection feature?
* How do we support comments in the manifest? One way would be to use
some form of extended JSON which supports comments. Another way would
be to advocate people sticking properties named "//" in the manifest.
* Should the main appcache be considered "ready to use" even if all
submanifests are not yet downloaded? This could enable loading one
part of a website even if later sections are still loading.

[1] http://etherpad.mozilla.org/appcache
[2] https://github.com/slightlyoff/NavigationController
[3] https://github.com/slightlyoff/DOMFuture

/ Jonas
Received on Tuesday, 26 March 2013 07:03:57 UTC