- From: Jonas Sicking <jonas@sicking.cc>
- Date: Tue, 26 Mar 2013 00:02:58 -0700
- To: Webapps WG <public-webapps@w3.org>
Hi WebApps!

Apologies in advance for a long email. This is a complex subject and I wanted to present a coherent proposal. Please don't be shy about starting separate threads when providing feedback.

There has been a lot of debating about "fixing appcache". Last year Mozilla got a few people together, mostly with the goal of understanding what the actual problems were. The notes from that meeting are available at [1].

Those discussions, and a few followup ones, have made it clear that there were a few big ticket items that we needed to fix:

* The fact that master entries are automatically added to the cache works very poorly for a lot of developers.
* Once a website is cached the user will only see the new version on second load, even if the user is online. This is good for performance but is a behavior many websites aren't willing to live with.
* You have to tweak a comment in the manifest in order to trigger an update-check of the cached resources.
* We need an "escape hatch" for people running into missing features in the appcache. I.e. a way for websites to use script to complement the set of behaviors supported by the appcache spec.
* The fact that FALLBACK combines the "hit network first, fall back to a cached resource" and "allow a cached resource to handle requests to a whole URL space" behaviors is problematic since many times you want one and not the other.
* People want to use appcache not just to make offline apps possible, but also to make online apps fast.
* There isn't enough ability to control the appcache through JavaScript.

There are certainly other things that people have mentioned, but the above have been a recurring theme. Feel free to comment here if you have other issues with the current appcache, but it might be worth doing that as separate threads.

I believe that some of these problems stem from a relatively small set of design problems:

The appcache appears to be aimed at too simple applications.
It works fine if the website you want to cache consists of a small set of static resources and otherwise only uses features like IndexedDB or localStorage to manage dynamic data. But once an application uses server-side processing to dynamically generate resources based on query parameters or other parts of the URL, then it requires that you change the way that your application works.

Another design aspect that appears to be causing problems is that appcache is optimized too heavily for minimizing the amount of typing that the author has to do. It attempts to help the author too much, for example by automatically adding master entries that link to an appcache but aren't enumerated in it. Or by automatically adding the "handler" URLs from the FALLBACK section to the set of URLs to be automatically downloaded.

The result is that the appcache contains too much "magic". In theory an author can just type very little and the appcache will automatically do the right thing. However this magic is making it too hard for authors to understand what's going on. The result is that people don't use the appcache even if they might have needed to type very little to get it working. Implementations certainly haven't helped here either, by not exposing the behind-the-scenes logic through debugging and developer tools.

The fact that the appcache is aimed at simple applications is generally a good thing given that it's the first version. However the desire to make applications available offline, as well as make them faster when the user is online, has been so great that people have wanted to use the appcache for a larger set of types of applications. So to some extent the appcache has been a victim of its own success.

The other week a few of us at Mozilla got together to discuss how to "fix appcache". I.e. how to come up with a solution for the above mentioned problems. We came to a few conclusions.

We still want to try to keep a declarative solution.
While the current appcache appears to only work really well for very simple applications, we hope that it is possible to find a declarative format which is simple enough to be understandable and practical, while still supporting a large number of applications.

However, we do think that there needs to be a script-driven "fallback". A declarative solution can't ever cover everything that people want to do. For example some websites will want to use complex algorithms for determining when a given resource is out-of-date, in order to avoid redownloading a new version of a script when the existing version is "good enough". Others are using more complex URL schemes, which means that prefix matching doesn't work. Such a script-based solution also has the benefit that it can feed into future versions of a declarative format.

This leaves the question of exactly which feature set we should put in the declarative solution. This is a very tricky question and likely something we'll have to iterate a lot on after getting feedback from authors. One thing to keep in mind here is that the most complex websites out there are likely ones that we'll never be able to capture fully with a declarative format.

I'm actually still not 100% convinced that we can find a feature set which is appropriate to put in a declarative solution. But I'm really hoping we can, so I definitely want to give it a shot.

So, with that as background, let me present the proposal that we currently have.

The feature set that we aimed for is:

* A set of URLs to be downloaded and cached.
* Control over online behavior. I.e. what to do if resources are cached but the user is online.
* Choose if an update-check is done by only checking the manifest or by checking all the resources linked to by the manifest. And if update checking is done by checking for updates of the manifest, use an explicit version indicator rather than simply a comment.
* A way to map URL spaces to be handled by specific resources.
* Ability to specify last-modification-date or etags for individual resources so that we don't need to do if-modified-since or if-none-match requests for those.
* Sub-manifests. I.e. a way to add another manifest to a cache which causes all the resources from that manifest to also get cached.
* A way to invalidate an appcache automatically if a user-identifying "login-cookie" changes.
* JavaScript API to allow control over an appcache.
* A way to "plug in" a webworker to handle network requests in order to support more advanced usecases.

Another "feature" that we are proposing is to drop the current manifest format and instead use a JSON based one. The simplest reason for this is that we noticed that the information we need to express quickly became complex enough that using a format with simple parsing rules was beneficial. A format based on extending the current appcache format would be no problem for a UA to parse. However the complexity that we need to express resulted in something that's too hard for a human to manually write, or for a human to understand when looking at somebody else's manifest in order to learn. The simple parsing rules for JSON seemed like a better fit. It also provides more of an opportunity to extend the format in the future.

JSON also has advantages when it comes to creating APIs exposed to webpages for interacting with appcaches. More about this below.

That said, we are definitely open to exploring expanding the current manifest format to support the same feature set. Proposals welcome.

So, a very simple manifest would look something like:

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css"]
  }

If the user navigates to index.html the following happens:

If the user is online and we haven't checked for update for the appcache in the last 5 minutes (300 seconds) we simply ignore the cache and load index.html and any resources it links to from the network. We'd simultaneously kick off an update check for the appcache.

If the user is offline, or if we checked for update for the appcache within the last 5 minutes, we use the index.html from the appcache without hitting the network first. If index.html uses index.js or index.css, those will be immediately loaded from the cache. If any other resources are used those will be loaded from the network.

Whenever we check for updates for an appcache with the above manifest we do an if-modified-since/if-none-match for the manifest. We then do an update check for any resource requested by the manifest. I.e. even if the manifest hasn't changed we still do an update check for each resource linked to by the manifest. If any resources were added since the previous manifest those are obviously simply downloaded. If any resources were removed from the manifest those are discarded.

As an optimization the UA can start doing update checks on the same set of URLs that the previous version of the manifest contained.

In order to avoid having to do update checks for all resources, the manifest can opt in to only checking the manifest, and if it hasn't changed it is assumed that no resources have either. This would look like:

  {
    "version": "5.1",
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css"]
  }

For this manifest, when we want to check for update for the cache we first do an if-none-match/if-modified-since check for the cache object itself. If we get back a new resource, *and* that resource contains a new value for the version property, then we do update checks for all resources as well as download any new ones.

Potentially we should make opting in to this behavior more explicit than simply having a version number. I.e. we could add a "revalidateonlyonversionchange" property (with a better name) which is what triggers the version check, rather than simply the presence of the "version" property.
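
The two decisions described above (when to serve from the cache, and when individual resources need an update check) can be modeled in a few lines. This is only a sketch of the UA-side logic as worded in this email, not a specified algorithm. The function names are hypothetical, and treating exactly 300 elapsed seconds as "expired" is an assumption.

```python
from typing import Optional


def serve_from_cache(online: bool,
                     seconds_since_check: Optional[float],
                     expiration: int) -> bool:
    """Return True if a navigation should be answered from the appcache."""
    if not online:
        return True   # offline: use the cached copy
    if seconds_since_check is None:
        return False  # assumption: never checked counts as stale
    # Online: use the cache only if we checked for updates recently enough.
    return seconds_since_check < expiration


def must_revalidate_resources(old_manifest: dict,
                              new_manifest: Optional[dict]) -> bool:
    """Decide whether each cached resource needs its own update check.

    new_manifest is None when the conditional request for the manifest
    reported that it has not changed.
    """
    if "version" not in old_manifest:
        return True   # no opt-in: always check every resource
    if new_manifest is None:
        return False  # manifest unchanged, resources assumed unchanged
    # Manifest changed: resources are checked only if the version changed too.
    return new_manifest.get("version") != old_manifest["version"]
```

For example, with "expiration": 300 an online user whose last update check was 120 seconds ago gets index.html from the cache, while one whose last check was 301 seconds ago goes to the network.
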

In order to further cut down on the number of network requests, we'd also enable providing last-modified dates or etags directly in the manifest:

  {
    "expiration": 300,
    "cache": [{ "url": "index.html", "etag": "84ba9f" },
              { "url": "index.js", "last-modified": "Wed, 1 May 2013 04:58:08 GMT" },
              "index.css"]
  }

In other words, each entry in the "cache" array can either be a string, in which case it's interpreted as a URL to cache, or an object with a "url" property, in which case the value of the "url" property is interpreted as the URL to cache, and other properties are treated as metadata about that URL.

If the etag or the last-modified matches what is already cached, then no if-modified-since/if-none-match request is made. This works both in the scenario where a "version" property is set but has changed since the last update check, as well as when no "version" property exists.

Additionally we need to support the scenario where a single server script handles a whole URL space. For example in a bug-tracker, each bug has a URL like "http://example.com/show_bug?id=1234". When such a website is cached using appcache we are unlikely to want to download the full page for each such URL. This would result in the template for the show_bug page being downloaded for each bug that is cached. A better solution is to download a single page which contains the template and then download just the bug data for the bugs that you want to make available offline.

To support this we introduce a "urlmap" property:

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css", "show_bug_handler.html"],
    "urlmap": [
      { "url": "show_bug?id=*", "page": "show_bug_handler.html" }
    ]
  }

The '*' above isn't interpreted as a regular expression. Instead any url which ends with exactly a '*' is interpreted as a prefix. So requests for URLs that start with "show_bug?id=" are handled by the rule above. When a request to such a URL is made, we immediately return the cached "show_bug_handler.html" resource.
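
A minimal sketch of the "urlmap" matching rule just described: a pattern ending in '*' is a prefix match, anything else is an exact match. The function names are hypothetical. That the first matching rule wins, and that request URLs have already been made relative to the manifest, are assumptions that the proposal does not state.

```python
from typing import List, Optional


def pattern_matches(pattern: str, url: str) -> bool:
    """A trailing '*' means prefix match; it is not a regular expression."""
    if pattern.endswith("*"):
        return url.startswith(pattern[:-1])
    return url == pattern


def resolve_urlmap(urlmap: List[dict], url: str) -> Optional[str]:
    """Return the cached handler page for url, or None if no rule applies."""
    for rule in urlmap:
        if pattern_matches(rule["url"], url):
            return rule["page"]  # assumption: first matching rule wins
    return None


urlmap = [{"url": "show_bug?id=*", "page": "show_bug_handler.html"}]
handler = resolve_urlmap(urlmap, "show_bug?id=1234")
```

Here handler is "show_bug_handler.html", while a request for "index.html" matches no rule and would be answered from the "cache" list instead.
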
However the location object will still reflect the URL that was originally requested. This allows "show_bug_handler.html" to get the id of the bug that was requested and either fetch the bug data from the network, or fetch data that is cached locally, for example in IndexedDB.

It might even be useful to allow specifying something like

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css", "show_bug_handler.html"],
    "urlmap": [
      { "url": ["show_bug?id=*", "bug_summary.html", "bug_query.cgi"],
        "page": "show_bug_handler.html" }
    ]
  }

Here "show_bug_handler.html" is used to handle any URL which starts with "show_bug?id=", as well as the URLs "bug_summary.html" and "bug_query.cgi".

Multiple such rules could be defined using something like

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css",
              "show_bug_handler.html", "forum_handler.html"],
    "urlmap": [
      { "url": "show_bug?id=*", "page": "show_bug_handler.html" },
      { "url": ["forum?thread=*", "forum_overview.html"],
        "page": "forum_handler.html" }
    ]
  }

Another feature that has been problematic for developers is handling of websites that allow users to log in and serve user-specific content. Specifically, the issue is that resources stored in the appcache might have been downloaded while the user was logged in and are thus specific to that user. If the user logs out and another user uses the same device, that user might navigate to an appcached URL and thus see the content of the first user.

If the UA had awareness of when a user is logged in or not, the UA could automatically handle this. So for websites that use HTTP-authentication, the UA could choose to not serve an appcache if it was created while a different authentication header was used. However logins are today almost exclusively handled by simply setting a user-identifying cookie, which to the UA has no meaning.

To fix this we propose to add a manifest property which indicates to the UA not to use a particular appcache if the value of a specific cookie has changed. So something like:

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css"],
    "cookie-vary": "uid"
  }

This would mean that even if the user is offline and navigates to index.html, if the value of the "uid" cookie is different from when the appcache was last updated, the appcache would not be returned.

A UA could even use the value of the "uid" cookie as an additional key in its appcache registry and thus support keeping appcaches for different users on the same device.

The final manifest feature that we'd like to propose is the ability to hook up a webworker as handler for network requests.

  {
    "expiration": 300,
    "cache": ["index.html", "index.js", "index.css",
              "show_bug_handler.html", "forum_handler.html"],
    "network-controller": "httpworker.js",
    "urlmap": [
      { "url": "show_bug?id=*", "page": "show_bug_handler.html" }
    ]
  }

The idea here is that the script in "httpworker.js" is started in a shared-worker-like worker. When a request to a URL which isn't cached or mapped happens, an event is fired in this worker. The worker then has an API available to read the details of the request and send whatever it wants as a response. This means that it could download a response through the network, or it could load a file from IndexedDB and use that as response.

The details of the API in this worker are something we haven't looked at yet. It's a very big task in and of itself. Fortunately, Alex Russell and a few others have worked on a proposal for exactly this at [2]. The intent is for these two proposals to be aligned such that they work well together. They are already very complementary in their feature sets, so this should not be a problem.
However this is something that we've just started looking at, and since both proposals are still under heavy development, I didn't want to wait until they are both aligned before publishing what we have so far.

This is the basics of the manifest part of the proposal. In addition to this we need a JavaScript API to allow more fine-grained control over the behavior of the cache.

Please, please, please don't dismiss the API due to crappy names. I know many of the properties have too long names. I decided to stick to long names for now so as to make the properties more self-documenting, since no documentation exists. Suggestions for shorter names encouraged.

First we need a way to get at AppCache objects:

  partial interface Navigator {
    Future<AppCache> installAppCache(url);
    Future<AppCache> getAppCache(url);
    Future<boolean> removeAppCache(url);
    Future<DOMString[]> getAppCacheList();
  }

  partial interface Document {
    AppCache appCache;
    readonly attribute boolean appCacheUpdateAvailable;
    attribute EventHandler onappcacheupdateavailable;
  }

The API on the Navigator object allows installing, removing and getting AppCache objects for the current website. AppCache objects are keyed on the url of the manifest. One of the design goals here has been to always have a URL associated with an appcache manifest. This allows the UA to automatically update the appcache to the latest version even if the user is not on the website. This is something that we're interested in doing for Firefox in cases when we notice that the user is using a particular website a lot.

The appCache property on the Document object only returns a non-null value if the current document was loaded through an appcache. However a document can always use navigator.getAppCache in order to get the AppCache object that this document would have used if that AppCache object had been up-to-date.
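
To make the keying concrete, here is a toy in-memory model of a UA registry in which AppCache objects are keyed on the manifest URL, with the value of the "cookie-vary" cookie (from the manifest section earlier in this email) used as an additional key. It is a synchronous sketch under stated assumptions, not the proposed API: the class and method names are hypothetical, there are no Futures, and nothing is actually downloaded.

```python
from typing import Dict, List, Optional, Tuple


class AppCacheRegistry:
    """Hypothetical registry: one entry per (manifest URL, cookie value)."""

    def __init__(self) -> None:
        self._caches: Dict[Tuple[str, Optional[str]], dict] = {}

    @staticmethod
    def _vary_value(manifest: dict, cookies: Dict[str, str]) -> Optional[str]:
        # Value of the cookie named by "cookie-vary", or None if unused/unset.
        name = manifest.get("cookie-vary")
        return cookies.get(name) if name else None

    def install(self, manifest_url: str, manifest: dict,
                cookies: Dict[str, str]) -> dict:
        key = (manifest_url, self._vary_value(manifest, cookies))
        entry = {"manifest_url": manifest_url, "manifest": manifest}
        self._caches[key] = entry
        return entry

    def get(self, manifest_url: str, cookies: Dict[str, str]) -> Optional[dict]:
        # Only return a cache created under the same cookie value.
        for (url, vary_value), entry in self._caches.items():
            if url == manifest_url and \
                    vary_value == self._vary_value(entry["manifest"], cookies):
                return entry
        return None

    def remove(self, manifest_url: str) -> bool:
        keys = [k for k in self._caches if k[0] == manifest_url]
        for k in keys:
            del self._caches[k]
        return bool(keys)

    def manifest_urls(self) -> List[str]:
        return sorted({url for url, _ in self._caches})


registry = AppCacheRegistry()
manifest = {"expiration": 300, "cache": ["index.html"], "cookie-vary": "uid"}
registry.install("/app.manifest", manifest, {"uid": "alice"})
```

With this model a lookup made while the "uid" cookie is "bob" finds nothing, so a second user on the same device never sees the first user's cached pages.
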

The "appCacheUpdateAvailable" property indicates if a later version of the appcache used by this document has been detected *and* downloaded. I.e. it indicates that new resources would be used if the page was reloaded.

The Future interface is currently being defined at [3]. It's intended to be a standardized promise. I've invented a bit of syntax here to indicate what type the Future produces when it's successful. So "Future<AppCache>" means that the function returns a Future object, which when resolved provides an AppCache object.

The actual AppCache object has the following API:

  interface AppCache : EventTarget {
    Object manifest;

    // Managing sub-manifests
    Object getSubManifest(url); // returns null if manifest not added
    Future<void> addSubManifest(url);
    Future<boolean> removeSubManifest(url); // throws if you remove main manifest?

    Future<void> cacheURL(DOMString url, CacheURLOptions);
    Future<boolean> removeCachedURL(url); // Throws if trying to remove something from manifest?
    Future<boolean> isCached(url);

    Future<AppCacheError[]> getErrorLog();

    readonly attribute Date installed;
    readonly attribute InstallStateEnum installState;

    readonly attribute boolean downloadAvailable;
    readonly attribute boolean downloading;
    void download();
    void cancelDownload();
    attribute EventHandler ondownloadavailable;
    attribute EventHandler ondownloading;
    attribute EventHandler ondownloadsuccess;
    attribute EventHandler ondownloaderror;

    readonly attribute Date? lastUpdateCheck;
    Future<boolean> checkForUpdate();
  };

  dictionary CacheURLOptions {
    DOMString etag;
    Date lastModified;
  };

  enum InstallStateEnum { "pending", "installed", "updating" };

  interface AppCacheError {
    DOMString url;
    DOMString httpStatus;
    DOMString httpStatusText;
    Date date;
    // ??? additional information needed
  }

The "manifest" property gives access to the full contents of the manifest.
The object returned from this property is the result of passing the manifest contents to JSON.parse() and then deeply freezing the returned object.

The next section of properties is for managing sub-manifests. This enables a website to have separate manifests for separate parts of the website and then dynamically add or remove the parts that should be made available offline (or made available faster when the user is online).

For example a website like Wikipedia could keep one manifest for each Wikipedia article. This manifest would request that the article-page itself, as well as any images needed by it, be cached. Based on application-level logic Wikipedia could then call addSubManifest whenever another article should be cached. For example each article could show some UI allowing the user to make the current article available offline. When clicked, the page would use getAppCache to grab the top-level AppCache object and call addSubManifest to add the manifest for the current article. Whenever the top-level manifest is then updated, so are all articles that have been linked to it.

The addSubManifest and removeSubManifest functions allow adding and removing sub-manifests. The getSubManifest function returns the manifest as a "JSON object" of an added sub-manifest, or null if the manifest url hasn't yet been added to this AppCache object.

Potentially we should also enable enumerating sub-manifests in the appcache manifest.

The cacheURL and removeCachedURL functions allow adding and removing individual resources. removeCachedURL is not allowed to remove any resources enumerated by the manifest or any sub-manifests, since this would create ambiguous situations when that resource is later updated or if it's added by another sub-manifest. Instead it's only allowed to remove resources added through cacheURL.
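
The bookkeeping behind sub-manifests and cacheURL can be sketched as three separate sets of URLs, which makes the removeCachedURL restriction easy to see. This is a hypothetical synchronous model, not the proposed API. Raising an error for manifest-listed resources is an assumption, since the interface above leaves that as an open question.

```python
from typing import Dict, List, Set


class CachedResources:
    """Hypothetical model of which URLs an AppCache object holds, and why."""

    def __init__(self, main_manifest_urls: List[str]) -> None:
        self.main: Set[str] = set(main_manifest_urls)
        self.sub_manifests: Dict[str, Set[str]] = {}
        self.dynamic: Set[str] = set()  # added through cacheURL

    def add_sub_manifest(self, manifest_url: str, urls: List[str]) -> None:
        self.sub_manifests[manifest_url] = set(urls)

    def remove_sub_manifest(self, manifest_url: str) -> bool:
        return self.sub_manifests.pop(manifest_url, None) is not None

    def cache_url(self, url: str) -> None:
        self.dynamic.add(url)

    def _listed_in_manifest(self, url: str) -> bool:
        return url in self.main or any(
            url in urls for urls in self.sub_manifests.values())

    def remove_cached_url(self, url: str) -> bool:
        if self._listed_in_manifest(url):
            # Assumption: reject rather than silently ignore.
            raise ValueError(url + " is enumerated by a manifest")
        if url in self.dynamic:
            self.dynamic.remove(url)
            return True
        return False

    def is_cached(self, url: str) -> bool:
        return self._listed_in_manifest(url) or url in self.dynamic


cache = CachedResources(["index.html"])
cache.add_sub_manifest("article1.manifest", ["article1.html", "img1.png"])
cache.cache_url("extra.json")
```

In this model "extra.json" can be removed again with remove_cached_url, while "index.html" and "img1.png" can only go away by changing or removing the manifest that lists them.
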

The use-cases for cacheURL and removeCachedURL are approximately the same as for sub-manifests, but they are intended to be used when individual resources need to be dynamically cached, rather than a whole sub-section of a website.

The isCached function checks if a URL has been cached either through the main manifest, a sub-manifest or through cacheURL.

The getErrorLog function retrieves a log of errors that have occurred since the last time getErrorLog was called. The idea is that whenever the implementation runs into a problem as it's trying to update an appcache it logs an error in an internal log which is kept per-AppCache. Since this can happen even if the website isn't currently open, we can't simply fire an event which contains the error information. Instead an entry is added to the log. The website can then grab and flush the log using getErrorLog, most likely to upload it to the server for human processing and bug tracking.

So getErrorLog should asynchronously return an array of AppCacheError objects. I'm not quite sure about what exactly to include in these objects, so suggestions welcome. Unfortunately we likely have to severely limit the type of information we can expose for cross-origin requests.

The last group of properties is for managing updates of the cache. The idea with this part of the API is to basically treat an appcached website as an "app". This enables a website to detect if an update is available as well as to download that update and detect when the download has finished.

This part of the API is admittedly quite complex. Suggestions for how to simplify it are welcome. The goals with this set of properties are:

* Enable the website to build UI which tells a user when an update is available for a particular part of the website, as well as when that update is downloaded.
* Allow the UA to update an appcache based on its own heuristics or UI and allow a currently-open website to update its UI whenever this happens.
* Enable a website to update AppCache objects when visiting other pages on the website.

For websites that are happy to let the UA handle updates, this API doesn't need to be used at all.

The "installed" property returns the Date when the AppCache object was first created. This happens before all the cached resources for the AppCache were successfully downloaded.

The "installState" is initially "pending" when an AppCache object is created. Once all the resources enumerated in the manifest have been downloaded it changes state to "installed". Once an update is detected and starts downloading, the value changes to "updating" until the update has been fully downloaded.

The "downloadAvailable" attribute returns true if an update has been detected but is not yet fully downloaded.

"downloading" returns true if the appcache is currently in the process of downloading an update for this appcache. It's also true if the appcache is doing the initial download of the appcached resources.

The "download" function allows a website to download an update for an appcache object even if the user isn't currently using that appcache. The "cancelDownload" function allows such an update to be cancelled, either if the website itself triggered the download, or if the UA triggered it either by the user browsing to a page that used that appcache, or if it thinks the user is using the cache often enough that it's keeping it up-to-date.

The "lastUpdateCheck" attribute and "checkForUpdate" function let a page trigger a manual update check. This is useful for security conscious pages that want to make sure that an update is rolled out immediately when it's available. Right now some websites hand-roll this functionality by using an XHR object to check in with the server.

This is basically the whole thing. Below are some additional implementation requirements as well as a pile of open questions (some of which are basically a brain-dump so please ask if they are incomprehensible).
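
The interplay of "installState", "downloadAvailable" and "downloading" described above can be read as a small state machine. The sketch below is one possible interpretation, not part of the proposal. The class and method names are hypothetical, it assumes the initial download starts as soon as the object is created, and it assumes that cancelling an update returns to "installed" while leaving the update available.

```python
class AppCacheState:
    """Hypothetical model of the update-related attributes of an AppCache."""

    def __init__(self) -> None:
        self.install_state = "pending"
        self.download_available = False
        self.downloading = True  # assumption: initial download starts at once

    def initial_download_finished(self) -> None:
        if self.install_state != "pending":
            raise RuntimeError("initial download already finished")
        self.install_state = "installed"
        self.downloading = False

    def update_detected(self) -> None:
        # An update exists on the server but has not been downloaded yet.
        self.download_available = True

    def download(self) -> None:
        if not self.download_available:
            return  # nothing to download
        self.install_state = "updating"
        self.downloading = True

    def cancel_download(self) -> None:
        if self.install_state == "updating":
            self.install_state = "installed"
            self.downloading = False  # the update is still available

    def download_finished(self) -> None:
        self.install_state = "installed"
        self.downloading = False
        self.download_available = False


state = AppCacheState()
state.initial_download_finished()
state.update_detected()
state.download()
```

At this point install_state is "updating" and both download_available and downloading are true. Calling download_finished() returns the object to "installed" with both flags cleared.
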

Implementation requirements:

If update fails, don't throw away resources but rather re-attempt to download the missing ones at next update time.

Can cache URLs on any server, but never captures cross-origin URLs. The URLs can be cross-origin from the manifest, but not cross-origin from the HTML page that links to the manifest. I.e. the origin of an AppCache is determined by the origin of the HTML page that created it. HTML-pages can't be cross-origin from that.

The expiration attribute for the manifest overrides the caching headers for the manifest URL.

Allow cross-origin manifest using CORS (or other opt-in).

Don't use heuristics for estimating expiration dates for URLs cached in the appcache. Explicit headers are honored (unless overridden by the appcache manifest), but heuristics based on last-modification dates or similar are not allowed.

Outstanding questions:

* What should happen if an appcache caches index.html, but index.html links to another cache? If it doesn't link to any cache we should probably treat that as if it linked to the cache, but what to do if it explicitly links to another cache?
* What should happen if a version property exists and contains the same value, but resources have been added or removed? Or have had their etags/last-modified changed?
* Should we add support for "optional urls"? I.e. ones which are ok if they fail to download? If so, do we need to specify handling for a failed download (use old version vs. use 404)?
* Rather than using the map feature to handle cache-busting URLs, should we introduce a list of URLs for which to "ignore query parameters"?
* Is a "capture" feature needed? I.e. a list of URLs which if the user navigates to, the appcache should be used. This would have to be a subset of the set of URLs that the appcache has cached or mapped.
* The map-to-worker feature could allow setting the worker property to a worker-url in order to support multiple httpworkers. Needed?
* How do we solve moving appcache manifests?
  Can that even be done while also supporting the browser updating the appcache automatically without the user visiting the website?
* This is not solving Microsoft's use case of having multiple apps on the same URL.
* This doesn't contain a "check network, otherwise use cached resource" ability. Could be added through additional "map" rules if we need it.
* Do we need some feature to avoid getting broken sites if we're downloading an appcache just as the app is being updated? How about a notion of "this resource is compatible with manifest etag W/rev-5", either in the representation or its HTTP headers?
* We could add the ability to say "force revalidate" on the urls in cache.
* Do we need the captive-portal-detection feature?
* How do we support comments in the manifest? One way would be to use some form of extended JSON which supports comments. Another way would be to advocate people sticking properties named "//" in the manifest.
* Should the main appcache be considered "ready to use" even if all sub-manifests are not yet downloaded? This could enable loading one part of a website even if later sections are still loading.

[1] http://etherpad.mozilla.org/appcache
[2] https://github.com/slightlyoff/NavigationController
[3] https://github.com/slightlyoff/DOMFuture

/ Jonas
Received on Tuesday, 26 March 2013 07:03:57 UTC