Matrix URIs and RESTful cookies. from Eric J. Bowman on 2006-04-13 (uri@w3.org from April 2006)

From: Eric J. Bowman <eric@bisonsystems.net>
Date: Thu, 13 Apr 2006 20:00:17 +0000
To: uri@w3.org
Message-ID: <W58344571561581144958417@mail.mailsnare.net>

Are you _sure_ cookies are _never_ in accordance with REST architectural principles? Or am I just flat wrong? ;-)

Hello!

We are developing a bandwidth-and-CPU-optimizing "Representation Management System" intended as a wrapper layer for legacy Content Management Systems; we define 'legacy' to include plenty of current and popular (mostly PHP) CMS scripts. I've provided a general overview of the project's ambitious nature to give some perspective:

http://www.iwdn.net/showthread.php?p=44544#post44544

We solved two major URI aliasing problems during development by implementing Matrix URIs as inspired by:

http://www.w3.org/DesignIssues/MatrixURIs.html

The first problem is how to deal with the URL aliases inherent in most content-negotiation setups. The second problem is how to prevent each parameter from becoming a URL alias, while adhering to the precepts of REST. By implementing Matrix URIs we have enabled a generic skin-switching system common to any hosted CMS. I mean really switch the skin, not just provide alternate CSS stylesheets, even to the extent of making a site available in a different markup language like SVG.

Just adding an appropriate .svg file for each .html page name isn't the right solution (although this works for .pdf), because filename extensions are what create the aliases in the first place. The simplest example I can give: if I make a URI for a resource which happens to be a photo of myself and call it /eric, then set up content negotiation between representations labeled eric.jpg and eric.png, I wind up with three URLs for what I intended (99% of the time) to be only one resource, "picture of me".

Which means that each representation is also a resource in its own right, since it may be independently linked to with its filename extension. But that's the problem. What if someone links to /eric.png instead of /eric in a post, and that link is dereferenced by a client which doesn't ACCEPT .png graphics or which prefers .jpg? Mayhem ensues, or appears to if one sets about to do any real-world tests on different client-server combinations, defeating the point of using content negotiation for this purpose.

What is the proper response in such a situation? Return 200 OK at /eric.png with /eric.jpg if the client accepts .jpg but not .png, or present a 300 Multiple Choices list? Or 301-redirect to /eric.jpg? If the intent of the link is that /eric.png is a representation, shouldn't the request be 301-redirected to /eric and let the server negotiate the proper representation of the resource? Oooh, but then we're assuming the link to /eric.png wasn't deliberately intended by the author for the purpose of discussing an alpha-level transparency issue affecting the eric.png resource, but not the other possible representations of /eric -- even if a client's q-value is set lower for .png than .jpg.
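To make the q-value question concrete, here is a minimal sketch of server-driven negotiation between the two image representations. The function names are made up for illustration, and the parsing is simplified: it ignores media-type parameters other than q and treats a missing Accept header as accepting nothing, which a real server would not do.

```python
def parse_accept(header):
    """Return {media_type: q} from an Accept header; q defaults to 1.0."""
    prefs = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs[media_type] = q
    return prefs


def quality(prefs, media_type):
    """Look up q for a type, falling back to type/* and then */*."""
    major = media_type.split("/")[0]
    for candidate in (media_type, major + "/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def negotiate(accept_header, available):
    """Pick the available type with the highest q, or None if none is acceptable."""
    prefs = parse_accept(accept_header)
    best = max(available, key=lambda t: quality(prefs, t))
    return best if quality(prefs, best) > 0 else None
```

A client that sends "image/jpeg, image/png;q=0.5" gets the JPEG at /eric; a client that sends only "image/jpeg" and follows a link to /eric.png has no acceptable representation at that URL, which is exactly the case the questions above are about.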

It's a dilly of a pickle, and I'm glad I'm not using content negotiation for image formats, because my solution of doing away with file extensions almost entirely wouldn't work very well for binary formats. I'm sorry if I've gotten too detailed, but I think it's best to start with image files as an example of the general nature of this problem. One goal of our software is to serve all output as XHTML 1.1 to compliant devices and HTML 4.01 to those which aren't, with two further representations available for WAP 2 handhelds and nontraditional devices as a subset of each main category. We keep complexity in check by using XSLTC to derive all representations from a single resource.

http://canuck.bisonsystems.net/bisonweb/index.html <-- XHTML 1.1
http://canuck.bisonsystems.net/bisonweb/index.htm <-- HTML 4.01

Those are static example files (don't expect much) to demonstrate the problem, using a different filename extension for each representation which is exactly what I want to avoid. The canuck server doesn't negotiate, the next link does. I am using content negotiation for device sorting. There are four buckets. Version 4 and older browsers aren't sorted into any bucket, rather they receive 505 responses since not using VARY headers isn't an option at all here and we MUST NOT send HTTP 1.1-specific responses to HTTP 1.0 clients (intermediaries, fine). Modern desktop browsers are sorted into the 'application/xhtml+xml' bucket, which has a smaller bucket inside just for XML-compatible handhelds.

The HTML 4.01 bucket is for IE (and some others like Camino), as well as browsers which aren't old enough to 505 or new enough to grok XML, while the default 'other' bucket encompasses non-XML handhelds, bots, true oddballs like WebTV, and other devices like screen readers which are all likely to be happier with vanilla HTML sans DOCTYPE declaration and <img> tags, with <link>s to a wider selection of alternate CSS stylesheets than the representations intended for visual rendering on desktop browser screens.

The CMS output is filtered through TagSoup then XSLTC (using a CMS-application-specific stylesheet) to transform it into a common internal format (atom), then it's filtered through XSLTC a second time to transform it from atom into the final result (using a domain-specific stylesheet). Our buckets are labeled 'xhtml', 'html', 'mobile' and 'text'. The second-stage XSLTC transformations use source files labeled xhtml.xsl, html.xsl, mobile.xsl and text.xsl. A VARY:ACCEPT, USER-AGENT header is slapped onto a cached output stream; transformations are never written to disk.
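As a rough illustration of the bucket sorting, the sketch below maps a request's Accept and User-Agent headers to one of the four bucket labels and then to the matching second-stage stylesheet name. The substring hints and the order of the checks are assumptions made up for this example; the actual detection rules are not spelled out here, and the 505 case for very old browsers is left out.

```python
BUCKETS = ("xhtml", "html", "mobile", "text")

# Hypothetical User-Agent substrings, for illustration only.
HTML_ONLY_HINTS = ("MSIE", "Camino")
HANDHELD_HINTS = ("Opera Mini", "Symbian", "BlackBerry")


def choose_bucket(accept, user_agent):
    """Sort a client into one of the four buckets (simplified heuristic)."""
    # Browsers known to want HTML 4.01 are checked first, whatever they claim to accept.
    if any(hint in user_agent for hint in HTML_ONLY_HINTS):
        return "html"
    if "application/xhtml+xml" in accept.lower():
        # XML-capable handhelds are a smaller bucket inside the XHTML bucket.
        if any(hint in user_agent for hint in HANDHELD_HINTS):
            return "mobile"
        return "xhtml"
    # Default 'other' bucket: bots, non-XML handhelds, screen readers, oddballs.
    return "text"


def stylesheet_for(bucket):
    """Name of the second-stage stylesheet for a bucket, e.g. 'html' -> 'html.xsl'."""
    if bucket not in BUCKETS:
        raise ValueError("unknown bucket: " + bucket)
    return bucket + ".xsl"
```

With these made-up rules, a Firefox-style Accept header lands in 'xhtml', an MSIE User-Agent lands in 'html', and a bot that sends only "*/*" falls through to 'text'.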

Therein lies the problem. That VARY header relates thousands upon thousands of possible combinations to only a handful of actual representations in any given cache. The more obscure the client and the closer it is to the network edge, the less chance that its header 'combination' is MRU enough to unlock the document even if it does exist in an intermediary host. It is this less-than-desirable foundation upon which my VARY:COOKIE solution is built, such that the only side effect of enabling cookies is to streamline caching and allow clean toggling between representations while the only effect of disabling cookies would be to radically decrease the chances of a cache hit.

From the perspective of Googlebot, the site's text-only representations are followed and indexed without following (or encountering) any links with ;parameters (like ;view=print), which are disallowed from indexing. From the perspective of a user clicking through from a Google results page, the representation served will most likely (as a result of content negotiation) be either the XHTML 1.1 or HTML 4.01 graphical representation meant for desktop browser screens. Images are kept out of the search-index caches this way, which some people may not want, but oh well.

Obviously, the first access of anyone to the site will result in content negotiation and hopefully a cache hit based on VARY:ACCEPT, USER-AGENT. But that's a landing page. What about a deep link? If VARY:COOKIE is set and the cookie is used to cache the result of content negotiation, i.e. view=(xhtml|html|mobile|text), then those four possible representations are represented in any cache by only four possible pointers. This makes VARY:COOKIE representations much less susceptible to LRU expiration than any of the VARY:ACCEPT, USER-AGENT combinations. Most browsers seem to accept cookies, so if most browsers are using the same cache headers for a representation, that representation's likelihood for cache retention ought to be reinforced.
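The "four pointers" argument can be sketched by modelling how a Vary-aware cache keys its stored variants. This is a simplification under stated assumptions: the cache is assumed to key on the URL plus the exact values of the request headers named in Vary, with no normalization, and the three client header sets are invented for the example.

```python
def variant_key(url, vary, request_headers):
    """Build the key a simple Vary-aware cache would store a variant under."""
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    headers = {k.lower(): v for k, v in request_headers.items()}
    return (url,) + tuple((n, headers.get(n, "")) for n in names)


def count_variants(url, vary, requests):
    """Number of distinct cache entries the given requests would create."""
    return len({variant_key(url, vary, r) for r in requests})


# Three made-up clients that all negotiated into the 'xhtml' bucket.
REQUESTS = [
    {"Accept": "application/xhtml+xml,text/html;q=0.9",
     "User-Agent": "Firefox/1.5", "Cookie": "view=xhtml"},
    {"Accept": "application/xhtml+xml, */*;q=0.1",
     "User-Agent": "Opera/8.5", "Cookie": "view=xhtml"},
    {"Accept": "application/xhtml+xml,text/html;q=0.9",
     "User-Agent": "Safari/417", "Cookie": "view=xhtml"},
]

print(count_variants("/wiki_sandbox", "Accept, User-Agent", REQUESTS))
print(count_variants("/wiki_sandbox", "Cookie", REQUESTS))
```

Under VARY:ACCEPT, USER-AGENT each of the three clients creates its own entry for the same bytes; under VARY:COOKIE they share one. Note that this model keys on the whole Cookie header, so any additional cookie a client sends would split the entries again.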

Am I off in outer space here? I dunno, I have a working example, with the caveat that it doesn't work as nicely right now as it did when we were setting expiration values on the cookies instead of using a cache-control header. But please take my word that this unstable-as-yet demo does represent proof-of-concept which I'd like to defend, or be convinced not to use -- I'm worried that despite VARY:HEADER many caches will refuse to hold representations with cookies set.

For most representations, we use a 303-redirect to set the cookie and clear the parameter. Try it yourself, but remember to (be gentle on the server installation and) use your imagination because we haven't written the second-stage XSLT code, so you'll have to picture the skins on the 'canuck' example links above as the output here.

http://bisonsystems.info/

This will cleanly and reliably negotiate your client into the appropriate bucket. At this time, the only way to tell which bucket you landed in is to read the page title and to look at the MIME type of the output. (Changing the page title between representations won't be done "for real"; why so many print pages on the web are titled 'print this page' instead of the article title, I'll never know.) Add ".xsl" to the page title and you can tell which transformation executed. Don't worry about cookies-disabled behavior; it doesn't work right now, but it will in the next revision.

If you're using Firefox, Opera, Safari or a similarly capable browser and view source, you'll have to imagine that you see the right DOCTYPE and the code from the XHTML 1.1 example above, but the MIME type will be correct and the XML well-formed. If you're using IE or another less-capable browser, you'll have to imagine that the document you are viewing resembles the HTML 4.01 example above, complete with the conditional-comment tag and ie.css file that I don't want to serve to any more clients than absolutely necessary. Notice VARY:ACCEPT, USER-AGENT is set.

http://bisonsystems.info/wiki_sandbox

Go ahead and dereference the sandbox link, and notice VARY:COOKIE is now set. Now try entering /wiki_sandbox.pdf and imagine we were using an actual XSL-FO file, then hit 'back' and watch sandbox load from browser cache based on the cookie settings, after receiving a 304 response (well, it was doin' that the other day...) from the origin server. If you're using Opera, etc. you can try /wiki_sandbox;view=html to see the IE representation while IE users can try /wiki_sandbox;view=xhtml and see an error message. I think ;view=text and ;view=mobile may even work at this time.

I don't understand why IE isn't asking to download the 'application/xhtml+xml' file, or why no browsers ask to download the .pdf file, but it's probably because that isn't really XHTML or PDF in the document body of the output yet. What matters to me right now, is that the proper transformations run and the output has the proper MIME type. We have some other options which currently show how frayed around the edges our demo is but also where we're going, like ;view=edit (puts content into <form>s), ;view=xml (xml.xsl is an identity template which exposes our internal atom markup as a feed) or ;view=print.

The examples so far have been one-dimensional arrays of representations, hardly worth claiming as a Matrix URI implementation. Except when you consider what will happen (soon) when you enter this:

http://bisonsystems.info/wiki_sandbox;skin=alt1;view=print;page=thread

Imagine for a moment we're dealing with a weblog entry, not a wiki page. The above URL will 303-redirect to:

http://bisonsystems.info/wiki_sandbox;view=print

Which is the 'print this page' representation using the alternate skin with the comments in threaded order with indentation. These cookied values could be construed as "state" except they're common amongst user groups rather than individually tailored, and represent fixed representations. Appending ;page=flat at this point will 303-redirect back to /wiki_sandbox;view=print except with the comments in sequential order without indentation. Appending ;skin=main would 303-redirect back to /wiki_sandbox;view=print except with the default skin not the alternate skin. All this does is choose the path of either /main/ or /alt1/ to find the required .xsl files, meaning we can have a rather large matrix of skin options for any site labeled most anything -- provided that spiders only index the resource URLs, never any link with a ;parameter like ;view=print.
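The redirect described above can be sketched as follows. This is an illustration under assumptions, not the demo's actual code: the set of parameters moved into cookies (skin and page) is inferred from the example URL, ;view is left in the URL because the example keeps ;view=print, the cookie attributes are invented, and only matrix parameters on the last path segment are handled.

```python
# Assumption: these are the parameters that get moved into cookies.
COOKIED_PARAMS = ("skin", "page")


def split_matrix(path):
    """Split '/a;k=v;k2=v2' into ('/a', [('k', 'v'), ('k2', 'v2')])."""
    base, *segments = path.split(";")
    params = []
    for segment in segments:
        if not segment:
            continue
        name, _, value = segment.partition("=")
        params.append((name, value))
    return base, params


def redirect_for(path):
    """Return (303, headers) if parameters must move into cookies, else None."""
    base, params = split_matrix(path)
    to_cookie = [(n, v) for n, v in params if n in COOKIED_PARAMS]
    if not to_cookie:
        return None  # nothing to clear; serve the representation directly
    kept = "".join(";%s=%s" % (n, v) for n, v in params if n not in COOKIED_PARAMS)
    headers = [("Location", base + kept)]
    headers += [("Set-Cookie", "%s=%s; Path=/" % (n, v)) for n, v in to_cookie]
    return 303, headers


def stylesheet_path(skin, view):
    """Hypothetical layout: the skin cookie only picks the directory."""
    return "/%s/%s.xsl" % (skin, view)
```

For /wiki_sandbox;skin=alt1;view=print;page=thread this yields a 303 with Location /wiki_sandbox;view=print and two Set-Cookie headers, and the follow-up request would then be rendered with something like /alt1/print.xsl.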

Regardless of skin preference, removing ;view=print or using the back button on the client will return the viewer to the resource URL /wiki_sandbox. I don't think my use of cookies amounts to storing client state contrary to REST, but rather, bookmarking representations for the purpose of more efficient caching, in accordance with REST. The other alternative with Matrix URIs is to have all representations respond with a 200 OK, which sounds like aliasing to me since real-world clients think the parameters represent separate, discrete resources just like they were filename extensions or ?-based query strings. I'd like hosted sites not to have cluttered Google results or have them get high "duplicate removal" scores due to such aliases.

As with my image-file example, I want the option of overriding content negotiation in order to expose each representation as a resource in its own right. Just because I'm using Opera doesn't mean I wouldn't ever want to look at the HTML 4.01 representation, or the WAP version, or explicitly link anyone else to them, just like /eric.png in the example. By using cookies, I am also reducing request latency on the origin server (particularly for mere cache-validation requests) as the content negotiation CPU cycles are bypassed if a cookie is set.
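The latency point can be illustrated with a small sketch in which a valid view cookie short-circuits negotiation. The function name and the negotiate callback are hypothetical stand-ins; the only real API used is Python's standard http.cookies parser.

```python
from http.cookies import SimpleCookie

VIEWS = {"xhtml", "html", "mobile", "text"}


def resolve_view(cookie_header, negotiate):
    """Return (view, negotiated): use the cookie if valid, else negotiate.

    `negotiate` is a zero-argument callable standing in for the (more
    expensive) Accept/User-Agent sorting; it is only called when needed.
    """
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    morsel = jar.get("view")
    if morsel is not None and morsel.value in VIEWS:
        return morsel.value, False
    return negotiate(), True
```

A request carrying "view=xhtml; skin=alt1" is answered without calling negotiate at all, while a request with no cookie, or with an unrecognized view value, falls back to full negotiation.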

Without cookies, implementing Matrix URIs gets really ugly really fast. By treating cookies disabled as the exception rather than the rule, and basing the cache strategy on representational cookies, the implementation falls into place nicely for the majority of real-world clients even on large, highly-complex sites.

Any comments, other than I posted this in the wrong place or it's too long to read or I'm nuts? ;-)

Reality check please!

Eric J. Bowman, principal
Bison Systems Corporation

Received on Friday, 14 April 2006 00:57:25 UTC