Re: Hyperlinks and content negotiation

Smylers wrote:
> Mike Kelly writes:
>
>   
>> HTML does not currently provision for hyperlinks to indicate a
>> specific content-type preference for the Accept header of a given
>> request.
>>     
>
> That's true, but URLs are distributed in many ways other than as
> hyperlinks in HTML documents, most of which don't have any way of
> indicating that.
>
>   
>> This is an important feature for developers who wish to leverage HTTP
>> content-negotiation,
>>     
>
> Surely even if HTML provided for this such developers would still be
> hampered by URLs being passed around without content types (and users
> not being used to them).  For example URLs are commonly communicated
> via:
>
> * plain-text e-mail messages
> * instant messaging and Twitter messages
> * URL-shortening services
> * adverts on the side of buses
> * T-shirts
>
> All of the above involve either a browser being passed a URL (sans
> content type) from an external application, or the URL being entered by
> a user.
>   
> Any site which requires a certain content type to be supplied to serve
> the desired content will be serving the wrong content.
>   
A 'cold start' request to a URI outside the context of a particular
application flow should revert to the UA's generic Accept header. The
significance of a URI is to identify a resource. There isn't a situation
where the server risks serving the 'wrong' content in response to a
given request, provided the server's conneg logic is sensible.

There should not be too much confusion for a user if clicking a URI (or
entering it into a location bar manually) causes a browser window to
load an HTML page. A good server-side implementation should be aware of,
and appropriately accommodate, use cases in which a user may wish to
progress from the default browser 'landing page' (HTML representation)
to other available formats. Implementations of this kind are more than
likely to be aware of this already, since conneg'd representations are
involved in their design.
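
As a rough illustration of the 'sensible conneg logic' described above,
here is a minimal sketch in Python. The function names and the list of
available types are hypothetical, and the matching is deliberately
simplified: it orders by q-value only and ignores the specificity
precedence rules in the HTTP spec. The first entry in `available` is
assumed to be the server's default 'landing page' representation.

```python
def parse_accept(header):
    """Parse an Accept header into (media_range, q) pairs, highest q first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        entries.append((media_range, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def matches(media_range, media_type):
    """True if a media range such as text/* or */* covers media_type."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type


def choose_representation(accept_header, available):
    """Pick a media type from `available` for the given Accept header.

    A missing or empty header is treated as */*, so a 'cold start'
    request gets the first (default) entry in `available`.
    """
    for media_range, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        for media_type in available:
            if matches(media_range, media_type):
                return media_type
    # Simplification: fall back to the default rather than answering 406
    return available[0]
```

With `available = ["text/html", "application/atom+xml"]`, a typical
browser Accept header selects the HTML page, while a request that
prefers `application/atom+xml` gets the feed from the same URI.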

I do agree that there would be a plain-text URI issue if the sender of
the URI wished to specify a 'non-default' representation. However, this
*trade-off* against the benefits should be at the developer's discretion
and made in the context of a particular system; right now the choice is
taken out of their hands. There are also client-side solutions that
could be introduced over the longer term to mitigate this problem.

> This feature would also break bookmarks: a user could bookmark a page's
> URL, believing that the URL identifies that page, yet on later visiting
> that bookmark being served different content.
>
>   

The browser should have the request object available when storing a 
bookmark, and could easily solve this type of issue by storing bookmarks 
as HTML documents.

>> ... and require HTML hyperlinks that specify requests to URIs with a
>> specific Accept header preference. There are use cases in which the
>> distinction between a resource's representations are relevant to the
>> flow of an html driven application, e.g. the difference to a browser
>> between an atom and an html representation of a blog resource.
>>
>> <a href="/blog" type="text/html">My blog (HTML)</a>
>> <a href="/blog" type="application/atom+xml">My blog (Atom Feed)</a>
>>     
>
> Many blogs seem to manage with different URLs for their HTML content and
> their feeds, so this requirement can't apply to all blogs in general.
> Please could you clarify precisely the situation which leads to this
> requirement, where two separate URLs wouldn't work?
>
>   

Apologies, maybe the paragraph below this is not clear enough. It is
not that using separate URIs "doesn't work", just that it may be
sub-optimal for a particular system that would benefit more from a
strictly standardized distinction between resources and representations.
A clear distinction between the two allows intermediaries to make
valuable, automated assumptions about the significance of a request.
Importantly, these assumptions are made in light of the definitions
outlined in the HTTP spec, which increases interop and removes coupling
between components.

>> Without a formal mechanism in HTML which can specify to UAs the
>> contextual content-type preference for a given hyperlink, HTML is not
>> a viable hypermedia format for systems which must rigorously leverage
>> HTTP conneg - this /could/ be achieved with representation specific
>> URIs (i.e. format 'suffixes', URI parameters etc.) but there are
>> situations in which conneg is a superior solution, particularly in
>> terms of the system as a whole, taking into account intermediaries
>> such as caches.
>>     
>
> In what way does it help for a cache to cache a blog's homepage and feed
> labelled with the same URL compared with caching them with separate
> URLs?  A client retrieving one of them doesn't care whether the other
> one happens to be cached; surely from the cache's point of view they are
> entirely independent?
>
>   

The benefits are realized in terms of automated cache invalidation.

Modifying a resource should automatically invalidate all of its 
representations. 
(http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html#sec13.10)

In a server-side reverse proxy cache scenario (a common use case for
large-scale web applications), being able to rely on this automatic
mechanism as the sole method of cache invalidation ensures that the
cache is refreshed as infrequently and simply as possible, and that
destination server usage is kept to a minimum. This kind of efficiency
gain can dramatically reduce operating costs, particularly in new
'pay-as-you-process' elastic computing infrastructures.

If representations are treated as resources then an automatic cache 
invalidation mechanism is not viable and must be coupled to a specific 
application. E.g.:

What, from the perspective of a cache invalidation mechanism, does

POST /blog.html

mean for the other 'representations'

/blog.atom
/blog.rss

..? Nothing! A cache will not recognize that these are representations
of the same resource, since each is identified as a separate resource
and given its own URI.

If conneg is used, visibility is greatly increased and the cache can
automatically invalidate all of the representations. E.g.:

POST /blog
Content-Type: text/html
....

would invalidate:

/blog
Content-Type: application/atom+xml
Content-Type: application/rss+xml
Content-Type: application/json
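
The difference can be sketched with a toy cache in Python. This is an
illustrative assumption, not how any particular proxy is implemented:
the `VariantCache` class and its method names are hypothetical, and it
models only the Request-URI part of the invalidation rule in RFC 2616
section 13.10 (it ignores the Location and Content-Location headers).

```python
class VariantCache:
    """Toy cache storing representations per URI, keyed by media type."""

    # Methods that RFC 2616 section 13.10 says must invalidate an entity
    UNSAFE_METHODS = {"POST", "PUT", "DELETE"}

    def __init__(self):
        self._store = {}  # uri -> {media_type: body}

    def put(self, uri, media_type, body):
        """Cache one representation of the resource at `uri`."""
        self._store.setdefault(uri, {})[media_type] = body

    def get(self, uri, media_type):
        """Return the cached body, or None on a cache miss."""
        return self._store.get(uri, {}).get(media_type)

    def observe_request(self, method, uri):
        """Drop every cached representation of `uri` on an unsafe request."""
        if method.upper() in self.UNSAFE_METHODS:
            self._store.pop(uri, None)


# With conneg: one URI, so one POST invalidates every representation.
conneg = VariantCache()
conneg.put("/blog", "text/html", "<html>...</html>")
conneg.put("/blog", "application/atom+xml", "<feed>...</feed>")
conneg.observe_request("POST", "/blog")

# With per-format URIs: the POST only touches the URI it names.
suffixed = VariantCache()
suffixed.put("/blog.html", "text/html", "<html>...</html>")
suffixed.put("/blog.atom", "application/atom+xml", "<feed>...</feed>")
suffixed.observe_request("POST", "/blog.html")
```

After this runs, `conneg` holds nothing for `/blog`, while `suffixed`
still holds a stale entry for `/blog.atom` that only
application-specific logic could know to purge.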

>> It seems a shame that this, perfectly valid, use of HTTP is not
>> allowed to system developers that must implement HTML driven
>> applications.
>>     
>
> If HTML were to provide for this, it still wouldn't be usable because of
> the uses of URLs outside of HTML.  As such, implementing this feature
> would be a disservice to HTML developers, misleading them into thinking
> it's viable, whereas actually using separate URLs works better.
>   

It's not a perfect solution to all problems - it's a trade-off.

If highly efficient automated caching is more valuable to your system
than being able to avoid the highly risky world of plain-text URIs and
grumpy Twitter users, then there is an obvious choice to be made. This
trade-off can only be made in context; it doesn't make sense to try to
govern it via the HTML5 spec.

>> Furthermore - it does not seem that potential enabling solutions would
>> cause incompatibility with existing HTML applications currently not
>> concerned with conneg.
>>     
>
> Existing deployed browsers don't have this feature.  If a developer were
> to use HTML like you suggested above it may then work for him in his
> browser, while making his blog's feed URL completely unavailable to
> anybody with an older browser.
>   

True, but my point was actually that if browsers suddenly began using
the type attribute to modify their Accept header, that shouldn't break
any existing application.
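
A rough sketch of that behaviour in Python follows. It is an assumption
for illustration only: no browser or spec defines it this way. The
`GENERIC_ACCEPT` value, the `LinkCollector` class, the
`accept_for_link` helper and the `*/*;q=0.1` fallback are all
hypothetical. Links without a type attribute keep the UA's generic
header, which is why existing pages would be unaffected.

```python
from html.parser import HTMLParser

# Assumed stand-in for a UA's generic Accept header (not a real default)
GENERIC_ACCEPT = "text/html,application/xhtml+xml,*/*;q=0.8"


class LinkCollector(HTMLParser):
    """Collect (href, type) pairs from <a> elements; type may be None."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.links.append((attr_map["href"], attr_map.get("type")))


def accept_for_link(link_type, generic=GENERIC_ACCEPT):
    """Build the Accept header a UA might send when following a link.

    A link with a type attribute prefers that type and allows anything
    else only as a low-priority fallback; a link without one is sent
    with the unchanged generic header.
    """
    if not link_type:
        return generic
    return f"{link_type},*/*;q=0.1"


collector = LinkCollector()
collector.feed(
    '<a href="/blog" type="text/html">My blog (HTML)</a>'
    '<a href="/blog" type="application/atom+xml">My blog (Atom Feed)</a>'
    '<a href="/about">About</a>'
)
headers = [(href, accept_for_link(t)) for href, t in collector.links]
```

Here the two `/blog` links produce different Accept headers for the
same URI, and the untyped `/about` link is requested exactly as it is
today.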

- Mike

Received on Saturday, 17 October 2009 02:12:13 UTC