- From: Roy T. Fielding <fielding@apache.org>
- Date: Mon, 22 Jul 2002 19:27:47 -0700
- To: www-tag@w3.org
The TAG gave me an action item to describe some of the design history and rationale for fragment identifiers. This is my attempt to write it down in a "few" paragraphs.

==================================

Fragment identifiers have had a long, bawdy relationship to URI that goes back to the very first implementations of WWW addresses.

Theoretically speaking, a fragment identifier is a client-side indirect reference via the URI to which it is attached. In other words, a URI reference of the form URI#fragment tells the client that it can access the "thing" identified by the fragment by first performing a GET action (or its equivalent) on the URI to retrieve a representation of that resource, and then pass the fragment portion of the reference to the media-type specific renderer for that representation in order for that viewer to complete the reference. How it completes the reference is dependent on the media-type, but the most common action is to center or focus the renderer's view of the representation on some fragment of the whole representation (hence, the name fragment identifier).

Originally, fragment identifiers were defined as part of the URI syntax [WWW Addressing, 1992]:

   The format of a hypertext name consists of the name of the naming subscheme to be used, then a name in a format particular to that subscheme, then an optional anchor identifier within the document. For example, the format is for all internet-based access methods:

      scheme : // host.domain:port / path / path # anchor

   A suffix # anchor id allows one to refer to a particular anchor within a document. A suffix ? followed by words separated by + signs allows one to search an index (see details). References from one document to another with a similar name may be abbreviated to a relative name. This imposes certain restrictions on the way that the "path" is represented.

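To make the client-side indirection described earlier concrete, here is a minimal Python sketch. It assumes only the standard `urllib.parse.urldefrag` function; `fake_get`, the handler table, and the example.org reference are hypothetical stand-ins and do not come from any real client.

```python
from urllib.parse import urldefrag


# Hypothetical media-type handlers: each one decides what a fragment
# means for a representation of its own type.
def html_handler(body, fragment):
    return f"scroll to anchor '{fragment}'" if fragment else "show top of page"


def text_handler(body, fragment):
    return "show whole text (fragment ignored)"


HANDLERS = {"text/html": html_handler, "text/plain": text_handler}


def fake_get(uri):
    """Stand-in for a GET (or equivalent) on the URI; returns (media_type, body)."""
    assert "#" not in uri  # the fragment never crosses to the server side
    return "text/html", "<a name='intro'>...</a>"


def dereference(reference, get=fake_get):
    # Step 1: separate the URI from the fragment on the client.
    uri, fragment = urldefrag(reference)
    # Step 2: retrieve a representation using the URI alone.
    media_type, body = get(uri)
    # Step 3: let the media-type-specific handler complete the reference.
    return uri, HANDLERS[media_type](body, fragment)
```

Only the URI part reaches the retrieval step; what the fragment means is decided by whichever handler matches the media type of the representation that comes back.
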
When WWW addresses were proposed to the IETF for standardization as URI, a vocal subset of the Internet community believed that the anchor syntax was specific to Web-like retrieval actions, and since their software wanted to use identifiers with no option for retrieval, they felt it necessary that anchors not be considered part of the URI syntax at all. TimBL changed the name from anchor to fragment, and from address to identifier, to be more inclusive of the wider community's goals, but the result remained a deadlock within the URI working group of the IETF. TimBL published RFC 1630 as a way of saying "well, we can't wait any longer, so here's what we did and you may choose to adopt it or not." [RFC 1630]

   fragmentaddress    uri [ # fragmentid ]

Note that this separated the definition of URI from that of fragmentaddress, which was formerly called documentaddress and later called URI-reference in RFC 2396.

Unfortunately, that still left the URI standardization process in the hands of the IETF URI working group, which at the time was convinced that the Web was only one way to do information retrieval and therefore the Internet should not be "limited" to standardization of URI the way they had already been implemented in the Web. After a year of no progress and facing dissolution of the working group, the editors put together a document representing rough consensus on the URL (and only URL), which became RFC 1738. It skipped the whole idea of fragments, except to exclude "#" from URL:

   The character "#" is unsafe and should always be encoded because it is used in World Wide Web and in other systems to delimit a URL from a fragment/anchor identifier that might follow it.

Needless to say, this sort of compromise simply didn't work. None of the deployed Web software implemented RFC 1738. It didn't correspond to how they parsed URI, how they processed URI, or how they interpreted the result of retrieving representations using URI.

This was rather frustrating to me, since I was both a believer in the IETF process and an implementer of WWW technology. So, when I wrote the spec for relative URL (RFC 1808), I bypassed the issue and defined the syntax as simply

   URL = ( absoluteURL | relativeURL ) [ "#" fragment ]

Including fragment in the reference syntax is necessary because the protocol elements that made use of this standard, such as HTML "a" href, needed a consistent syntax. Furthermore, as individuals defined new URI schemes or updated the ones in 1738, we found that they either forgot about the fragment issue (leading to interoperability problems), or were doing strange contortions in the BNF to account for both the 1738 grammar plus fragments, or were simply ignoring 1738 and making normative references to 1630. However, people rightly pointed out that making absoluteURL exclude fragment while URL includes them results in some painful thinking whenever URL is used in a sentence.

Time passed, and in 1997 we needed to update the URL specifications in accordance with deployed practice. Finally, there was a chance to distinguish between people's opinions about what is desired to be a URI and what it actually meant to be a URI in practice, and produce a specification that defined URI in terms of how they were used within deployed implementations of Internet protocols. The result was RFC 2396, which introduced the notion of URI-reference, which was actually equivalent to TimBL's original definition of documentaddress that was later renamed fragmentaddress, but in a form that was slightly more politically correct.

*phew*

Right. So, now we have RFC 2396's definition of fragment identifiers, which is actually the least restrictive and most open to media-type specific interpretation of any formal definition that has ever been specified for fragments, with the single exception of TimBL's design notes regarding RDF that were written after 2396 was sent to the RFC editor.

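The reference grammar discussed above, an absolute or relative URL followed by an optional "#" fragment, can be sketched as a small splitting function. This is an illustration under stated assumptions rather than a conforming parser: `split_reference` is a hypothetical name, the example.org references are made up, and the only rule applied is that the first "#" delimits the fragment.

```python
def split_reference(reference):
    """Split a reference at the first '#' into (uri_part, fragment).

    fragment is None when no '#' is present, so that an absent fragment
    can be told apart from an empty one.
    """
    uri_part, sep, fragment = reference.partition("#")
    return uri_part, (fragment if sep else None)


# Absolute and relative forms share the same optional fragment suffix.
for ref in ("http://example.org/a/b#sec2", "../b#sec2", "#sec2", "http://example.org/a"):
    print(ref, "->", split_reference(ref))
```

Because the fragment is split off before the rest of the reference is examined, the same function serves absolute references, relative references, and references that consist of nothing but a fragment.
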
The reason we call it a URI that identifies a resource, rather than a UDI that identifies a document, is because we want a URI to reference things in the future -- to point to a source of future useful things. That's what resource means. It is therefore impossible to "retrieve" a resource, since the fact that it is available "over there" is an essential part of it being a resource; the resource remains over there, so the only thing that is retrieved is an instantaneous representation of the resource at the point in time at which it was generated by the origin.

Document means a lot of different things to different people, one of which is a bag of bits representing the framework for a renderable page. All uses of the term "document" in RFC 2396 refer to the virtual document described by the retrieved representation of a resource, where the virtual document may consist of multiple individual representations within a single rendering framework (e.g., a web page may consist of HTML, stylesheets, in-line images, etc.). In HTML, a fragment identifies a portion of the complete virtual document, not just the bits within the HTML framework.

I'll restate my most recent thinking on the subject from a prior message to www-tag:

Fragment identifiers are client-side indirect references, similar to how server-driven negotiation in HTTP allows a server-side indirect reference. The fragment identifier will, if the resource provider has done it right, identify the same thing across multiple representations. Even a resource mapping to static content will have multiple representations over time -- they will all be byte-equivalent, but not age-equivalent. Thus, if the resource provider has done it right, a fragment identifier can be used to consistently define a "thing" similar to a resource.
We do not, however, call that "thing" a resource because it simply is not available on the WWW interface as a resource -- the WWW does not and never has treated the fragment identifier under the same rules of processing as the resource identifier, since doing so would interfere with the intent and result of client-side indirection. Retrieval and name-equivalence are the only two "actions" allowed on a fragment because the interface does not allow a fragment identifier to traverse the architectural boundary between client and server. Fragments are not first-class resources; not even when they consistently identify semantics across multiple representations.

The aspect of fragments that is media-type-specific is the mechanism of the indirect reference when it is dereferenced. The mechanism is not known (and cannot be known) until a representation is in hand. That is, either the fragment identifier is used in a same-document reference or an action equivalent to GET is performed on the URI preceding the fragment in order to obtain that representation. The representation, once in hand, determines what needs to be done to complete the retrieval action.

Finally, there is the issue of same-document references: URI references that do not contain a URI, and therefore are interpreted as referring to the representation in hand. They do not refer to the resource from which that representation was obtained, regardless of the media type: such an interpretation simply cannot be supported in any information system that allows resource content to change over time. Allowing a media type to assert that a same-document reference is applicable to the resource identified by the URI used to retrieve that representation is equivalent to saying that the media type cannot be delivered through gateways, cannot be archived, and cannot be versioned. There are no such media types on the Web.

BTW, none of this is specific to HTTP.

Cheers,

Roy T. Fielding, Chief Scientist, Day Software
                 (roy.fielding@day.com) <http://www.day.com/>

                 Chairman, The Apache Software Foundation
                 (fielding@apache.org) <http://www.apache.org/>
Received on Monday, 22 July 2002 22:27:56 UTC