- From: Al Gilman <asgilman@access.digex.net>
- Date: Fri, 24 Oct 1997 21:10:09 -0400 (EDT)
- To: kweide@tezcat.com (Klaus Weide)
- Cc: uri@bunyip.com
to follow up on what Klaus Weide said:

> On Fri, 24 Oct 1997, Al Gilman wrote:
>
> > [response 1 -- fragments as we know them]
>
> [snip]
> > Note that the "fragment" in this syntax symbol should therefore
> > be read as "this fragment of the textual URL-reference," not "the
> > cited fragment of the referenced resource" even though in the
> > common case of http: URLs referring to HTML documents either
> > interpretation fits.
>
> Can you explain what you mean with that last paragraph?
> I don't understand the distinction you are trying to make here,
> or why the two sentences are connected by "therefore".
>

Let me try to clarify.

To parse an URL reference you can start at the right and search from
right to left until you find a # character. If you find one, you take
the # character and everything you found to the right of it (what you
found after it is referred to collectively by the syntax symbol
"fragment") and set it aside for client-side operations on the
retrieved resource. What you have left should be the URL of a
resource. What you have set aside is a TBD.

You apply the scheme-specific magic to access the resource and make
the resource and fragment available to the client program. Once the
client program determines the media type of the resource, it can
interpret the fragment. Not before.

The only common use that I know for such a fragment is to refer to a
point in an HTML document at the start of an anchor whose name
attribute matches the string in the fragment (or, more recently, at
any element whose ID value matches it).

Even though the current draft does use the phrase "fragment
identifier" for this string, its syntax is simply a sequence of legal
characters, and its format and interpretation are dependent on the
media type of the resource. That is the key clause which is the basis
for my "therefore."
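
The right-to-left split described above can be sketched roughly as
follows. This is only an illustration, not anything taken from the
draft: the function names, the example URLs, and the tiny media-type
dispatch are assumptions made up for the sketch.

```python
from typing import Optional, Tuple


def split_url_reference(reference: str) -> Tuple[str, Optional[str]]:
    """Split a URL-reference into (url, fragment).

    Searches from the right for '#'. The fragment is returned without
    the '#' itself; None means the reference contained no '#'.
    """
    url, sep, fragment = reference.rpartition("#")
    if not sep:
        # No '#' found: the whole reference is the URL.
        return reference, None
    return url, fragment


def interpret_fragment(media_type: str, fragment: str) -> str:
    """Hypothetical dispatch: the fragment only acquires a meaning
    once the media type of the retrieved resource is known."""
    if media_type == "text/html":
        # The one common case: a named anchor or an element ID.
        return "anchor or element named %r" % fragment
    # For any other type the URL syntax alone tells us nothing.
    return "uninterpreted string %r for %s" % (fragment, media_type)
```

For instance, split_url_reference("http://example.org/doc.html#sec2")
gives the URL "http://example.org/doc.html" and the fragment "sec2".
Nothing in that first step looks at what "sec2" means; that is left
to whatever handles the media type after retrieval.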

That is why I said that from the URL spec we know the relationship of
this substring to its enclosing URL-reference string, but we _don't_
know from the URL spec how it relates to the resource it points into.
In particular, it is not limited to being an identifier token, and it
could be a whole clause in a media-type-related sublanguage,
comparable to the ?searchpart clause.

If URNs don't like the fragment rules that go with URLs, it is
probably because they are not really accessing the same media types.
If you can't index into a resource after retrieving it by an URN with
the same semantics as if you retrieved it with an URL, you just need
to define a new object class with appropriate interior access methods
and register that class as an internet media type and voila -- URNs
with fragments.

I talked about objects and classes as opposed to files and types
because what is defined here is an access method -- a method of
accessing a logical position or feature within the resource. For
example, one could define an HTML class in which you can make
arbitrary whitespace-for-whitespace replacements in the file and the
value of the HTML object still compares as equal. The HTML spec
effectively says you should treat HTML as this class. For this class
of HTML objects, line number references are not stable, but you can
define a string compare [with the right whitespace normalization] on
this class that yields stable values of "first successful match."

[This is getting long, but you might be interested in some related
concerns...]

One of the areas that is underdeveloped in the Web media is
numbering. Try printing the PDF of the HTML4 spec to paper and then
using it. The indices don't work because there are no page numbers in
them. The CSS working group is working on numbering topics because
there are classes of things -- sections, notes, etc. -- that want to
get marked with serial numbers in printed documents whereas the
numbers may wish to be suppressed in online display.

The Web Accessibility Initiative particularly cares to see that this
work progress because

 - Braille versions of documents want to be able to cross-reference
   the page number in the parallel printed version.

 - Braille and low-vision users of documents want styles based on
   section numbering and no indents, as opposed to indents and no
   numbers.

So the ability to autonumber without a lot of support from the author
is valuable for this user group.

This means that we need to do better at clarifying what are the
intrinsically serial classes of entities in HTML even when they are
not LI elements in an OL collection.

Once the serializable class of HTML content-items is identified,
these series become a basis for reliable indexing into a document.
"The point in the text where there is a reference to note number 113"
is a locator that just might always get you to the right resource in
a single-html-file document where the note numbers will be
repeatable. There are ways to break that if there are notes in
conditionally included hypertext, as with OBJECT. But that is what I
mean by an indexing method associated with an object class. You need
to know what sets of interior features are always going to come up at
the same number when serialized and which won't. Doing the semantic
analysis for optional numbering in stylesheets gets us part of the
way there.

Hope this helps. I don't think that the URL spec ties the hands of
the URN inventors to any significant degree.

-- Al
Received on Friday, 24 October 1997 21:11:20 UTC