fragments are available for overloading from Al Gilman on 1997-10-25 (uri@w3.org from October 1997)

From: Al Gilman <asgilman@access.digex.net>
Date: Fri, 24 Oct 1997 21:10:09 -0400 (EDT)
To: kweide@tezcat.com (Klaus Weide)
Cc: uri@bunyip.com
Message-Id: <199710250110.VAA24515@access4.digex.net>
To follow up on what Klaus Weide said:

> On Fri, 24 Oct 1997, Al Gilman wrote:
> 
> > [response 1 -- fragments as we know them]
> > 
[snip]
> > Note that the "fragment" in this syntax symbol should therefore
> > be read as "this fragment of the textual URL-reference," not "the
> > cited fragment of the referenced resource" even though in the
> > common case of http: URLs referring to HTML documents either
> > interpretation fits.
> 
> Can you explain what you mean with that last paragraph?
> I don't understand the distinction you are trying to make here,
> or why the two sentences are connected by "therefore".
> 

Let me try to clarify.

To parse an URL reference you can start at the right and search
from right to left until you find a # character.  If you find
one, you take the # character and everything you found to the
right of it (what you found after it is referred to collectively
by the syntax symbol "fragment") and set it aside for client-side
operations on the retrieved resource.  What you have left should
be the URL of a resource.  What you have set aside is a TBD.  You
apply the scheme-specific magic to access the resource and make
the resource and fragment available to the client program.  Once
the client program determines the media type of the resource, it
can interpret the fragment.  Not before.
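
A minimal sketch of that split, in Python, may make the procedure
concrete.  The function name split_url_reference and the example
URL are illustrative assumptions of mine, not anything taken from
the draft.  It searches from the right, as described above, and
makes no attempt to interpret what it sets aside.

```python
def split_url_reference(reference):
    """Split a URL reference into (url, fragment).

    The search runs from right to left for a '#'. The fragment is
    returned without the '#', or as None when no '#' is present.
    Nothing here interprets the fragment: that has to wait until
    the media type of the retrieved resource is known.
    """
    index = reference.rfind("#")
    if index == -1:
        return reference, None
    return reference[:index], reference[index + 1:]


if __name__ == "__main__":
    # Hypothetical URL reference, used only for illustration.
    print(split_url_reference("http://www.example.org/doc.html#sec2"))
```

Only the first element of the pair goes through the
scheme-specific access; the second is held until the client knows
what kind of thing came back.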

The only common use that I know for such a fragment is to refer
to a point in an HTML document at the start of an anchor whose
name attribute (or more recently any element whose ID value)
matches the string in the fragment.

Even though the current draft does use the phrase "fragment
identifier" for this string, its syntax is simply a sequence of
legal characters and its format and interpretation are dependent
on the media type of the resource.
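
To illustrate that dependence, here is a rough Python sketch in
which the rule for reading a fragment is looked up by media type.
The table FRAGMENT_INTERPRETERS and the function names are
hypothetical, invented for this example.  The only rule filled in
is the common HTML one: the first anchor whose name attribute, or
any element whose ID value, matches the fragment string.

```python
from html.parser import HTMLParser


class _AnchorFinder(HTMLParser):
    """Records where the first matching anchor or element starts."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.position = None  # (line, column) of the first match

    def handle_starttag(self, tag, attrs):
        if self.position is not None:
            return  # keep only the first match
        attributes = dict(attrs)
        named_anchor = tag == "a" and attributes.get("name") == self.target
        if named_anchor or attributes.get("id") == self.target:
            self.position = self.getpos()


def find_html_fragment(body, fragment):
    """Return the (line, column) of the match in body, or None."""
    finder = _AnchorFinder(fragment)
    finder.feed(body)
    finder.close()
    return finder.position


# Hypothetical registry: one interpretation rule per media type.
FRAGMENT_INTERPRETERS = {"text/html": find_html_fragment}


def interpret_fragment(media_type, body, fragment):
    """Interpret a fragment only once the media type is known."""
    interpreter = FRAGMENT_INTERPRETERS.get(media_type)
    if interpreter is None:
        raise LookupError("no fragment rules known for " + media_type)
    return interpreter(body, fragment)
```

A new media type with its own interior access method would simply
add another entry to the table, and its fragment syntax need not
look like an identifier at all.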

That is the key clause which is the basis for my "therefore."

That is why I said that from the URL spec we know the
relationship of this substring to its enclosing URL-reference
string, but we _don't_ know from the URL spec how it relates to
the resource it points into.  In particular, it is not limited to
being an identifier token, and it could be a whole clause in a
media-type-related sublanguage, comparable to the ?searchpart
clause.

If URNs don't like the fragment rules that go with URLs, it is
probably because they are not really accessing the same media
types.  If you can't index into a resource after retrieving it by
an URN with the same semantics as if you retrieved it with an
URL, you just need to define a new object class with appropriate
interior access methods and register that class as an internet
media type and voila -- URNs with fragments.

I talked about objects and classes as opposed to files and types
because what is defined here is an access method -- a method of
accessing a logical position or feature within the resource.

For example, one could define an HTML class in which you can make
arbitrary whitespace-for-whitespace replacements in the file and
the value of the HTML object still compares as equal.  The HTML
spec effectively says you should treat HTML as this class.

For this class of HTML objects, line number references are not
stable, but you can define a string compare [with the right
whitespace normalization] on this class that yields stable values
of "first successful match."
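
As a sketch of such a compare, in Python: collapse every run of
whitespace to a single space before searching, so that the offset
of the first match does not depend on how the file was wrapped.
The function names are my own, and the normalization is a
simplifying assumption (it ignores, for instance, elements such as
PRE where whitespace is significant).

```python
import re

_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text):
    """Collapse each run of whitespace to one space and trim the ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def first_match(document, phrase):
    """Offset of the first match in the normalized document, or -1.

    Both arguments are normalized, so the result is unchanged by
    whitespace-for-whitespace replacements in either of them.
    """
    return normalize_whitespace(document).find(normalize_whitespace(phrase))


if __name__ == "__main__":
    wrapped_one_way = "One two\nthree four\nfive"
    wrapped_another = "One\n  two three\n\tfour five"
    # The phrase sits on different lines in the two copies, but the
    # normalized offsets agree.
    print(first_match(wrapped_one_way, "three four"))
    print(first_match(wrapped_another, "three four"))
```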

[This is getting long, but you might be interested in some
related concerns...]

One of the areas that is underdeveloped in the Web media is
numbering.  Try printing the PDF of the HTML4 spec to paper and
then using it.  The indices don't work because there are no page
numbers in them.

The CSS working group is working on numbering topics because
there is a class of things -- sections, notes, etc. -- that want to
get marked with serial numbers in printed documents whereas the
numbers may wish to be suppressed in online display.

The Web Accessibility Initiative particularly cares to see that
this work progress because

	- Braille versions of documents want to be able to
	cross-reference the page number in the parallel printed
	version.

	- Braille and low-vision users of documents want styles
	based on section numbering and no indents, as opposed
	to indents and no numbers.  So the ability to autonumber
	without a lot of support from the author is valuable for
	this user group.

This means that we need to do better at clarifying what are the
intrinsically serial classes of entities in HTML even when they
are not LI elements in an OL collection.

Once the serializable class of HTML content-items is identified,
these series become a basis for reliable indexing into a
document.  "The point in the text where there is a reference to
note number 113" is a locator that just might always get you to
the right resource in a single-html-file document where the note
numbers will be repeatable.  There are ways to break that if
there are notes in conditionally included hypertext, as with
OBJECT.

But that is what I mean by an indexing method associated with an
object class.  You need to know what sets of interior features
are always going to come up at the same number when serialized
and which won't.  Doing the semantic analysis for optional
numbering in stylesheets gets us part of the way there.

Hope this helps.  I don't think that the URL spec ties the hands
of the URN inventors to any significant degree.

-- Al
Received on Friday, 24 October 1997 21:11:20 UTC