Re: fragment syntax from Foteos Macrides on 1997-10-28 (uri@w3.org from October 1997)

From: Foteos Macrides <MACRIDES@sci.wfbr.edu>
Date: Tue, 28 Oct 1997 11:07:06 -0500 (EST)
To: fielding@kiwi.ics.uci.edu
Cc: uri@bunyip.com, asgilman@access.digex.net
Message-id: <01IPC6B1XMG200CTNV@SCI.WFBR.EDU>
"Roy T. Fielding" <fielding@kiwi.ics.uci.edu> wrote:
>> 1.  The use of ## for special anchors seems reasonable.
>
>Use of more than one "#" character is illegal and not desirable
>in the current URI syntax.
>
>We have discussed this same topic many times on the www-talk and uri
>lists, and the conclusion is always the same:

	The reason for confusion about this stems from changes in
RFC 1808 and the current URL drafts relative to earlier RFCs.

	In RFC 1630 and RFC 1738 it was stated explicitly that there
can be at most one unescaped '#' associated with a URL, and if one
is present, it is punctuation for a fragment (and not part of the
actual URL), whereas any other '#' MUST be hex escaped.  They
said nothing about directionality of parsing for a fragment, because
it's irrelevant under those circumstances.  The vanilla libwwws and
most (all?) versions of Netscape parse right-to-left for a '#',
presumably because if present it is likely to be closer to the end
of the URL+fragment string than the beginning, and some overhead is
saved.

	RFC 1808 and the subsequent URL drafts specify that the
parsing should be left-to-right, and do not state that any '#'
which is not punctuation for a fragment must be hex escaped.  As
a result, many have (mis?)interpreted them to mean that unescaped
'#' characters can be present to the right of the first '#' in a
URL+fragment string, and thus that use of multiple '#' characters
for "special anchors" was made possible.  If that's not intended,
perhaps a reason for specifying the direction of parsing, and for
omitting the pre-RFC 1808 explicit statements about hex escaping
*all* other '#' characters, should be added to the URL draft.
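
	The difference between the two parsing directions only shows
up when more than one unescaped '#' is present.  A minimal sketch in
Python, not taken from any browser's source; the function names
split_left and split_right are made up for illustration:

```python
def split_left(ref):
    """Split at the FIRST '#', i.e. scanning left-to-right."""
    url, sep, fragment = ref.partition("#")
    return (url, fragment) if sep else (ref, None)


def split_right(ref):
    """Split at the LAST '#', i.e. scanning right-to-left."""
    url, sep, fragment = ref.rpartition("#")
    return (url, fragment) if sep else (ref, None)


# One '#': both directions agree.
print(split_left("doc.html#blah"))    # ('doc.html', 'blah')
print(split_right("doc.html#blah"))   # ('doc.html', 'blah')

# Two '#': the results diverge.
print(split_left("doc.html##blah"))   # ('doc.html', '#blah')
print(split_right("doc.html##blah"))  # ('doc.html#', 'blah')
```

With a single '#' the direction does not matter, which is why the
earlier RFCs had no need to specify it.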

	Note that MSIE parses left-to-right for the '#', and Lynx
changed to doing that several releases ago.  Also, it appears that
some people who have lost sight of, or perhaps never understood,
what the Web is all about do things like putting NAME="#blah"
attribute name/value pairs in Anchors, so that the corresponding
fragment will become ##blah, and Netscape (with its still
right-to-left parsing) will be tripped up.  Ugh!
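
	Under the older RFC 1630/1738 rule, a '#' inside the anchor
name itself would be hex escaped when the reference is built, leaving
only one unescaped '#' as punctuation.  A sketch, assuming the
receiving client unescapes the fragment before matching it against
NAME values (make_ref is a hypothetical helper, not a Lynx or
Netscape function):

```python
from urllib.parse import quote, unquote


def make_ref(url, anchor_name):
    """Join url and anchor name, hex escaping every reserved character
    in the name (including '#') so only one unescaped '#' remains."""
    return url + "#" + quote(anchor_name, safe="")


ref = make_ref("doc.html", "#blah")
print(ref)  # doc.html#%23blah

# Only one unescaped '#', so parsing direction no longer matters.
url, _, fragment = ref.partition("#")
print(url, unquote(fragment))  # doc.html #blah
```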

	Note also that though the URL RFCs and drafts allow a
variety of unescaped characters in the fragment, the SGML/HTML
specifications for NAME and ID attributes preclude using several
of them in that context, but no browser, to my knowledge, pays
attention to the latter restrictions.  Nor do authoring tools,
so that in documents written with those tools by naive authors
who are counting on those tools to "do the right thing" on their
behalf, you'll often see characters in NAME attribute values
which are illegal, and thus browsers must continue handling them
as if legal.


>   1) fragment identifiers are dependent on the media type of the
>      entity retrieved;
>
>   2) fragment identifier syntax should be registered with the media
>      type registration;
>
>   3) the "=" character should be used as an indicator for a non-name
>      syntax, as in
>
>          #name        (as in current HTML use)
>          #id=fred
>          #bytes=200-254
>          #words=20-24
>          #line=4
>          #chapter=14
>          #page=3
>
>The only thing that prevents this right now is the uncertainty about
>how to register this along with a media type, and some volunteer to
>look at all the current media types and define a list of appropriate
>ones for the initial registry.
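
	For concreteness, the '=' convention proposed in the quoted
text could be parsed along these lines.  This is only a sketch of an
unregistered proposal: the function names, the "name" label for the
plain case, and the handling of ranges are assumptions, not anything
defined in an RFC or draft:

```python
def parse_fragment(fragment):
    """Return (kind, value) for a fragment given without its '#'.

    A fragment with no '=' is treated as a plain anchor name, as in
    current HTML use; otherwise the text before the first '=' names
    the (hypothetical) non-name syntax.
    """
    key, sep, value = fragment.partition("=")
    if not sep:
        return ("name", fragment)
    return (key, value)


def parse_range(value):
    """Parse '200-254' into (200, 254); a lone '4' becomes (4, 4)."""
    start, sep, end = value.partition("-")
    return (int(start), int(end) if sep else int(start))


print(parse_fragment("intro"))          # ('name', 'intro')
print(parse_fragment("id=fred"))        # ('id', 'fred')
print(parse_fragment("bytes=200-254"))  # ('bytes', '200-254')
print(parse_range("200-254"))           # (200, 254)
```

Which kinds are meaningful would still depend on the media type of
the retrieved entity, per points 1 and 2.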

	Note that Al apparently misinterpreted the above in his
comments about Lynx behavior.  Lynx treats ID attributes homologously
to NAME attributes, so, for example, <P ID="id=fred">blah</P> will
allow use of #id=fred as a fragment for seeking that paragraph (even
though the '=' is invalid in that context).  It does not treat the
'=' as an indicator for a non-name syntax (because there is no such
application convention as yet :).  What Al is seeking is a homolog
for text/plain documents, within which any markup with NAME or ID
attributes would not be interpreted.  Perhaps a convention like
#seek=string meaning unescape "string" and seek its first occurrence
in the document would work in theory, but it could get hairy if you
don't restrict it to text/plain documents, and even then you'd need
something more to deal with possible variations in charset so that
the implementations would be interoperable (and not just another
Lynxism :).
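
	A rough sketch of what such a #seek=string lookup might do
for a text/plain document, in Python.  No such convention exists;
seek_offset and its encoding parameter are hypothetical, and the
charset argument is only there to show why interoperability is
awkward:

```python
from urllib.parse import unquote


def seek_offset(fragment, text, encoding="utf-8"):
    """Return the offset of the first occurrence of the unescaped
    seek string in text, or None if the fragment is not of the form
    seek=string or the string is not found."""
    key, sep, value = fragment.partition("=")
    if not sep or key != "seek":
        return None
    # The escaped bytes must be decoded with SOME charset; two
    # implementations that guess differently will not interoperate.
    needle = unquote(value, encoding=encoding)
    pos = text.find(needle)
    return pos if pos >= 0 else None


print(seek_offset("seek=hello%20world", "say hello world"))       # 4
print(seek_offset("seek=%E9", "caf\u00e9", encoding="latin-1"))  # 3
print(seek_offset("seek=%E9", "caf\u00e9"))                      # None
```

The last two calls differ only in the assumed charset: the same
escaped byte finds the character under ISO-8859-1 but not under
UTF-8.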

				Fote

=========================================================================
 Foteos Macrides            Worcester Foundation for Biomedical Research
 MACRIDES@SCI.WFBR.EDU         222 Maple Avenue, Shrewsbury, MA 01545
=========================================================================
Received on Tuesday, 28 October 1997 11:10:37 UTC