Interpretation of Fragment Identifiers from Bjoern Hoehrmann on 2002-10-01 (www-html-editor@w3.org from October to December 2002)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Tue, 01 Oct 2002 11:34:13 +0200
To: www-html@w3.org
Message-ID: <3d9b65a0.3694262@smtp.bjoern.hoehrmann.de>
Hi,

HTML and XHTML allow non-ASCII characters in anchor names (the name
attribute of the a element in HTML, the id attribute in XHTML) but they
fail to define how fragment identifiers must be encoded and decoded to
ensure correct interpretation. Consider, for example, an HTML document
with the following anchor:

  <h1><a name='Bj&ouml;rn'>...</a></h1>

Document-internal links to this fragment might look like this:

  <ol>
    <!--1--><li><a href='#Bj&ouml;rn'  >none nfc</a>  <!-- invalid -->
    <!--2--><li><a href='#Bjo&#x308;rn'>none nfd</a>  <!-- invalid -->
    <!--3--><li><a href='#Bj%f6rn'     >iso-8859-1 nfc</a>
    <!--4--><li><a href='#Bj%c3%b6rn'  >utf-8 nfc</a>
    <!--5--><li><a href='#Bjo%cc%88rn' >utf-8 nfd</a>
  </ol>

Cases 1 and 2 are invalid (non-ASCII character in a URI reference).
However, appendix B.2.1 suggests handling them as follows:

  We recommend that user agents adopt the following convention for
  handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or more
     bytes. 
  2. Escape these bytes with the URI escaping mechanism (i.e., by
     converting each byte to %HH, where HH is the hexadecimal notation
     of the byte value). 

This would make the following cases equivalent:

  Case 1 == Case 4
  Case 2 == Case 5
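
The two steps of the quoted convention can be sketched in a few lines of
Python. This is only an illustration: the function name escape_fragment
is made up for this example, and leaving ASCII characters untouched is
an assumption (the recommendation only addresses non-ASCII characters).

```python
def escape_fragment(fragment: str) -> str:
    """Apply the two-step convention to the non-ASCII characters only."""
    out = []
    for ch in fragment:
        if ord(ch) < 128:
            # Assumption: ASCII characters are passed through unchanged.
            out.append(ch)
        else:
            # Step 1: represent the character in UTF-8.
            # Step 2: escape each resulting byte as %HH.
            out.append("".join(f"%{b:02x}" for b in ch.encode("utf-8")))
    return "".join(out)


case1 = "Bj\u00f6rn"   # precomposed o-umlaut (NFC), as in case 1
case2 = "Bjo\u0308rn"  # o + combining diaeresis (NFD), as in case 2

print(escape_fragment(case1))  # Bj%c3%b6rn   (case 4)
print(escape_fragment(case2))  # Bjo%cc%88rn  (case 5)
```

Note that the convention does not normalize anything, so the NFC and NFD
spellings stay distinct after escaping.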

I tested the above cases in various browsers, with the encoding of the
document set to ISO-8859-1 (x = test passed):

  +-------------+---+---+---+---+---+
  |             | 1 | 2 | 3 | 4 | 5 |
  +-------------+---+---+---+---+---+
  | Mozilla 1.0 | x |   | x | x |   |
  | Opera 6.0   | x |   |   |   |   |
  | Amaya 6.4   | x |   |   |   |   |
  | IE 6.0      | x | x |   |   |   |
  +-------------+---+---+---+---+---+

In a UTF-8 encoded document using 

  <h1><a name='Björn'>...</a></h1>

instead of a named character reference, Mozilla no longer passes test
case 3; the other results remain the same. Using

  <h1><a name='Bjo&#x308;rn'>...</a></h1>

changes the results as follows:

  +-------------+---+---+---+---+---+
  |             | 1 | 2 | 3 | 4 | 5 |
  +-------------+---+---+---+---+---+
  | Mozilla 1.0 |   | x |   |   | x |
  | Opera 6.0   |   | x |   |   |   |
  | Amaya 6.4   |   | x |   |   |   |
  | IE 6.0      | x | x |   |   |   |
  +-------------+---+---+---+---+---+

The test document is always delivered as text/html with a correct
charset parameter.
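
As a way to reason about the tables, here is a hypothetical matching
strategy in Python that decodes the %HH escapes both as UTF-8 and in the
document's charset, without any normalization. It happens to reproduce
the Mozilla results for cases 3 to 5, but it is only a guess made for
illustration; it is not taken from any browser's source, and the names
candidate_names and matches are invented here.

```python
from urllib.parse import unquote_to_bytes


def candidate_names(fragment: str, doc_charset: str) -> set:
    """Decode a fragment's escapes as UTF-8 and as the document charset."""
    raw = unquote_to_bytes(fragment)
    names = set()
    for charset in ("utf-8", doc_charset):
        try:
            names.add(raw.decode(charset))
        except UnicodeDecodeError:
            pass  # the bytes are not valid in this charset; skip it
    return names


def matches(fragment: str, anchor: str, doc_charset: str) -> bool:
    # Assumption: anchor names are compared code point by code point,
    # with no Unicode normalization.
    return anchor in candidate_names(fragment, doc_charset)


anchor = "Bj\u00f6rn"
for case, frag in ((3, "Bj%f6rn"), (4, "Bj%c3%b6rn"), (5, "Bjo%cc%88rn")):
    print(case,
          matches(frag, anchor, "iso-8859-1"),
          matches(frag, anchor, "utf-8"))
# 3 True False
# 4 True True
# 5 False False
```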

RFC 3236 (application/xhtml+xml) refers to XPointer for possible future
interpretations of fragment identifiers. Past XPointer drafts had a
section (4.1.2) [1] on how to deal with non-ASCII characters, but since
the draft was split into several parts, there is no longer such a
section. I wonder a) whether this conforms to CharMod and b) whether it
will cause problems, since it is still not clear how to interpret
non-ASCII escapes in URI references. Another problem is that the
available resources (mostly drafts) disagree on whether NFC
normalization must, must not, or should happen and, if so, when.
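
To make the normalization question concrete, the following Python sketch
decodes the UTF-8 escaped links from cases 4 and 5 and compares them
with and without NFC. The helper decode_fragment is hypothetical, and
whether (and at which stage) a user agent should normalize is exactly
the open question; the code only shows what difference it makes.

```python
import unicodedata
from urllib.parse import unquote_to_bytes


def decode_fragment(fragment: str, normalize: bool) -> str:
    """Undo %HH escaping as UTF-8, optionally applying NFC afterwards."""
    name = unquote_to_bytes(fragment).decode("utf-8")
    return unicodedata.normalize("NFC", name) if normalize else name


nfc_link = "Bj%c3%b6rn"   # case 4
nfd_link = "Bjo%cc%88rn"  # case 5

# Without normalization the two links name different anchors.
print(decode_fragment(nfc_link, False) == decode_fragment(nfd_link, False))  # False
# With NFC applied after decoding they name the same anchor.
print(decode_fragment(nfc_link, True) == decode_fragment(nfd_link, True))    # True
```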

I'd like to see this situation improved already in HTML 4, but at least
for application/xhtml+xml something needs to be done, IMO.

[1] http://www.w3.org/TR/2001/CR-xptr-20010911/#uri-escaping

regards.
Received on Tuesday, 1 October 2002 05:34:09 UTC