Re: Why hexify fragments? from Chris Lilley on 2005-03-23 (www-i18n-comments@w3.org from March 2005)

From: Chris Lilley <chris@w3.org>
Date: Wed, 23 Mar 2005 20:24:28 +0100
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: www-i18n-comments@w3.org
Message-ID: <9373338.20050323202428@w3.org>

On Wednesday, March 23, 2005, 7:54:44 PM, Bjoern wrote:

BH> * Chris Lilley wrote:
>>Yes, you are right for the case where the IRI is converted to a URI and
>>stored in the XML. I was thinking of the case where the IRI is stored
>>directly in the XML and only hexified to cross the wire. But then I
>>suppose it's not "A new URI format" in that case ... or is it?

BH> That's indeed not new resource identifier syntax, but I think such
BH> protocol interactions are really orthogonal to the requirement. It
BH> is for new URI syntax which requires that encoded character strings
BH> be represented in a way compatible with URI syntax which requires
BH> the use of %xx escapes if the conversion algorithm yields octets
BH> not representable using characters allowed in URIs. Remember that
BH> the components in URIs and IRIs represent octets, not characters,

Yes, I remember that, although RFC 3986 was supposed to tighten that up a
little. Section 2.5, "Identifying Data", does discuss it a little, but it's
still clearly octets.

BH> so

BH>   data:text/plain;charset=utf-7,Bj+APY-rn
BH>   data:text/plain;charset=utf-8,Bj%C3%B6rn
BH>   data:text/plain;charset=utf-8,Björn

BH> are legal IRIs that resolve to the same resource, but

Yes

BH>   data:text/plain;charset=utf-7,Björn
BH>   data:text/plain;charset=utf-8,Björn

BH> while legal IRIs, do not.
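
A minimal Python sketch of the comparison above. The helper `data_text` is
hypothetical and handles only the simple `data:text/plain;charset=...,payload`
shape quoted here; it assumes non-ASCII IRI characters are mapped to UTF-8
octets (as in the IRI-to-URI conversion) before the declared charset is
applied to those octets.

```python
from urllib.parse import unquote_to_bytes

def data_text(iri: str) -> str:
    """Hypothetical helper: text carried by a data:text/plain;charset=X,... IRI."""
    header, payload = iri.split(",", 1)
    charset = header.split("charset=", 1)[1]
    # Non-ASCII characters become UTF-8 octets; %HH escapes become raw octets.
    octets = unquote_to_bytes(payload)
    # The declared charset interprets those octets. errors="replace" keeps
    # invalid sequences from raising, so the results can simply be compared.
    return octets.decode(charset, errors="replace")

same = [
    "data:text/plain;charset=utf-7,Bj+APY-rn",
    "data:text/plain;charset=utf-8,Bj%C3%B6rn",
    "data:text/plain;charset=utf-8,Björn",
]
different = [
    "data:text/plain;charset=utf-7,Björn",
    "data:text/plain;charset=utf-8,Björn",
]

print({data_text(iri) for iri in same})       # one distinct text
print({data_text(iri) for iri in different})  # two distinct texts
```

The octets are what the IRI carries; the same characters after the comma
yield different text once a different charset is declared.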

BH> The same is true for fragment identifiers,
BH> you could create a media type for which fragment identifiers do not
BH> use UTF-8 / %xx-encoding, e.g., for application/x-foo-xml

You could, at the risk of not conforming to

>>> C060 [S] Specifications that define new syntax for URIs, such as a
>>> new URI scheme or a new kind of fragment identifier, MUST specify
>>> that characters outside the US-ASCII repertoire are encoded using
>>> UTF-8 and %HH-escaping.

which is where we came in.....
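
As an illustration of the encoding C060 asks for, here is a short Python
sketch. The function name `escape_non_ascii` is made up for this example, and
it leans on `urllib.parse.quote`, which also escapes some reserved ASCII
characters, so it is stricter than the requirement itself.

```python
from urllib.parse import quote, unquote

def escape_non_ascii(text: str) -> str:
    """Illustrative only: UTF-8 encode, then %HH-escape the resulting octets."""
    # quote() encodes to UTF-8 by default; safe="" leaves only unreserved
    # ASCII characters unescaped.
    return quote(text, safe="")

escaped = escape_non_ascii("Björn")
print(escaped)           # the %HH-escaped form
print(unquote(escaped))  # round-trips back to the original characters
```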

BH> and 

BH>   <!DOCTYPE foo [<!ATTLIST foo id ID #IMPLIED>]>
BH>   <foo id = "Björn" href = "#Bj+APY-rn" />

BH> you can require that the IRI Reference in href refers to <foo> as
BH> identified by the ID in id as the fragment identifier syntax for
BH> application/x-foo-xml is based on UTF-7 rather than UTF-8. So the
BH> requirement is relevant even if no %xx escaping is involved.

Yes, I agree.
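
A rough Python sketch of the fragment example, assuming a hypothetical
application/x-foo-xml processor whose fragment syntax is UTF-7 based. The
function `fragment_matches` and its `fragment_charset` parameter are
invented for illustration and are not part of any specification.

```python
from urllib.parse import unquote_to_bytes

def fragment_matches(id_value: str, href: str, fragment_charset: str) -> bool:
    """Hypothetical: does the fragment in href identify the element with this ID?"""
    fragment = href.split("#", 1)[1]
    # Reduce the fragment to octets (%HH escapes, UTF-8 for non-ASCII), then
    # read those octets in whatever charset the media type prescribes.
    octets = unquote_to_bytes(fragment)
    return octets.decode(fragment_charset, errors="replace") == id_value

# UTF-7 based fragment syntax: the href from the example finds <foo>.
print(fragment_matches("Björn", "#Bj+APY-rn", "utf-7"))
# UTF-8 / %HH based syntax, as C060 requires: the same href does not match,
# while the escaped UTF-8 form does.
print(fragment_matches("Björn", "#Bj+APY-rn", "utf-8"))
print(fragment_matches("Björn", "#Bj%C3%B6rn", "utf-8"))
```

No %xx escape appears in the UTF-7 case, yet which element the reference
resolves to still depends on the encoding the fragment syntax prescribes.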

>>Yes, that's a good URI test. I will add it to the test suite.

BH> Great!

I have had a lengthy response to an email from you on the same subject,
dated July 23, 2003, 5:37:06 AM, sitting in my 'drafts' folder for the
longest time. Since then the IRI specification has been published, so the
answer would now be shorter and clearer than it was.

-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead

Received on Thursday, 24 March 2005 02:55:57 UTC