Re: [check] Bug 66 from Bjoern Hoehrmann on 2002-12-04 (public-qa-dev@w3.org from December 2002)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 04 Dec 2002 21:23:55 +0100
To: Nick Kew <nick@webthing.com>
Cc: public-qa-dev@w3.org
Message-ID: <3dfa606e.7545860@smtp.bjoern.hoehrmann.de>

* Nick Kew wrote:
>> I am :-) See section 4.2.2 of XML 1.0,
>> http://www.w3.org/TR/REC-xml#dt-sysid
>> 
>> [...]
>>   * Each disallowed character is converted to UTF-8 [IETF RFC 2279] as
>>     one or more bytes.
>
>I don't see any disallowed character under #2.1 of rfc2396 in your
>testcase.

There is a U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS), URIS most not
contain non-ASCII characters, see e.g. section 2.4 of RFC 2396.

>Your testcase was declared as iso-8859-1, so escaping as UTF-8 is
>at best perverse, and breaks commonsense.

Well, it's a normative requirement of XML 1.0, and actually using UTF-8
for disallowed characters in URIs *is* commonsense, you'll find a
similar section in HTML 4.01 (appendix B.2.1), take a look at 

  http://www.w3.org/International/O-URL-and-ident.html

The requirement in XML 1.0 is however only for the system identifier and
there are XML processors that implement it the way the recommendation
wants it, e.g.

  http://www.stg.brown.edu/service/xmlvalid/

Received on Wednesday, 4 December 2002 15:23:11 UTC