Re: [check] Bug 66 from Nick Kew on 2002-12-04 (public-qa-dev@w3.org from December 2002)

From: Nick Kew <nick@webthing.com>
Date: Wed, 4 Dec 2002 20:04:44 +0000 (GMT)
To: Bjoern Hoehrmann <derhoermi@gmx.net>
cc: public-qa-dev@w3.org
Message-ID: <Pine.LNX.4.21.0212041941120.1129-100000@jarl.webthing.com>

On Wed, 4 Dec 2002, Bjoern Hoehrmann wrote:

> 
> I am :-) See section 4.2.2 of XML 1.0,
> http://www.w3.org/TR/REC-xml#dt-sysid
> 
> [...]
>   * Each disallowed character is converted to UTF-8 [IETF RFC 2279] as
>     one or more bytes.

I don't see any disallowed character under section 2.1 of RFC 2396 in
your testcase.

Applying the rather different rules you referenced is going to lead to
deeper bugs than this alleged one.

Your testcase was declared as iso-8859-1, so escaping as UTF-8 is
at best perverse, and defies common sense.  This is relevant here,
since OpenSP groks SGML (as it is on the web in general, where agents
grok some form of HTML).
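
To make the mismatch concrete, here is a rough sketch in Python (not
anything the validator or OpenSP actually runs).  The character is a
hypothetical stand-in, not the one from the testcase.  It shows the
same character escaped via UTF-8, as the XML rule asks, and via the
charset the document declared:

```python
from urllib.parse import quote

# Hypothetical example character (a-umlaut); the real testcase's
# character is not reproduced here.
ch = "\u00e4"

# Escaping per the XML 1.0 rule: convert to UTF-8 bytes, then %HH each byte.
utf8_escaped = quote(ch, encoding="utf-8")

# Escaping using the charset the document itself declared.
latin1_escaped = quote(ch, encoding="iso-8859-1")

print(utf8_escaped)    # %C3%A4
print(latin1_escaped)  # %E4
```

The two results name different byte sequences, so a server expecting
one will not recognise the other.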

If your testcase had declared a 16-bit charset, then AFAICS that rule
would lead to more brokenness.

I'm thinking as I write: what happens if we apply perverse-XML rules
when OpenSP's -wxml is in force?  This avoids breaking SGML, but
I'm not convinced about implementing it.

Terje, how are we applying iconv to incoming documents these days?
ISTM that any document that is converted to utf-8 before being
processed by OpenSP sidesteps this problem altogether (because
iconv does the job).
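
Roughly what I have in mind, sketched in Python rather than with
iconv itself.  The byte string is a made-up system identifier, and
this is an assumption about the effect of the conversion, not a
description of what the validator does today:

```python
from urllib.parse import quote

# Hypothetical system identifier as it appears in an iso-8859-1 document.
raw = b"caf\xe9.dtd"

# The iconv-style step: transcode the document bytes to UTF-8.
utf8_bytes = raw.decode("iso-8859-1").encode("utf-8")

# Once the bytes are UTF-8, plain byte-wise %HH escaping already gives
# the result the XML rule asks for, with no special-casing needed.
escaped = quote(utf8_bytes)

print(escaped)  # caf%C3%A9.dtd
```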

-- 
Nick Kew

Received on Wednesday, 4 December 2002 15:04:47 UTC