Re: internet media types and encoding from Paul Grosso on 2003-04-09 (www-tag@w3.org from April 2003)

From: Paul Grosso <pgrosso@arbortext.com>
Date: Wed, 09 Apr 2003 08:51:16 -0500
To: <www-tag@w3.org>
Message-Id: <4.3.2.7.2.20030409083737.02286de8@172.27.10.30>

At 20:26 2003 04 09 +1000, Rick Jelliffe wrote:
>For encoding error-detection, XML 1.1 takes one small step backwards 
>(by opening up the characters used in names) but then takes a very large 
>step forwards (by not allowing most C1 control characters directly). 
>(The C1 controls are roughly U+0080-U+009F: reserving these is enough
>to detect many common encoding errors, in particular mislabelling
>character sets --such as Big 5 or Win 1252 "ANSI"-- as ISO 8859-1.)

The XML Core WG has not resolved this open issue yet, so I for one
wouldn't mind understanding this better.

The current text in the XML 1.1 CR disallows the C1 control characters
from appearing directly in well-formed XML (instead, they must be
escaped using numeric character references).  This restriction is the
only thing in XML 1.1 that prevents certain (if rare) well-formed
XML 1.0 documents from becoming well-formed XML 1.1 documents merely
by changing the version number in the XML declaration.
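
For concreteness, here is a minimal Python sketch of what "escaped using
numeric character references" would amount to when converting such a
document.  The range is Rick's "roughly U+0080-U+009F", not the exact
production in the CR (which may carve out exceptions), and escape_c1 is a
hypothetical helper name, not anything from the spec.

```python
import re

# Assumption: the affected range is U+0080-U+009F, as in the quoted
# "roughly U+0080-U+009F". The CR's exact character set may differ.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def escape_c1(text):
    """Replace each raw C1 control with a hexadecimal character reference."""
    return C1_CONTROLS.sub(lambda m: "&#x%X;" % ord(m.group()), text)


# A raw U+0085 in the content becomes the reference &#x85;
converted = escape_c1("line one\u0085line two")
```

Text without any C1 controls passes through unchanged, which is why only
the rare documents that contain them directly would be affected.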

I am unclear on the benefits of this restriction.  In exchange for making some
well-formed XML 1.0 documents no longer well-formed XML 1.1, what
exactly are we getting?  I gather the answer is greater "encoding
error detection," that is, the ability to reject yet more documents.
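
As I understand the claim, the detection would work roughly as in the
Python sketch below: text saved as Windows-1252 but labelled ISO 8859-1
decodes to characters in the C1 range, so a processor that rejects raw C1
controls would flag the mislabelling.  The function name and the sample
string are my own illustration, and the range is again the approximate
U+0080-U+009F from Rick's message.

```python
def raw_c1_controls(data, declared_encoding):
    """Decode bytes with the declared encoding and list any C1 controls found."""
    text = data.decode(declared_encoding)
    # Assumption: C1 controls are taken to be U+0080-U+009F.
    return ["U+%04X" % ord(ch) for ch in text if 0x80 <= ord(ch) <= 0x9F]


# Curly quotes are bytes 0x93 and 0x94 in Windows-1252.
data = "\u201cquoted\u201d".encode("cp1252")

mislabelled = raw_c1_controls(data, "iso-8859-1")  # C1 controls appear
correct = raw_c1_controls(data, "cp1252")          # none appear
```

With the wrong label the two quote bytes surface as U+0093 and U+0094;
with the right label nothing in the C1 range appears.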

I'm not yet sure what I think of this, and the XML Core WG has
members on both sides of this issue.  If someone could make a clear
cost/benefit argument here, it might help some of us on the fence.

paul

[speaking for myself, not the XML Core WG]

Received on Wednesday, 9 April 2003 09:56:34 UTC