Re: internet media types and encoding

From: Chris Lilley <chris@w3.org>
Date: Wed, 9 Apr 2003 22:29:56 +0200
Message-ID: <7797158812.20030409222956@w3.org>
To: Paul Grosso <pgrosso@arbortext.com>
CC: www-tag@w3.org

On Wednesday, April 9, 2003, 3:51:16 PM, Paul wrote:


PG> At 20:26 2003 04 09 +1000, Rick Jelliffe wrote:
>>For encoding error-detection, XML 1.1 takes one small step backwards 
>>(by opening up the characters used in names) but then takes a very large 
>>step forwards (by not allowing most C1 control characters directly). 
>>(The C1 controls are roughly U+0080-U+009F: reserving these is enough
>>to detect many common encoding errors, in particular mislabelling
>>character sets --such as Big 5 or Win 1252 "ANSI"-- as ISO 8859-1.)

PG> The XML Core WG has not resolved this open issue yet, so I for one
PG> wouldn't mind understanding this better.

PG> The current text in the XML 1.1 CR disallows the C1 control characters
PG> directly in well-formed XML (instead, they must be escaped using
PG> numeric character references).  This is the only thing in XML 1.1 that
PG> prevents certain potential (if rare) well-formed XML 1.0 documents from
PG> being turned into well-formed XML 1.1 documents by merely changing the
PG> version number in the XML declaration.

I think that if you take a large sample of documents purporting to be
XHTML, and consider only the proportion of them that are well formed
(!), you will find that both raw codepoints and NCRs corresponding to
CP-1252 printable characters are used as if Unicode included the same
characters at the same codepoints. If the document is labelled as
UTF-8, ISO 8859-1, or US-ASCII then both uses are wrong; if by chance
the document's encoding declaration labels it as CP-1252 then the NCRs
are still wrong.
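To make that concrete, here is a small Python 3 sketch (illustrative only,
not part of any spec) of the two failure modes for a right single
quotation mark:

```python
# Sketch: why both raw CP-1252 bytes and CP-1252-based NCRs are wrong
# when the document claims to be ISO 8859-1.

raw = b"It\x92s"  # byte 0x92 is a right single quote in CP-1252

# Decoded as the *declared* encoding, 0x92 becomes U+0092 -- a C1
# control character, not a quotation mark.
as_latin1 = raw.decode("iso-8859-1")
assert as_latin1[2] == "\u0092"

# The NCR &#146; names the same wrong codepoint: U+0092, a C1 control.
assert chr(146) == "\u0092"

# What the author meant is U+2019, which you only get by decoding the
# bytes as what they actually are, CP-1252.
as_cp1252 = raw.decode("cp1252")
assert as_cp1252[2] == "\u2019"
```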

From what Rick Jelliffe is saying, there are probably substantial
numbers of Chinese language documents that do the same thing.

In other words, a clear distinction is not being made between the
encoding (whatever the encoding declaration says it is) and the
document character set, of which there is only one (because XML does
not have an SGML declaration that could change the document character
set).

PG> I am unclear on the benefits of this.  In exchange for making some
PG> well-formed XML 1.0 documents no longer well-formed XML 1.1,

If they used NCRs then they were already not well formed. If they used
bytes that are correct for the declared encoding then they are still
well formed. So it's only XML 1.0 documents that used controls like

- break permitted here
- index
- character tabulation set
- message waiting
- start of guarded area
- end of guarded area
- start of string
- string terminator

and so forth - inherited for legacy reasons from ISO 6429, and for
roundtripping compatibility only - that would be made not well formed.
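And because mislabelled CP-1252 or Big 5 text decodes to exactly these
codepoints, the XML 1.1 restriction doubles as an encoding alarm. A
minimal Python sketch of such a check (my illustration; per XML 1.1,
U+007F-U+0084 and U+0086-U+009F may not appear raw, while NEL, U+0085,
is a permitted line end):

```python
import re

# Characters that XML 1.1 forbids as raw text: DEL plus the C1 controls,
# except NEL (U+0085), which is a line-end character.
RESTRICTED = re.compile(r"[\x7f-\x84\x86-\x9f]")

def has_raw_c1(text: str) -> bool:
    """True if the decoded text contains controls XML 1.1 forbids raw."""
    return RESTRICTED.search(text) is not None

# A document mislabelled as ISO 8859-1 but actually CP-1252 shows these
# controls after decoding, so the check flags the mislabelling too.
mislabelled = b"caf\x94".decode("iso-8859-1")  # 0x94: CP-1252 right double quote
assert has_raw_c1(mislabelled)
assert not has_raw_c1("caf\u201d")  # the correct Unicode character passes
```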

PG> what exactly are we getting? I gather the answer is greater
PG> "encoding error detection," that is, the ability to reject yet
PG> more documents.

And even in those cases, if those control codes really were meant,
they can still be batch-converted automatically to the escaped form.
That in turn improves the security section of the text/xml media type,
because when a non-XML-aware user agent falls back on the text/xml
content (yeah, right, we are already in deep waters here) it does not
get these control codes blasted at its 1970s-era terminal emulator or
whatever is displaying it.
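That batch conversion is a one-liner; a sketch (mine, for illustration):

```python
import re

def escape_c1(text: str) -> str:
    """Replace raw DEL/C1 controls with numeric character references,
    the escaped form XML 1.1 requires for them."""
    return re.sub(
        r"[\x7f-\x84\x86-\x9f]",
        lambda m: "&#x%X;" % ord(m.group()),
        text,
    )

# U+0082 is 'break permitted here' from the list above.
assert escape_c1("a\x82b") == "a&#x82;b"
```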

PG> I'm not yet sure what I think of this, and the XML Core WG has
PG> members on both sides of this issue.  If someone could make a clear
PG> cost/benefit argument here, it might help some of us on the fence.

For HTML 2.0, I was the squeaky wheel that got the entities for all
(not just some) of the Latin-1 supplement characters added; for 4.0,
likewise, I was one of those who got tables of Latin-1 supplement
and 'symbol font (doh)' entities added (mapped to the correct Unicode
codepoints, and leaving out things that were not characters).

I found that this did help in terms of educating people about the ISO
character-glyph model and making them see that a glyph, a character,
and a sequence of bytes in a particular encoding are not the same
thing.

XML 1.1 continues in this process of architectural clarity. Thus, it
is important that raw C1 control codes are not allowed in XML 1.1
entities, and if people want 'typographic quotes' or 'S with caron'
then they should use the correct Unicode values for them and not, for
example, pretend that they are shipping around control codes and 'auto
correct' those to 'current Windows code page' values.

-- 
 Chris                            mailto:chris@w3.org
Received on Wednesday, 9 April 2003 16:30:07 GMT