- From: Chris Lilley <chris@w3.org>
- Date: Wed, 9 Apr 2003 22:29:56 +0200
- To: Paul Grosso <pgrosso@arbortext.com>
- CC: www-tag@w3.org
On Wednesday, April 9, 2003, 3:51:16 PM, Paul wrote:

PG> At 20:26 2003 04 09 +1000, Rick Jelliffe wrote:
>>For encoding error-detection, XML 1.1 takes one small step backwards
>>(by opening up the characters used in names) but then takes a very large
>>step forwards (by not allowing most C1 control characters directly).
>>(The C1 controls are roughly U+0080-U+009F: reserving these is enough
>>to detect many common encoding errors, in particular mislabelling
>>character sets --such as Big 5 or Win 1252 "ANSI"-- as ISO 8859-1.)

PG> The XML Core WG has not resolved this open issue yet, so I for one
PG> wouldn't mind understanding this better.

PG> The current text in the XML 1.1 CR disallows the C1 control characters
PG> directly in well-formed XML (instead, they must be escaped using
PG> numeric character references). This is the only thing in XML 1.1 that
PG> prevents certain potential (if rare) well-formed XML 1.0 documents from
PG> being turned into well-formed XML 1.1 documents by merely changing the
PG> version number in the XML declaration.

I think that if you take a large sample of documents purporting to be
XHTML, and consider only the proportion of them that are well formed (!),
you will find that both raw codepoints and NCRs corresponding to CP-1252
printable characters are used as if Unicode included the same characters
at the same codepoints. If the document is labelled as UTF-8 or
ISO 8859-1 or US-ASCII then both uses are wrong; if by chance the
encoding declaration labels it as CP-1252 then the NCRs are still wrong.

From what Rick Jelliffe is saying, there are probably substantial numbers
of Chinese language documents that do the same thing.

In other words, a clear distinction is not being made between the
encoding (whatever the encoding declaration says it is) and the document
character set, of which there is only one (because XML does not have an
SGML declaration that could change the document character set).

PG> I am unclear on the benefits of this.
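To make the encoding versus document character set distinction concrete, here is a minimal Python sketch. It is only an illustration: the sample bytes and the helper name `c1_controls` are made up for this example and do not come from any spec or tool. It decodes the same two bytes under two encoding labels and shows that a numeric character reference such as `&#147;` names Unicode codepoint 147 (U+0093, a C1 control) whatever the encoding label says.

```python
# The same bytes, read under two different encoding labels.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")      # 0x93 / 0x94 are curly quotes in CP-1252
as_latin1 = raw.decode("iso-8859-1")  # 0x93 / 0x94 are C1 controls in ISO 8859-1


def c1_controls(text):
    """Return the C1 control characters (U+0080-U+009F) found in text."""
    return [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]


# Correctly labelled CP-1252 bytes decode to real Unicode quotation marks.
print([hex(ord(ch)) for ch in as_cp1252 if ord(ch) > 0x7F])

# The same bytes mislabelled as ISO 8859-1 turn into C1 controls, which is
# the signal that lets a parser detect the mislabelling.
print([hex(ord(ch)) for ch in c1_controls(as_latin1)])

# An NCR is resolved against the document character set (Unicode), never
# against the encoding: &#147; always means codepoint 147, i.e. U+0093.
ncr_target = chr(147)
print(hex(ord(ncr_target)))
```

Under these assumptions, only the raw bytes can be rescued by fixing the encoding label; the NCR stays a reference to a control character in every case.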
PG> In exchange for making some well-formed XML 1.0 documents no longer
PG> well-formed XML 1.1,

If they used NCRs then they were already not well formed. If they used
bytes that are correct for the declared encoding then they are still
well formed.

So, it's only XML 1.0 documents that used controls like

- break permitted here
- index
- character tabulation set
- message waiting
- start of guarded area
- end of guarded area
- start of string
- string terminator

and so forth - inherited for legacy reasons from ISO 6429, and for
roundtripping compatibility only - that would be made not well formed.

PG> what exactly are we getting? I gather the answer is greater
PG> "encoding error detection," that is, the ability to reject yet
PG> more documents.

And even in those cases, if those control codes really were meant, they
can still be automatically batch converted to the escaped form. That in
turn improves the security section of the text/xml media type, because
when a non-XML-aware user agent falls back on the text/xml content
(yeah, right, we are already in deep waters here) it does not get these
control codes blasted at its 1970s-era terminal emulator or whatever is
displaying it.

PG> I'm not yet sure what I think of this, and the XML Core WG has
PG> members on both sides of this issue. If someone could make a clear
PG> cost/benefit argument here, it might help some of us on the fence.

For HTML 2.0, I was the squeaky wheel that got the entities for all (not
just some) of the Latin-1 supplement characters added, and for 4.0,
likewise, I was one of those who got tables of Latin-1 supplement and
'symbol font (doh)' entities (mapped to the correct Unicode codepoints
and missing out things that were not characters). I found that this did
help in terms of educating people about the ISO character glyph model
and making them see that a glyph, a character, and a sequence of bytes
in a particular encoding were not the same thing.
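The batch conversion to the escaped form mentioned earlier in this message could look roughly like the following Python sketch. The function name `escape_c1` is hypothetical, and the character range is my reading of the XML 1.1 restricted characters in the C1 block (U+0080-U+009F, leaving out U+0085 NEL, which XML 1.1 still permits directly). Treat it as an assumption-laden illustration rather than a conversion tool.

```python
import re

# C1 controls that would need escaping; U+0085 (NEL) is deliberately left out
# on the assumption that XML 1.1 still allows it as a literal character.
C1_RESTRICTED = re.compile("[\u0080-\u0084\u0086-\u009f]")


def escape_c1(text):
    """Replace raw C1 control characters with hexadecimal NCRs."""
    return C1_RESTRICTED.sub(lambda m: "&#x%X;" % ord(m.group()), text)


# Works on already-decoded text: U+0093 becomes the NCR &#x93;.
print(escape_c1("a\x93b"))
```

This sketch works on decoded character data only. It does not know about CDATA sections, comments, or processing instructions, where character references are not recognized, so a real converter would need to go through an XML parser.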
XML 1.1 continues in this process of architectural clarity. Thus, it is
important that raw C1 control codes are not allowed in XML 1.1 entities,
and if people want 'typographic quotes' or 'S with caron' then they
should use the correct Unicode values for them and not, for example,
pretend that they are shipping around control codes and 'auto correct'
those to 'current Windows code page' values.

--
 Chris                            mailto:chris@w3.org
Received on Wednesday, 9 April 2003 16:30:07 UTC