- From: Rick Jelliffe <ricko@topologi.com>
- Date: Fri, 11 Apr 2003 17:19:54 +1000
- To: <www-tag@w3.org>
From: "Chris Lilley" <chris@w3.org>

> RJ> XML 1.0 advanced textual formats by providing a workable labelling
> RJ> mechanism for encoding. But we need a verification mechanism too:--
> RJ> when we go up the protocol stacks XML is somewhat of a weak link.
>
> xml:md5 ?

An MD5 produced as a checksum on the UTF-16 version of the document
would work better than redundancy-based checks, which miss many
important cases (e.g., different versions of ISO 8859-1). On the
redundancy side, XML 1.1 could be improved by strictly disallowing the
division and multiplication signs in name characters, which would catch
some more encoding errors between 8859-1 codes. The range U+0080 to
U+00FF is where the lion's share of detectable problems can be found,
and it should have as many redundant points as possible, both for
literal characters and name characters.

But to be effective, an xml:md5 needs to be produced at the time the
document is created, which gives us the same trouble as we have with
character encodings: if producing software were smart enough to add an
MD5 then it would be smart enough to generate the correct encoding.

> Detect, or correct?

Detect. The pattern and number of redundant code points does not allow
correction.

> Its abundantly clear that all versions of Unicode from 1.0 to 4.0beta
> have said and continue to say that 80 to 9F are control codes, not
> printable characters (and further, they say what codes they are and
> none of them have any business being in a markup language).

The original Unicode only said they were reserved as control codes, but
didn't say what they were. This was to allow different uses, because
they are second-class citizens, and because the semantics and usage of
control codes are so waffly (backspace, for example). What does
end-of-transmission mean in an XML data stream, when appearing directly?

Even within the C1 range, not all control points are allocated. For
example, 0x81 is not allocated to a particular control character IIRC.
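As a rough illustration of the two kinds of check discussed above (none
of this is specified for XML, and xml:md5 is only a suggestion in this
thread), a sketch in Python might look like the following. The function
names, the choice of big-endian UTF-16 without a byte order mark, and
the decision to flag every code point from U+0080 to U+009F are all
assumptions made for the example.

```python
import hashlib

# Assumption: treat the whole C1 range U+0080..U+009F as "redundant"
# code points whose presence hints at a mislabelled encoding.
C1_RANGE = range(0x80, 0xA0)

# MULTIPLICATION SIGN and DIVISION SIGN, the two Latin-1 code points
# the message proposes keeping out of names.
NAME_SIGNS = {0x00D7, 0x00F7}


def c1_controls(text):
    """Return (offset, code point) pairs for every C1 control in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if ord(ch) in C1_RANGE]


def signs_in_name(name):
    """Return any multiplication or division signs found in a name."""
    return [ch for ch in name if ord(ch) in NAME_SIGNS]


def utf16_md5(text):
    """MD5 hex digest over the text encoded as UTF-16BE without a BOM
    (byte order and BOM handling are assumptions of this sketch)."""
    return hashlib.md5(text.encode("utf-16-be")).hexdigest()


# Windows-1252 curly quotes mislabelled as ISO 8859-1 decode to C1 controls.
raw = b"\x93hi\x94"
mislabelled = raw.decode("latin-1")
print(c1_controls(mislabelled))  # [(0, 147), (3, 148)]

# The digest of the intended text differs from that of the mis-decoded text,
# but only a producer that knew the right encoding could have computed it.
print(utf16_md5(raw.decode("cp1252")) != utf16_md5(mislabelled))  # True
```

The scan can only detect a problem, not say which encoding was meant,
and the digest is only useful if it was computed when the document was
created.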
(This is where my other post to TAG comes in, the one suggesting that
there should be a distinction made between standard, extended, private,
and underworld. The C1 controls are not suited for use even by reference
except in standard, private and underworld XML: they are just like
Private Use Area characters in that regard. Unless the other end knows
what you mean, they are not appropriate.)

Cheers
Rick Jelliffe
Received on Friday, 11 April 2003 03:15:59 UTC