- From: Chris Lilley <chris@w3.org>
- Date: Fri, 11 Apr 2003 10:48:52 +0200
- To: "Rick Jelliffe" <ricko@topologi.com>
- CC: www-tag@w3.org

On Friday, April 11, 2003, 9:19:54 AM, Rick wrote:

RJ> From: "Chris Lilley" <chris@w3.org>

>> RJ> XML 1.0 advanced textual formats by providing a workable
>> RJ> labelling mechanism for encoding. But we need a verification
>> RJ> mechanism too: when we go up the protocol stacks XML is somewhat
>> RJ> of a weak link.
>>
>> xml:md5 ?

RJ> An MD5 produced as a checksum on the UTF-16 version of the
RJ> document would work better than redundancy-based checks, which
RJ> miss many important cases (e.g., different parts of ISO 8859 --
RJ> XML 1.1 could be improved by strictly disallowing the division and
RJ> multiplication signs in name characters, which would catch some
RJ> more encoding errors between 8859 parts. The U+0080 to U+00FF
RJ> range is where the lion's share of detectable problems can be
RJ> found, and it should have as many redundant points as possible,
RJ> both for literal characters and name characters.)

While following your point, such considerations played little to no
part in the standardisation of that area of Unicode.

RJ> But to be effective, an xml:md5 needs to be produced at the time
RJ> the document is created, which gives us the same trouble as we
RJ> have with character encodings: if producing software were smart
RJ> enough to add an MD5 then it would be smart enough to generate the
RJ> correct encoding.

Yes (though it would make brief deletions, insertions and corruptions
a lot easier to spot).
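To make the xml:md5 idea concrete, a minimal sketch in Python (the
character_md5 name and the BOM-less UTF-16BE canonical form are
assumptions for illustration; nothing in this thread specifies them):

    import hashlib

    def character_md5(xml_text):
        # MD5 over the decoded characters' UTF-16BE form, so the
        # digest survives byte-level transcoding but exposes
        # corruption and mislabelling.
        return hashlib.md5(xml_text.encode("utf-16-be")).hexdigest()

    doc = '<?xml version="1.0"?><greeting>h\u00e9llo</greeting>'
    digest = character_md5(doc)

    # A receiver decodes with the labelled charset, recomputes,
    # compares -- detect, not correct:
    received = doc.encode("iso-8859-1").decode("iso-8859-1")
    assert character_md5(received) == digest

    # Decoding the bytes with the wrong charset changes the
    # characters, so the digest no longer matches:
    wrong = doc.encode("utf-8").decode("iso-8859-1")
    assert character_md5(wrong) != digest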
Which leads to 'smart enough' software in general and proxies in
particular. If proxies are busy altering the bytes in content, they
had better know that they are allowed to do that. Making such a
transcoder XML-aware does not seem very hard and appears to have
significant value if people want XML documents to have a widely
understood encoding (UTF-8/16) on the server and deliver them in less
globally-understood encodings (Big5, Shift-JIS, 8859-1). But such a
transcoding proxy had better do it *correctly*, and find and fix the
one line of encoding declaration so the file is still well-formed.
Making it non-well-formed and then temporarily disguising the fact
with a trumping charset parameter is deeply broken architecture.
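A rough sketch of such an XML-aware transcoding step; the function
name and the regex are illustrative assumptions, and a real proxy
would also have to honour BOMs and any external charset information
it saw on the way in:

    import re

    def transcode_xml(data, from_enc, to_enc):
        text = data.decode(from_enc)
        # Find and fix the one line that matters: the encoding
        # pseudo-attribute of the XML declaration.
        decl = re.compile(r'^(<\?xml[^>]*encoding=)(["\'])[A-Za-z0-9._-]+\2')
        if decl.match(text):
            text = decl.sub(r'\g<1>\g<2>' + to_enc + r'\g<2>', text,
                            count=1)
        else:
            # No declaration: only legal if the target is UTF-8/16.
            assert to_enc.upper().startswith("UTF-")
        return text.encode(to_enc)

    src = ('<?xml version="1.0" encoding="UTF-8"?>'
           '<p>\u00e9l\u00e8ve</p>').encode("utf-8")
    out = transcode_xml(src, "utf-8", "ISO-8859-1")
    assert b'encoding="ISO-8859-1"' in out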
>> Detect, or correct?

RJ> Detect. The pattern and number of redundant code points does not
RJ> allow correction.

OK good, just checking ;-)

>> It's abundantly clear that all versions of Unicode from 1.0 to 4.0
>> beta have said and continue to say that 80 to 9F are control codes,
>> not printable characters (and further, they say what codes they are
>> and none of them have any business being in a markup language).

RJ> The original Unicode only said they were reserved as control
RJ> codes, but didn't say what they were.

I would need to go back and check a Unicode 1.1 book. My book is
Unicode 3.0, and it has the precise control codes named. In general
the names are established early and then never changed, even if wrong.

RJ> This is to allow different uses, and because they are second class
RJ> citizens, and because the semantics and usage of control codes is
RJ> so waffly: e.g. backspace. What does end-of-transmission mean in
RJ> an XML data stream, when appearing directly?

I agree that these control codes have no business in a marked-up
document, either with their defined meanings or with some undefined,
application-specific meaning.

RJ> Even within the C1 range, not all control points are allocated.
RJ> For example, 0x81 is not allocated to a particular control
RJ> character IIRC.

Correct: 80, 81 and 99 have no defined meanings. 91 and 92 are
Private Use One and Private Use Two, respectively, which is not much
better.
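A small illustration of why the C1 range is such a useful tripwire:
bytes 0x80-0x9F decoded as ISO 8859-1 land exactly on these control
code points, and in practice such data is almost always Windows-1252
wearing the wrong label (the checker below is purely illustrative;
nobody in this thread is proposing such a tool):

    C1 = set(range(0x80, 0xA0))

    def suspicious_c1(text):
        # Report C1 code points, which have no business in markup.
        return [(i, hex(ord(c))) for i, c in enumerate(text)
                if ord(c) in C1]

    # A curly apostrophe written as Windows-1252 byte 0x92 but decoded
    # as ISO 8859-1 comes out as U+0092 (Private Use Two):
    data = "it\x92s".encode("latin-1")
    print(suspicious_c1(data.decode("iso-8859-1")))  # [(2, '0x92')]
    print(data.decode("windows-1252"))               # it's, with U+2019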
RJ> (This is where my other post to TAG comes in, the one suggesting
RJ> that there should be a distinction made between standard,
RJ> extended, private, and underworld.

I read that and have been thinking about it, but not responded yet.
My initial impression is that I can see where you are going with it,
but I also worry that it legitimizes the 'anything goes' approach and
makes backend processing more of a mire than it already is. Or maybe
it just points out that it is a mire, I don't know. Since I have not
formed a coherent response yet I didn't post one.

RJ> The C1 controls are not suited for use even by reference except in
RJ> standard, private and underworld XML:

Do you mean extended, private and underworld? If not, I don't follow.

RJ> they are just like Private Use Area characters in that regard --
RJ> unless the other end knows what you mean, they are not
RJ> appropriate.)

Yes. And sometimes not even then.

<?broken-escape-designator unicode="&#x91;" meaning="japanese"?>
<?broken-escape-designator unicode="&#x92;" meaning="english"?>

That piece of unmitigated hackery tells 'the other end', or at least
some other ends, to use a particular private-use control character to
switch between Japanese and English and back again in a stream-like
way, regardless of element boundaries and tree structure. It's deeply
broken (and was deliberately designed to be obviously broken) just to
show that stream control codes and tree structures do not mix.

Odd you should mention private use characters, I was looking into
emoji this week ;-) Oh joy.

--
Chris                          mailto:chris@w3.org

Received on Friday, 11 April 2003 04:48:59 UTC