- From: Michael Sperberg-McQueen <U35395@UICVM.UIC.EDU>
- Date: Tue, 22 Oct 96 17:28:21 CDT
- To: W3C SGML Working Group <w3c-sgml-wg@w3.org>

On Tue, 22 Oct 1996 17:57:14 -0400 Gavin Nicol said:
>>You cannot recognize the PI, _without having a list of the magic
>>numbers for legal PI definitions_. If a user attempts to use a PI
>>that does not exactly match one of the "the magic number formulas,"
>>then the processor may not even be able to recognize that a PI was
>>present. So the apparent _self-descriptive_ aspect of the data is
>>_not_ there.
>
>Thank you David. This is a point I have felt, but been unable to
>articulate.

I'm not sure what David means by 'magic numbers' here, but if he
means the IETF-defined values for the MIME charset field (or, XML
Encoding attribute), I don't think this is true at all.

Any XML processor will know what character sets (by which, for now,
I mean 'coded character sets and/or encodings thereof') it can
handle. When it encounters one it doesn't handle, I believe it's
likely to fall into a case like the following:

A. The processor accepts ISO 8859, UTF-8, and UCS-2. It gets a
Shift-JIS entity, and says "Sorry; this entity is in a character
encoding called 'Shift-JIS' which I don't handle." It was able to
read and parse the PI, because in Shift-JIS all the characters in
    <?XML encoding='Shift-JIS' ?>
are bit-identical to ISO 8859-*.

B. The processor accepts EBCDIC, UTF-8, and UCS-2. It gets a
Shift-JIS entity, and says "Sorry; this entity is in a character
encoding called 'Shift-JIS' which I don't handle." It was able to
read and parse the PI, because in Shift-JIS all the characters in
    <?XML encoding='Shift-JIS' ?>
are bit-identical to UTF-8.

C. The processor accepts Shift-JIS, UTF-8, and UCS-2. It gets an
EBCDIC entity, and says "Sorry; this entity is in a character
encoding which I don't handle. (There is also a chance that the
entity has been trashed, or isn't in XML.)" The salient fact about
the entity, which is that it's in an unknown character set or
otherwise unprocessable, can be reliably detected, although the
EBCDIC-encoded string 'ebcdic-cp37' cannot be deciphered.

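For the concretely minded, cases A through C can be sketched in a few
lines of Python. This is only an illustration under stated assumptions,
not a proposal for how a processor must work: the function name
sniff_encoding, the SUPPORTED set, and the result strings are all
hypothetical, the declaration syntax is simply the one used in the
examples above, and only the ASCII-compatible family is probed (a real
processor would probe the EBCDIC and UCS-2 families in the same way).

```python
import re

# Hypothetical list of encodings this processor handles (roughly case A).
SUPPORTED = {"iso-8859-1", "utf-8", "ucs-2"}

# The declaration as its bytes appear in any ASCII-compatible encoding
# (ISO 8859-*, UTF-8, Shift-JIS, ...). The syntax mirrors the examples
# in this message; it is an assumption, not a settled grammar.
DECL = re.compile(rb"""^<\?XML\s+encoding=['"]([A-Za-z0-9._-]+)['"]\s*\?>""")


def sniff_encoding(data: bytes) -> str:
    """Classify an entity by what can be read of its encoding label."""
    match = DECL.match(data)
    if match is None:
        # Case C: the label is not readable in any family we know.
        return "unreadable: unknown encoding family, damaged, or not XML"
    name = match.group(1).decode("ascii")
    if name.lower() in SUPPORTED:
        return f"ok: {name}"
    # Cases A and B: the label is readable, the encoding is not handled.
    return f"unsupported: {name}"


if __name__ == "__main__":
    sjis = "<?XML encoding='Shift-JIS' ?><doc/>".encode("shift_jis")
    ebcdic = "<?XML encoding='ebcdic-cp37' ?><doc/>".encode("cp037")
    print(sniff_encoding(sjis))
    print(sniff_encoding(ebcdic))
```

The Shift-JIS entity yields a named complaint because its declaration
is byte-for-byte ASCII; the EBCDIC entity yields only the generic
"unreadable" report, which is the distinction drawn in case C.
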
N.B. David is right to point out that labels can only be read by
those capable of reading them. This is clearly a drawback, compared
with a system in which they are always readable, even by those not
capable of reading them. But the key fact here seems to me very
simple, and accurately conveyed: this-entity-not-readable. Quick
quiz: out of the members of the WG currently reading this (both of
you!), how many might be able to tell their browser how to take
corrective action if they knew the unreadable material was in
something called 'ebcdic-cp37'? How about 'JOHAB'?

>>This is true only for all the character sets that _we precode into
>>XML_. It does not work for any new character set names. The PI looks
>>like it has a parameter, but in fact the PI, and its parameter,
>>constitute a magic string of bytes with no internal structure. This
>>is a bit counterintuitive.

I hope the examples above make clear why I think the limits on a
processor's ability to identify the name of the encoding in use are
a function NOT of the character set names precoded into XML, but of
(a) the families of character sets the processor recognizes and
(b) the family of character sets to which the particular entity in
question actually belongs.

>As is explaining to people that you can do:
>
>   <?XML-ENCODING "SHIFT-JIS">
>   .....
>
>but not
>
>   <?XML-ENCODING "SHIFT-JIS">
>   ....
>   <?XML-ENCODING "UCS2">
>   ....

Well, I may be excessively idealistic, but I had thought "you can't
change character encodings in the middle of a file" would do it for
most readers, with an occasional "Because the software can't handle
it" for the insistent few.

For those of us with jaded stylistic palates and too many technical
standards under our belts, it might be necessary to have a footnote
saying something like "That is, Code extension functions for the ISO
2022 code extension techniques (such as designation escape sequence,
single shift and locking shift), and character-encoding labeling
functions as defined above, may not be used within the body of XML
entities."

>>I do not advocate losing the notion. But if it gets intolerable enough,
>>maybe we can do the right thing after all!

Judging by the response of the WG as a whole, they have already
decided the 'right thing' involves installing bozo filters with our
names on them. We haven't had a new argument in this discussion for
some time, you're not persuading me, I'm not persuading you, and no
one in their right mind is listening. Perhaps we should call it a
thread and stop.

Michael

Received on Tuesday, 22 October 1996 19:00:32 UTC