- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Tue, 30 Oct 2001 18:16:25 +1100
- To: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
- Cc: <xml-editor@w3.org>
(Copying this to the XML editors mail list. Could this be added as an erratum issue for XML 1.0 please?)

From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>

> >But 0x80 when present in data labelled 8859-1 does not have a legitimate mapping to
> >Unicode, so it should fail as a transcoding error, not as a Unicode error.
>
> No, it does have a legal mapping. 0x80 in 8859-1 is the same as 0x80 in Unicode.
> If I'm not mistaken, it's a C1 control character which is legal in XML #PCDATA
> and CDATA. Tim Bray's admitted that this is a design flaw in XML, but it is one
> we have to live with. 0x80 is not an invalid character.

The design rationale of all the ISO 8859 character encodings was that they must be capable of being used by transmission systems which are not aware of the character length of transmission (i.e. 7-bit or 8-bit) or the parity. [1][2] Such systems must mask the top bit. As such, the characters 80 to 9F are reserved as control characters, but not defined, for robustness. [3]

ISO 8859/1 uses Latin Alphabet #1, see http://www.itscj.ipsj.or.jp/ISO-IR/100.pdf for the right hand part.

One of the criteria for Unicode is round-tripping. So, even though the 80-9F characters are not defined but merely reserved in ISO 8859-1, they are still included in Unicode. In Unicode 1.0 they were not defined, but reserved. But since Unicode 3.0 (September 1999) the C1 characters of ISO 6429 occupy those code points. (This fits in with TR 17 http://www.unicode.org/unicode/reports/tr17/ where the issue of round-tripping of parity bits becomes a matter for the Character Encoding Form, and hence not a matter for Unicode to worry about.)

So I believe the ISO 8859-1 mapping tables are a little misleading, because ISO 8859-1 does not define the control characters while Unicode 3.0 now does.

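
To make the two points above concrete, here is a small Python sketch. It is an illustration only, not anything taken from the standards: the helper name strip_top_bit is made up for this example, and Python's "latin-1" codec is used as a stand-in for the usual ISO 8859-1 mapping table, which maps every byte to the Unicode code point with the same number.

```python
import unicodedata


def strip_top_bit(byte: int) -> int:
    """Hypothetical helper: what a 7-bit data path does to an 8-bit byte."""
    return byte & 0x7F


# The identity-style mapping table sends byte 0x80 to U+0080,
# which Unicode classifies as a control character (category "Cc").
ch = bytes([0x80]).decode("latin-1")
print(hex(ord(ch)), unicodedata.category(ch))

# If the top bit is lost, every byte in 0x80-0x9F lands on a C0 control
# (0x00-0x1F). This is the robustness argument for reserving the range.
masked = {strip_top_bit(b) for b in range(0x80, 0xA0)}
print(masked == set(range(0x00, 0x20)))

# One specific case: 0x9B with the top bit stripped is 0x1B (ESC).
print(hex(strip_top_bit(0x9B)))
```

The sketch shows only that the table produces U+0080 mechanically; whether that counts as a "legitimate" mapping is the question argued in this message.
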
Unicode 3.1 recommends, for the 80-9F characters, in chapter 13.1 [4], that "in the absence of specific application uses, they may be interpreted according to the semantics specified by ISO 6429".

The new version of ISO 6429 is available online as ECMA-48. [5] It does not define a character for C1 point 00 and in fact states that "unallocated bit combinations are reserved for future use and should not be used."

XML does not assign any semantics to 80. I believe it is completely consistent for an implementer to hold that this means that the character's meaning is delegated to Unicode, that Unicode delegates it to ISO 6429, and that ISO 6429 reserves it and says it should not be used. It also has the practical effect of catching much UTF-8 data which has been incorrectly labelled.

All that being said, I agree that there is nothing specific in the XML spec to force this, and that people may think that it is up to a higher level protocol (i.e., whatever XML is used for) to define the character. Or it may be decided to allow the unallocated code points as a matter of future-proofing.

I guess the best thing is for XML 1.0 to state explicitly that "The C1 control characters follow ISO 6429 as amended." That makes something explicit that otherwise requires detective work and handwaving. It means that until ISO 6429 defines otherwise, a processor may barf when presented with U+0080, but it does not force implementations to catch it (they may decide not to catch it as a matter of future-proofing). But the user is warned.

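
As a rough illustration of the "catching incorrectly labelled UTF-8" effect, here is a Python sketch. The function name find_c1_controls is hypothetical, and flagging the whole 80-9F range is an assumption made for simplicity: it is broader than the proposal above, which only concerns code points that ISO 6429 leaves unallocated.

```python
from typing import List, Tuple


def find_c1_controls(text: str) -> List[Tuple[int, int]]:
    """Return (index, code point) for every character in U+0080..U+009F.

    Assumption for this sketch: the whole C1 range is treated as suspect.
    """
    return [(i, ord(c)) for i, c in enumerate(text) if 0x80 <= ord(c) <= 0x9F]


# UTF-8 bytes for curly-quoted text, wrongly decoded as ISO 8859-1.
# U+201C is E2 80 9C in UTF-8, so 0x80 and 0x9C show up as C1 characters.
utf8_bytes = "\u201chi\u201d".encode("utf-8")
mislabelled = utf8_bytes.decode("latin-1")
print(find_c1_controls(mislabelled))

# Genuine ISO 8859-1 text: e-acute is the single byte 0xE9, outside 80-9F.
genuine = "caf\u00e9".encode("latin-1").decode("latin-1")
print(find_c1_controls(genuine))
```

This is a heuristic, not a complete detector: e-acute in UTF-8 is C3 A9, and neither byte falls in 80-9F, so mislabelled data of that kind passes unnoticed. That matches "much", not "all", above.
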
Cheers
Rick Jelliffe

[1] See http://ppewww.ph.gla.ac.uk/~flavell/iso8859/iso8859-pointers.html
"The code points 0-31 and 127 are assigned to control characters in US-ASCII, not to displayable glyphs, and the ISO-8859-1 code continues this tradition, as well as declaring the range 128-159 inclusive to be reserved for unspecified control functions: historically, this was intended to protect against 7-bit data paths that would lose the top bit and risk performing some unexpected control function, such as clearing the display!"

[2] http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/
"Characters 0 to 127 are always identical with US-ASCII and the positions 128 to 159 hold control characters nobody ever uses."

[3] http://www.cs.ruu.nl/wais/html/na-dir/internationalization/iso-8859-1-charset.html
"The characters 0x80 through 0x9f are earmarked as extended control characters, and are not used for encoding characters. These characters are not currently used to specify anything. A practical reason for this is interoperability with 7 bit devices (or when the 8th bit gets stripped by faulty software). Devices would then interpret the character as some control character and put the device in an undefined state. (When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a wrong character is represented, but this cannot change the state of a terminal or other device.)"

[4] http://www.unicode.org/unicode/uni2book/ch13.pdf

[5] ftp://ftp.ecma.ch/ecma-st/Ecma-048.pdf

Note also other material on characters:

[6] http://www.unicode.org/unicode/reports/tr20/

[7] http://www.w3.org/TR/charmod/ on private use characters:
"However, their use is strongly discouraged, since private agreements do not scale on the Web."

Received on Tuesday, 30 October 2001 02:09:32 UTC