
Re: [xml-dev] Text/xml with omitted charset parameter

From: Rick Jelliffe <ricko@allette.com.au>
Date: Tue, 30 Oct 2001 18:16:25 +1100
Message-ID: <007301c16112$c925d550$4bc8a8c0@AlletteSystems.com>
To: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>
Cc: <xml-editor@w3.org>
(Copying this to the XML editors mail list. Could this be added as an erratum issue for XML 1.0 please?)

From: "Elliotte Rusty Harold" <elharo@metalab.unc.edu>

> >But 0x80 when present in data labelled 8859-1 does not have a legitimate mapping to
> >Unicode, so it should fail as a transcoding error, not as a Unicode error.
> 
> No, it does have a legal mapping. 0x80 in 8859-1 is the same as 0x80 in Unicode.   
> If I'm not mistaken, it's a C1 control character which is legal in XML #PCDATA 
> and CDATA. Tim Bray's admitted that this is a design flaw in XML, but it is one 
> we have to live with. 0x80 is not an invalid character. 
 
The design rationale of all the ISO 8859 character encodings was that they had to be
usable over transmission systems that are unaware of the transmission character
length (7-bit or 8-bit) or the parity.[1][2]

Such systems must mask off the top bit. For robustness, therefore, the characters
80 to 9F are reserved as control characters but not defined. [3]
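That robustness rationale can be sketched quickly (a Python 3 illustration of my own, not from the referenced standards): stripping the top bit sends bytes in 80-9F into the C0 control range, where they could trigger device control functions, while bytes in A0-FF merely degrade to the wrong printable character.

```python
# Sketch: why ISO 8859 reserves 0x80-0x9F rather than assigning glyphs.
# A faulty 7-bit path masks off the top bit; 0x80-0x9F bytes then
# collapse onto C0 control codes, whereas 0xA0-0xFF bytes become
# merely-wrong printable ASCII.
for byte in (0x85, 0x9B, 0xE9):
    stripped = byte & 0x7F
    kind = "control" if stripped < 0x20 or stripped == 0x7F else "printable"
    print(f"0x{byte:02X} -> 0x{stripped:02X} ({kind})")
```

Here 0x85 and 0x9B land on C0 controls (0x05 and 0x1B, i.e. ESC), while 0xE9 becomes the harmless letter "i".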

ISO 8859-1 uses Latin Alphabet No. 1; see http://www.itscj.ipsj.or.jp/ISO-IR/100.pdf for the
right-hand part.

One of Unicode's design criteria is round-tripping. So even though the 80-9F characters
are merely reserved, not defined, they are still included: in Unicode 1.0 they were
reserved but left undefined.

But since Unicode 3.0 (September 1999) the C1 characters of ISO 6429 occupy those
code points. (This fits in with TR 17, http://www.unicode.org/unicode/reports/tr17/,
where round-tripping of parity bits becomes a matter for the Character
Encoding Form, and hence not a matter for Unicode to worry about.)

So I believe the ISO 8859-1 mapping tables are a little misleading: ISO 8859-1
does not define the control characters, while Unicode 3.0 now does.
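As a small illustration of how those mapping tables behave in practice (a Python 3 sketch; the choice of codec is mine and is not part of the original discussion), Python's latin-1 codec follows the round-trip identity mapping, so the reserved byte 0x80 decodes without complaint to U+0080, which Unicode now classifies as a control character:

```python
import unicodedata

# The latin-1 codec maps every byte straight onto U+0000-U+00FF,
# so the "reserved" byte 0x80 decodes without error to U+0080 ...
ch = b"\x80".decode("latin-1")
print(hex(ord(ch)))              # 0x80

# ... which Unicode (3.0 onward) places in general category Cc,
# i.e. a control character.
print(unicodedata.category(ch))  # Cc
```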

For the 80-9F characters, Unicode 3.1 recommends, in chapter 13.1[4], that "in the
absence of specific application uses, they may be interpreted according to the
semantics specified by ISO 6429".

The new version of ISO 6429 is available online as ECMA-48. [5]
It does not define a character for the first C1 position (i.e. U+0080) and in fact
states that "unallocated bit combinations are reserved for future use and should
not be used."

XML does not allocate a semantic to 80. I believe it is completely consistent
for an implementer to hold that this means that the character meaning
is delegated to Unicode, and that Unicode delegates it to ISO 6429, and
that ISO 6429 reserves it and says it should not be used.   

This interpretation also has the practical effect of catching much UTF-8 data that
has been incorrectly labelled.
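A quick sketch of why this works (Python 3; the example character is my own choice): multi-byte UTF-8 sequences very often contain lead or continuation bytes in 0x80-0x9F, and those surface as C1 code points when the data is decoded under an ISO 8859-1 label.

```python
# U+201C (left curly quotation mark) encodes in UTF-8 as E2 80 9C.
utf8_bytes = "\u201c".encode("utf-8")
print(utf8_bytes)                  # b'\xe2\x80\x9c'

# Decoded under an (incorrect) ISO 8859-1 label, two of the three
# resulting code points fall in the reserved C1 range 0x80-0x9F,
# so a processor that rejects C1 controls flags the mislabelling.
decoded = utf8_bytes.decode("latin-1")
c1_hits = [hex(ord(c)) for c in decoded if 0x80 <= ord(c) <= 0x9F]
print(c1_hits)                     # ['0x80', '0x9c']
```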

All that being said, I agree that there is nothing specific in the XML spec
to force this, and that people may think it is up to a higher-level
protocol (i.e., whatever XML is used for) to define the character.
Or it may be decided to allow the unallocated code points as a matter
of future-proofing.

I guess the best thing is for XML 1.0 to state explicitly that "The C1 control
characters follow ISO 6429 as amended."  That makes explicit something that
otherwise requires detective work and handwaving.  It means that
until ISO 6429 defines otherwise, a processor may barf when presented
with U+0080, but it does not force implementations to catch it (they
may decide not to catch it as a matter of future-proofing).  But the user is
warned.
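For concreteness, the kind of check such a processor might apply could look like this (a Python 3 sketch of my own; the function name and reporting format are hypothetical, not anything the XML spec mandates):

```python
def find_c1_controls(text):
    """Report (position, code point) for any C1 controls, U+0080-U+009F.

    A processor adopting the reading above might warn or reject on a
    non-empty result; one favouring future-proofing might ignore it.
    """
    return [(i, hex(ord(c)))
            for i, c in enumerate(text)
            if 0x80 <= ord(c) <= 0x9F]

print(find_c1_controls("clean data"))   # []
print(find_c1_controls("bad\x80data"))  # [(3, '0x80')]
```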

Cheers
Rick Jelliffe

[1] See http://ppewww.ph.gla.ac.uk/~flavell/iso8859/iso8859-pointers.html
"The code points 0-31 and 127 are assigned to control characters in US-ASCII, not to displayable glyphs, and the ISO-8859-1 code continues this tradition, as well as declaring the range 128-159 inclusive to be reserved for unspecified control functions: historically, this was intended to protect against 7-bit data paths that would lose the top bit and risk performing some unexpected control function, such as clearing the display! "

[2]http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/
"Characters 0 to 127 are always identical with US-ASCII and the positions 128 to 159 hold control characters nobody ever uses. "

[3] http://www.cs.ruu.nl/wais/html/na-dir/internationalization/iso-8859-1-charset.html
"The characters 0x80 through 0x9f are earmarked as extended control
chracters, and are not used for encoding characters.  These characters
are not currently used to specify anything.  A practical reason for
this is interoperability with 7 bit devices (or when the 8th bit gets
stripped by faulty software).  Devices would then interpret the character
as some control character and put the device in an undefined state.
(When the 8th bit gets stripped from the characters at 0xa0 to 0xff, a
wrong character is represented, but this cannot change the state of a
terminal or other device.)"

[4] http://www.unicode.org/unicode/uni2book/ch13.pdf

[5] ftp://ftp.ecma.ch/ecma-st/Ecma-048.pdf


Note also other material on characters
[6] http://www.unicode.org/unicode/reports/tr20/
[7] http://www.w3.org/TR/charmod/  on private use characters
"However, their use is strongly discouraged, since private agreements do not scale on the Web."
Received on Tuesday, 30 October 2001 02:09:32 GMT
