Re: internet media types and encoding

Paul Grosso <pgrosso@arbortext.com>

> So my question is, since one will probably have to do even more for the
> kind of reliability you want, why leave in this one incompatibility?  Is
> the cost of breaking backward compatibility with XML 1.0 worth the benefit,
> given that you've just admitted you still don't have your bullet-proof
> reliability?

I will answer in two ways.

Breaking Compatibility
----------------------

First, on the "cost of breaking backwards compatibility with XML 1.0".
As you know, Unicode 1.0 reserved the C1 range but did not assign
any characters to it.  At the time XML 1.0 was released, the control
characters had no semantics.

Since Unicode 3.0, the C1 control codes do have semantics: those of ISO
6429. That means that people who used those codes with different semantics
are not conforming to Unicode 3.0. The XML 1.1 revision's purpose
is to align XML with Unicode 3.1 and future versions.  So anyone who has
used those characters with different semantics is not conforming with
Unicode 3.n, and we don't need to support them.

I note that it does not necessarily break compatibility with implementations.

For example, MSXML 4 (as used in my company's freebie validator for
WXS, Schematron, RELAX NG, etc.) barfed when faced with C1 controls.
It was acting correctly in this, because the presence of a literal control
character in a text stream is either a sign of an error (e.g. EOT) or
of some non-textual use (e.g. what would BS be doing in a document?).
(The validator rejected the controls not because they were allowed or disallowed
in Unicode, but because they were inappropriate when found in an 8859-1
data stream, I would say.)

One of the very first support questions on our validator, in Oct 2001, was
from someone who had a Euro at 0x80 in a document labelled 8859-1: they
reported it as a bug. Redundant-code error-detection works.
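To illustrate that support case (a minimal Python sketch; the string is invented, but the byte values come from the windows-1252 and ISO 8859-1 tables):

```python
# In windows-1252 the Euro sign is byte 0x80; in ISO 8859-1 that same
# byte is the C1 control U+0080.  So a parser that rejects C1 controls
# catches the mislabelling, even though decoding itself succeeds.
data = "price \u20ac99".encode("windows-1252")   # Euro becomes byte 0x80

text = data.decode("iso-8859-1")                 # mislabelled: no decode error...
c1_controls = [ch for ch in text if 0x80 <= ord(ch) <= 0x9F]
assert c1_controls == ["\x80"]                   # ...but the C1 control gives it away
```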

See http://lists.w3.org/Archives/Public/xml-editor/2001OctDec/0004.html
for more info on C1 control characters.

See http://lists.xml.org/archives/xml-dev/200109/msg00259.html
for more info on C0 control characters.

See http://www.xml.com/pub/a/2002/09/18/euroxml.html
for discussion of Euro, especially box "How Could XML 1.1 Help?"

All or Nothing
-----------------

Second, on the issue that this alone is still not enough.

I have explained already in a previous post that error-detection by
exploiting code redundancy and a checksum (xml:md5) are applicable
in different cases. Having one reduces the need for the other, but
they don't cover exactly the same cases. 

Code redundancy is probably better at catching human error (editing
mistakes, a programmer using the default encoding to read or write,
data coming from a corrupt database), while an xml:md5 would probably
be better at catching system errors (e.g. corrupting transcoding, or
processors that work byte-by-byte but make incorrect assumptions).
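A hedged sketch of what an xml:md5-style check might look like. Note that xml:md5 is only a proposal in this thread, and the digest convention below (MD5 over the UTF-8 form of the decoded characters) is my own assumption, not anything specified:

```python
import hashlib

def content_digest(text: str) -> str:
    # Digest over the UTF-8 form of the decoded characters, so the check
    # does not depend on the transfer encoding.  (Assumed convention.)
    return hashlib.md5(text.encode("utf-8")).hexdigest()

original = "na\u00efve r\u00e9sum\u00e9"
declared = content_digest(original)

# A corrupting transcode: UTF-8 bytes reinterpreted as ISO 8859-1.  No byte
# lands in the C1 range, so code redundancy misses it; the checksum does not.
corrupted = original.encode("utf-8").decode("iso-8859-1")
assert corrupted != original
assert content_digest(corrupted) != declared
```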

(The same goes for restricting name characters: it can find things
that code redundancy will not. Somewhere I gave an example of
this with the Greek 8859-? character encoding mislabelled as 
8859-1.  However, the XML Core WG does not want to utilize
redundancy in this way, so the C1 controls are the only game in town.)

Now, as I have pointed out, using code redundancy will not catch
errors where two encodings have common feasible code sequences
that don't involve the C1 range: for example, ISO 8859-2 mislabelled
as ISO 8859-1.  The only way to attempt to detect those is through
name checking.  (And, to flog a dead horse, it is completely spurious
to say that we cannot make use of allocated code points because
we need to be future-compatible: it is ludicrous to think that the Unicode
Consortium will drop letters out of the Greek alphabet or change
ISO 8859-1 so that "multiply" becomes a letter.  Future-proofing
XML against Unicode evolution does not imply that existing allocated
characters cannot be used for redundancy checking at the character
level: it is only unallocated character positions that XML 1.1 needs to
be open to.)
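As a sketch of the name-checking point (the Polish word is just an illustration; the byte values are from the 8859-2 and 8859-1 tables):

```python
# In ISO 8859-2, byte 0xB1 is LATIN SMALL LETTER A WITH OGONEK, a letter;
# in ISO 8859-1 it is PLUS-MINUS SIGN, which is not a letter at all.
name_bytes = "r\u0105ba\u0107".encode("iso-8859-2")   # a Polish word as an element name

# No byte falls in the C1 range, so the redundancy check passes silently:
assert not any(0x80 <= b <= 0x9F for b in name_bytes)

# Decoded under the wrong label, the name contains PLUS-MINUS SIGN,
# which a name-character check would reject as invalid in an XML name:
mislabelled = name_bytes.decode("iso-8859-1")
assert "\u00b1" in mislabelled
```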

What code redundancy will find is where a proprietary extension 
to an ISO standard character set has been used, but labelled as
the ISO set, and many encoding issues for CJK.  See
http://www.topologi.com/resources/XML_Naming_Rules.html
for some details.

All-or-nothing is not the choice, and there is no need to railroad
ourselves into thinking it is. The choice is some-or-nothing. 

The XML Core WG should discover and maintain the strengths of
XML.  Perhaps the TAG has a role in figuring out the robustness objectives
for XML if it is to be used for important data transfers. Checking for
characters in the C1 range only involves a small range-check:
it is hard to imagine any other low-hanging fruit hanging so low.
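For concreteness, the whole check is one comparison per character (a Python sketch):

```python
def has_c1_controls(text: str) -> bool:
    # The C1 range is U+0080 through U+009F inclusive.
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)

assert has_c1_controls("bad\x85data")
assert not has_c1_controls("plain ASCII and \u00e9 are fine")
```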

Cheers
Rick Jelliffe

Received on Friday, 11 April 2003 16:46:32 UTC