Re: internet media types and encoding

From: "Chris Lilley" <chris@w3.org>

> RJ> XML 1.0 advanced textual formats by providing a workable labelling
> RJ> mechanism for encoding. But we need a verification mechanism too:--
> RJ> when we go up the protocol stacks XML is somewhat of a weak link.
> 
> xml:md5 ?

An MD5 produced as a checksum on the UTF-16 version of the document
would work better than redundancy-based checks, which miss many important
cases (e.g., different versions of ISO 8859-1). XML 1.1 could be improved by
strictly disallowing the division and multiplication signs (U+00F7 and
U+00D7) in name characters, which would catch some more encoding errors
between 8859-1 codes. The range U+0080 to U+00FF is where the lion's share
of detectable problems can be found, and it should have as many redundant
code points as possible, both for literal characters and name characters.

But to be effective, an xml:md5 needs to be produced at the time the
document is created, which gives us the same trouble as we have with
character encodings: if producing software were smart enough to
add an MD5 then it would be smart enough to generate the correct
encoding. 
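
To make the idea concrete, here is a minimal sketch of such a check in
Python. Nothing here is a defined mechanism: xml:md5 is only the name
floated in this thread, utf16_md5 is a hypothetical helper, and hashing
the big-endian UTF-16 form without a BOM is my assumption so that the
digest does not depend on byte order.

```python
import hashlib

def utf16_md5(raw: bytes, declared_encoding: str) -> str:
    """Decode bytes with the declared encoding, then hash the UTF-16 form."""
    text = raw.decode(declared_encoding)
    # Assumption: big-endian, no BOM, so every producer hashes the same bytes.
    return hashlib.md5(text.encode("utf-16-be")).hexdigest()

original = "<p>caf\u00e9</p>"

# Digest computed by the producer at creation time.
expected = utf16_md5(original.encode("utf-8"), "utf-8")

# Transcoded and correctly relabelled: the digest still matches.
transcoded_ok = utf16_md5(original.encode("iso-8859-1"), "iso-8859-1") == expected

# UTF-8 bytes mislabelled as ISO 8859-1: the digest no longer matches.
mislabelled_ok = utf16_md5(original.encode("utf-8"), "iso-8859-1") == expected
```

Because the hash is taken over the decoded characters rather than the
bytes on the wire, a correct transcoding leaves it unchanged and a wrong
label changes it.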

> Detect, or correct?

Detect. The pattern and number of redundant code points do not allow
correction.
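
A rough sketch of redundancy-based detection, in Python. The helper
find_c1_controls is hypothetical, and the scenario (Windows-1252 bytes
labelled as ISO 8859-1) is just one assumed example of a mislabelling
that lands in the C1 range.

```python
def find_c1_controls(text: str) -> list:
    """Return (index, code point) pairs for characters in U+0080..U+009F."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if 0x80 <= ord(ch) <= 0x9F
    ]

# Curly quotes saved as Windows-1252 but decoded as ISO 8859-1.
raw = "<q>\u201cquoted\u201d</q>".encode("cp1252")
decoded = raw.decode("iso-8859-1")

hits = find_c1_controls(decoded)
# hits == [(3, "U+0093"), (10, "U+0094")]
```

The hits say that something is wrong and where, but nothing about which
encoding was actually intended, so the check can only detect.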

> It's abundantly clear that all versions of Unicode from 1.0 to 4.0beta
> have said and continue to say that 80 to 9F are control codes, not
> printable characters (and further, they say what codes they are and
> none of them have any business being in a markup language).

The original Unicode only said they were reserved as control
codes, but didn't say what they were. This was to allow different
uses, because they are second-class citizens, and because
the semantics and usage of control codes are so waffly (e.g., backspace).
What does end-of-transmission mean in an XML data stream
when it appears directly?

Even within the C1 range, not all code points are allocated.
For example, 0x81 is not allocated to a particular control
character IIRC.

(This is where my other post to TAG comes in, the one suggesting 
that there should be a distinction made between standard, extended,
private, and underworld. The C1 controls are not suited for use
even by reference except in standard, private and underworld
XML: they are just like Private Use Area characters in that regard--
unless the other end knows what you mean, they are not appropriate.)

Cheers
Rick Jelliffe

Received on Friday, 11 April 2003 03:15:59 UTC