Re: internet media types and encoding

On Friday, April 11, 2003, 9:19:54 AM, Rick wrote:


RJ> From: "Chris Lilley" <chris@w3.org>

>> RJ> XML 1.0 advanced textual formats by providing a workable labelling
>> RJ> mechanism for encoding. But we need a verification mechanism too:--
>> RJ> when we go up the protocol stacks XML is somewhat of a weak link.
>> 
>> xml:md5 ?

RJ> An MD5 produced as a checksum on the UTF-16 version of the
RJ> document would work better than redundancy-based checks, which
RJ> miss many important cases (e.g., different versions of ISO
RJ> 8859-1--XML1.1 could be improved by strictly disallowing division
RJ> and multiply in name characters, which would catch some more
RJ> encoding errors between 8859-1 codes. The U+0080 to U+00FF is
RJ> where the lion's share of detectable problems can be found, and it
RJ> should have as many redundant points as possible, both for literal
RJ> characters and name characters.)

While I follow your point, such considerations played little to no
part in the standardisation of that area of Unicode.

RJ> But to be effective, an xml:md5 needs to be produced at the time the
RJ> document is created, which gives us the same trouble as we have with
RJ> character encodings: if producing software were smart enough to
RJ> add an MD5 then it would be smart enough to generate the correct
RJ> encoding.

Yes (though it would make brief deletions, insertions and corruptions
a lot easier to spot). Which leads to 'smart enough' software in
general and proxies in particular.

If proxies are busy altering the bytes in content, they had better
know that they are allowed to do that. Making such a transcoder
XML-aware does not seem very hard, and appears to have significant
value if people want XML documents to have a widely understood
encoding (UTF-8/16) on the server and deliver them in less
globally-understood encodings (Big5, Shift-JIS, 8859-1). But such a
transcoding proxy had better do it *correctly*, and find and fix the
one line of encoding declaration so the file is still well formed.
Making it non-well-formed and then temporarily disguising the fact
with a trumping charset parameter is deeply broken architecture.
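
To make the 'find and fix the one line' point concrete, here is a
rough sketch (Python, entirely my own illustration rather than what
any real proxy does; the names and defaults are placeholders) of the
minimum an XML-aware transcoder has to get right:

import re

def transcode_xml(data, src="UTF-8", dst="Shift_JIS"):
    # If the proxy changes the bytes, it must also change the encoding
    # declaration, or the document stops being well formed.
    text = data.decode(src)
    text = re.sub(r'^(<\?xml[^?]*encoding=["\'])[^"\']+(["\'])',
                  lambda m: m.group(1) + dst + m.group(2),
                  text, count=1)
    # A real proxy also has to cope with documents that carry no
    # declaration at all, with BOMs, with unmappable characters, and
    # with keeping any charset parameter on the wire consistent.
    return text.encode(dst)

doc = '<?xml version="1.0" encoding="UTF-8"?><p>日本語</p>'.encode("UTF-8")
print(transcode_xml(doc))  # declaration now reads encoding="Shift_JIS"

The interesting part is not the regex; it is that the rewrite and the
transcode have to happen together or not at all.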

>> Detect, or correct?

RJ> Detect. The pattern and number of redundant code points does not allow
RJ> correction.

OK good, just checking ;-)
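
For concreteness, a digest over the characters (the UTF-16 form, as
you suggest) rather than over the transmitted bytes would look
something like the sketch below. xml_md5 is just my placeholder name,
and picking big-endian UTF-16 with no BOM is my assumption, not
anything specified anywhere:

import hashlib

def xml_md5(text):
    # Hash the character sequence, not the bytes on the wire, so the
    # same document gives the same digest whatever encoding it was
    # delivered in.
    return hashlib.md5(text.encode("utf-16-be")).hexdigest()

# The same characters match however they were serialised...
assert xml_md5("café") == xml_md5("caf\u00e9")
# ...but a mis-decoded delivery (UTF-8 bytes read as 8859-1) does not.
assert xml_md5("café") != xml_md5("cafÃ©")

And, as you say, that detects the damage without giving you any way
to undo it.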

>> Its abundantly clear that all versions of Unicode from 1.0 to 4.0beta
>> have said and continue to say that 80 to 9F are control codes, not
>> printable characters (and further, they say what codes they are and
>> none of them have any business being in a markup language).

RJ> The original Unicode only said they were reserved as control
RJ> codes, but didn't say what they were.

I would need to go back and check a Unicode 1.1 book. My book is
Unicode 3.0, and has the precise control codes named. In general the
names are established early and then never changed, even if wrong.

RJ> This is to allow different
RJ> uses, and because they are second class citizens, and because
RJ> the semantics and usage of control codes is so waffly: e.g. backspace.
RJ> What does end-of-transmission mean in an XML data stream,
RJ> when appearing directly?

I agree that these control codes have no business in a marked up
document, either with their defined meanings or with some undefined,
application specific meaning.

RJ> Even within the C1 range, not all control points are allocated.
RJ> For example, 0x81 is not allocated to a particular control
RJ> character IIRC.

Correct, 80, 81 and 99 have no defined meanings. 91 and 92 are private
use 1 and private use 2, respectively, which is not much better.
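
Which is also why that range makes such a good tripwire. A scan along
these lines (my own sketch, nothing normative) will flag text whose
bytes were really windows-1252 but were decoded as 8859-1, because the
curly quotes and dashes land exactly on C1 code points:

def c1_controls(text):
    # C1 code points in decoded text almost always mean the bytes
    # were windows-1252 (or similar) mislabelled as ISO 8859-1.
    return [(i, hex(ord(ch))) for i, ch in enumerate(text)
            if 0x80 <= ord(ch) <= 0x9F]

mislabelled = "\x91quoted\x92"   # windows-1252 curly quotes read as 8859-1
print(c1_controls(mislabelled))  # [(0, '0x91'), (7, '0x92')]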

RJ> (This is where my other post to TAG comes in, the one suggesting 
RJ> that there should be a distinction made between standard, extended,
RJ> private, and underworld.

I read that and have been thinking about it, but have not responded
yet. My initial impression is that I can see where you are going with
it, but I also worry that it legitimizes the 'anything goes' approach
and makes backend processing more of a mire than it already is. Or
maybe it just points out that it is a mire, I don't know. Since I have
not formed a coherent response yet, I didn't post one.

RJ> The C1 controls are not suited for use even by reference except in
RJ> standard, private and underworld XML:

Do you mean extended, private and underworld? If not, I don't follow.

RJ> they are just like Private
RJ> Use Area characters in that regard-- unless the other end knows
RJ> what you mean, they are not appropriate. )

Yes. And sometimes not even then.

<?broken-escape-designator unicode="&#x91;" meaning="japanese"?>
<?broken-escape-designator unicode="&#x92;" meaning="english"?>

That piece of unmitigated hackery tells 'the other end', or at least
some other ends, to use particular private-use control characters to
switch between Japanese and English and back again in a stream-like
way, regardless of element boundaries and tree structure.

It's deeply broken (and was deliberately designed to be obviously
broken), just to show that stream control codes and tree structures do
not mix.

Odd you should mention private use characters, I was looking into
emoji this week ;-) oh joy.

-- 
 Chris                            mailto:chris@w3.org
