Re: internet media types and encoding

From: Rick Jelliffe <ricko@topologi.com>
Date: Wed, 16 Apr 2003 01:11:44 +1000
Message-ID: <014d01c30361$53b75260$4bc8a8c0@AlletteSystems.com>
To: <www-tag@w3.org>

Tim Bray wrote:
> I think we're kind of stuck with the C1 chars based on them having been 
> allowed in XML 1.0.  

I think this is not what Tim meant to write, but nevertheless:

1) The C1 characters in Unicode 3.1 are not the same characters as in Unicode 2.

Now that characters have been allocated to those code points, the choice of whether 
to *adopt* them comes down to the architectural issue of which layer XML belongs to:

  -- is it textual (something capable of being text/*, telnet, etc.) or is it text (something 
that needs to be base64 encoded, for example, when sent over protocols expecting 
textual data)?

  -- is imaginary error detection acceptable for mission-critical data? (a sketch follows)
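
To make that last point concrete, here is a rough Python sketch (mine, not 
anything from the specs) of how forbidding raw C1 characters acts as a cheap 
encoding-error detector.  Bytes that are really Windows-1252 but mislabelled 
as ISO 8859-1 decode straight onto the C1 code points:

    # Windows-1252 curly quotes, mislabelled as ISO 8859-1
    data = b"\x93mission-critical\x94"

    text = data.decode("iso-8859-1")   # decodes without complaint
    # the quote bytes land on C1 code points, U+0093 and U+0094
    assert all(0x80 <= ord(c) <= 0x9F for c in (text[0], text[-1]))

A parser that rejects raw C1 characters flags that document as mislabelled; 
one that accepts them (as XML 1.0 parsers could) passes the corruption along 
silently, and the error surfaces much later, if at all.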

2) It was entirely legitimate for XML processors to strip out the 
controls before they even reached the parser, at the transcoding stage,
because they belong to protocols, not text.  As I mentioned, MSXML 4
did this with the encoding 8859-1, because the controls are not defined
as part of 8859-1 (i.e. it is perfectly fine, and indeed good, if a transcoder
strips them from incoming 8859-1 text).
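
For illustration, a transcoding step of the kind I am describing might look 
like this (a minimal Python sketch; the function name is mine, not MSXML's):

    def transcode_8859_1(raw: bytes) -> str:
        text = raw.decode("iso-8859-1")
        # 0x80-0x9F are not defined by ISO 8859-1 proper, so a
        # transcoder may legitimately drop them before parsing
        return "".join(c for c in text if not 0x80 <= ord(c) <= 0x9F)

The parser downstream never knows the controls were there, which is exactly 
the point: the stripping happens below the layer that XML can legislate.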

Even if XML 1.1 says it allows control characters, it cannot guarantee
or specify that transcoders should not strip them (from byte-based encodings),
because that is the business of the encoding and the details of the text protocol.
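
The one robust way to get a control character through, as far as I can see, 
is a numeric character reference, since that is pure ASCII on the wire and no 
byte-level transcoder will touch it.  A quick sketch of the contrast (again 
just illustrative Python, reusing the stripping rule from above):

    strip_c1 = lambda s: "".join(c for c in s if not 0x80 <= ord(c) <= 0x9F)

    raw     = b"a\x92b".decode("iso-8859-1")    # a raw C1 byte
    escaped = b"a&#x92;b".decode("iso-8859-1")  # the same character, escaped

    print(strip_c1(raw))      # 'ab'        -- the control silently vanishes
    print(strip_c1(escaped))  # 'a&#x92;b'  -- plain ASCII, survives intact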


Cheers
Rick Jelliffe