Re: internet media types and encoding

On Tuesday, April 15, 2003, 5:11:44 PM, Rick wrote:


RJ> Tim Bray wrote
RJ> |
>> I think we're kind of stuck with the C1 chars based on them having been 
>> allowed in XML 1.0.  

RJ> I think this is not what Tim meant to write, but nevertheless:

RJ> 1) The C1 characters in Unicode 3.1 are not the same characters as in Unicode 2.

RJ> Now that characters have been allocated to those code points,

OK, and I said that too, but luckily I thought to check with Mark
Davis, the chair of the Unicode Consortium, about that. And I was
wrong. He said:

MD> Ah. The character *names* are actually undefined, and simply
MD> marked by "<control>". What you are thinking of as the names are
MD> simply aliases pointing to the ISO 6429 usage. See
MD> http://www.unicode.org/charts/PDF/U0080.pdf.

MD> So the Unicode Standard does not define U+0082 to mean "BREAK
MD> PERMITTED HERE". It just says that there is a control code, one
MD> which in ISO 6429 has that name and meaning. But implementers of
MD> the Unicode Standard are not required to interpret the U+0082 in
MD> the ISO 6429 way.

MD> As I said, there are a few exceptions, listed below, where the
MD> standard does assign semantics. But it does not give the vast
MD> majority any semantics at all.

So, they are mainly just 'unidentified control characters' with no
defined semantics in terms of Unicode.
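
For anyone who wants to check this against a character database, here
is a rough sketch using Python's standard unicodedata module. The
helper name describe() and the '<control>' fallback string are my own
choices for illustration; the module itself just reports that these
code points have no character name and are in general category Cc.

```python
import unicodedata

def describe(code_point):
    """Return (general category, name) for a code point.

    Control characters have no name in the Unicode character database,
    so unicodedata.name() would raise ValueError without a default.
    The '<control>' fallback is supplied here for illustration.
    """
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.name(ch, "<control>")

# Walk the C1 range U+0080..U+009F and print what the database says.
for cp in range(0x80, 0xA0):
    category, name = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
```

Every line comes back as category Cc with no name of its own, which
matches what Mark describes: the ISO 6429 names are aliases in the
charts, not names assigned by Unicode.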

RJ> 2) It was entirely legitimate for XML processors to strip out the 
RJ> controls before they even reached the parser, at the transcoding stage,
RJ> because they belong to protocols not text.

That's an interesting point.

RJ> As I mentioned, MSXML 4 did this with encoding 8859-1, because the
RJ> controls are not defined as part of 8859-1 (i.e. it is perfectly
RJ> fine, and indeed good, if a transcoder strips them from incoming
RJ> 8859-1 text).
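
To make the idea concrete, here is a minimal sketch of a transcoder
that drops the C1 range while converting 8859-1 bytes to Unicode. This
is not how MSXML 4 is implemented (I have not seen that code); the
function name strip_c1_from_latin1() is hypothetical, and dropping the
whole range U+0080..U+009F without exception is an assumption made
only for the example.

```python
def strip_c1_from_latin1(data: bytes) -> str:
    """Decode ISO 8859-1 bytes and drop the C1 controls.

    Python's 'latin-1' codec maps bytes 0x80..0x9F straight to
    U+0080..U+009F, so the filtering has to happen after decoding.
    Dropping the entire range is an assumption of this sketch.
    """
    text = data.decode("latin-1")
    return "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)

# Byte 0x82 is a C1 control; byte 0xE9 is e-acute and is kept.
print(strip_c1_from_latin1(b"caf\xe9\x82!"))
```

With a filter like this in front of it, the parser never sees the
controls at all, which is the situation Rick describes.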



-- 
 Chris                            mailto:chris@w3.org

Received on Tuesday, 15 April 2003 12:55:08 UTC