- From: Chris Lilley <chris@w3.org>
- Date: Tue, 15 Apr 2003 18:54:51 +0200
- To: "Rick Jelliffe" <ricko@topologi.com>
- CC: www-tag@w3.org
On Tuesday, April 15, 2003, 5:11:44 PM, Rick wrote:

RJ> Tim Bray wrote
>> I think we're kind of stuck with the C1 chars based on them having been
>> allowed in XML 1.0.

RJ> I think this is not what Tim meant to write, but nevertheless:

RJ> 1) The C1 characters in Unicode 3.1 are not the same characters as in Unicode 2.
RJ> Now that characters have been allocated to those code points,

OK, and I said that too, but luckily I thought to check with Mark Davis,
the chair of the Unicode Consortium, about that. And I was wrong. He said:

MD> Ah. The character *names* are actually undefined, and simply
MD> marked by "<control>". What you are thinking of as the names are
MD> simply aliases pointing to the ISO 6429 usage. See
MD> http://www.unicode.org/charts/PDF/U0080.pdf.

MD> So the Unicode Standard does not define U+0082 to mean "BREAK
MD> PERMITTED HERE". It just says that there is a control code, one
MD> which in ISO 6429 has that name and meaning. But implementers of
MD> the Unicode Standard are not required to interpret the U+0082 in
MD> the ISO 6429 way.

MD> As I said, there are a few exceptions, listed below, where the
MD> standard does assign semantics. But it does not give the vast
MD> majority any semantics at all.

So, they are mainly just 'unidentified control characters' with no
defined semantics in terms of Unicode.

RJ> 2) It was entirely legitimate for XML processors to strip out the
RJ> controls before they even reached the parser, at the transcoding stage,
RJ> because they belong to protocols not text.

That's an interesting point.

RJ> As I mentioned, MSXML 4 did this with encoding 8859-1, because the
RJ> controls are not defined as part of 8859-1 (i.e. it is perfectly
RJ> fine, and indeed good, if a transcoder strips them from incoming
RJ> 8859-1 text).

-- 
 Chris                            mailto:chris@w3.org
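
To make the two points above concrete, here is a minimal Python sketch. It is an illustration only, not how MSXML 4 or any XML processor actually works. The helper names `describe` and `decode_latin1_strip_c1` are made up for this example. The first shows that a Unicode character database reports a C1 code point such as U+0082 as a plain control with no character name; the second shows what stripping the C1 range at the transcoding stage could look like for 8859-1 input. Note that it removes the whole range U+0080..U+009F, including any of the few code points to which the standard does assign semantics.

```python
import unicodedata

# C1 control range, U+0080..U+009F (assumed here to be the range under discussion).
C1_RANGE = range(0x80, 0xA0)


def describe(cp: int) -> tuple[str, str]:
    """Return (general category, name) for a code point.

    Control characters have no name in the Unicode character database,
    so a "<control>" placeholder is returned instead of raising an error.
    """
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, "<control>")


def decode_latin1_strip_c1(data: bytes) -> str:
    """Hypothetical transcoder: decode ISO 8859-1 bytes and drop C1 controls."""
    # Python's latin-1 codec maps bytes 0x80..0x9F straight to U+0080..U+009F.
    text = data.decode("latin-1")
    return "".join(ch for ch in text if ord(ch) not in C1_RANGE)


if __name__ == "__main__":
    print(describe(0x82))
    print(decode_latin1_strip_c1(b"a\x82b\xe9"))
```

Whether a transcoder should drop these bytes silently, replace them, or report an error is a policy choice that this sketch does not settle.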
Received on Tuesday, 15 April 2003 12:55:08 UTC