RE: Fwd: Re: HRRIs, IRIs, etc from Martin Duerst on 2007-06-27 (www-xml-linking-comments@w3.org from April to June 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 27 Jun 2007 19:38:22 +0900
To: Richard Tobin <richard@inf.ed.ac.uk>, "Grosso, Paul" <pgrosso@ptc.com>
Cc: <public-iri@w3.org>, "Richard Ishida" <ishida@w3.org>, "Felix Sasaki" <fsasaki@w3.org>, <www-xml-linking-comments@w3.org>, <public-xml-core-wg@w3.org>, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20070627193357.08f40e40@localhost>

Hello Richard,

Very good catch, thanks. John Cowan has already explained most of
this, so I don't have much to add.

The problem with Unicode is that it contains (on purpose) a very large
number of characters: both characters that are really needed and oddities.
In other cases, it might be easy to draw a clear line between useful
and odd, but the large number of characters/code points effectively
makes this a slippery slope, and therefore different specs easily get
out of sync.

But I guess that in this area, it would also be possible to adapt
the IRI spec slightly, if there are very specific preferences
from the XML side.

Regards,    Martin.

At 18:23 07/06/26, Richard Tobin wrote:
>Can I clarify the status of some of the characters Martin listed,
>please?
>
>> http://www.w3.org/TR/REC-xml/#charsets allows (although, at least
>> in newer versions, discourages):
>> [#xFDD0-#xFDDF],
>> [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
>> [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
>> [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
>> [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
>> [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
>> [#x10FFFE-#x10FFFF]
>> 
>> In the IRI spec, these are excluded:
>>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>>                   / %xD0000-DFFFD / %xE1000-EFFFD
>
>I see XML discourages FDD*, but the ucschar excludes both FDD* and
>FDE*.  Does anyone know the reason for this discrepancy?  FDE* also
>seem to be "not a character".
>
>ucschar also excludes E0***, which seem to be "tags" - what does that
>mean?
>
>ucschar also excludes FFF*, but XML makes no mention of them, except
>of course FFFE and FFFF which aren't allowed in XML at all.
>
>-- Richard
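
To make the comparison above easier to check, here is a minimal sketch
that encodes the two range lists exactly as quoted in this message and
tests individual code points against them. The names UCSCHAR_RANGES,
XML_DISCOURAGED_RANGES, in_ranges and in_ucschar are made up for
illustration; the ranges are copied from the quoted text, not
re-verified against the specs themselves.

```python
# ucschar ranges as quoted above from the IRI spec.
# Planes 1 to 13 each contribute xx0000-xxFFFD; plane 14 starts at E1000.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(p << 16, (p << 16) | 0xFFFD) for p in range(0x1, 0xE)]
    + [(0xE1000, 0xEFFFD)]
)

# Ranges that, per the list quoted above, XML allows but discourages:
# FDD0-FDDF plus the last two code points of planes 1 to 16.
XML_DISCOURAGED_RANGES = (
    [(0xFDD0, 0xFDDF)]
    + [((p << 16) | 0xFFFE, (p << 16) | 0xFFFF) for p in range(0x1, 0x11)]
)


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any (low, high) range."""
    return any(low <= cp <= high for low, high in ranges)


def in_ucschar(cp):
    """Return True if cp matches the ucschar production as quoted above."""
    return in_ranges(cp, UCSCHAR_RANGES)


if __name__ == "__main__":
    # The code points Richard asks about: FDD*, FDE*, E0***, FFF*.
    for cp in (0xFDD0, 0xFDE0, 0xFDF0, 0xE0001, 0xE1000, 0xFFF0, 0xFFEF):
        print(f"U+{cp:04X}: ucschar={in_ucschar(cp)}, "
              f"xml_discouraged={in_ranges(cp, XML_DISCOURAGED_RANGES)}")
```

With these tables, every code point in the quoted XML list falls
outside ucschar, while FDE0-FDEF, E0000-E0FFF and FFF0-FFFD are outside
ucschar without appearing in the quoted XML list, which is the
discrepancy asked about above.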

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp

Received on Thursday, 28 June 2007 01:10:13 UTC