Re: IRI with compatibility character, unwise? from Tex Texin on 2005-10-03 (www-international@w3.org from October to December 2005)

From: Tex Texin <tex@xencraft.com>
Date: Mon, 03 Oct 2005 16:07:04 -0700
To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
CC: www-international@w3.org
Message-ID: <4341B998.E1773919@xencraft.com>

Jeremy,

You are correct that compatibility characters should be avoided in IRIs.
They are outright invalid in domain names.

However, the way to avoid compatibility characters is to apply Normalization
Form KC (NFKC), which substitutes for each compatibility character the Unicode
character or characters that represent it without the typographic or other
stylistic differences.
So the ligature for "fi" becomes the two characters "f" and "i", fullwidth
characters are represented by their halfwidth equivalents, superscripted
characters are represented by the plain characters, etc.
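
The substitutions described above can be checked with a minimal sketch,
assuming Python and its standard unicodedata module. The helper name to_nfkc
and the sample characters are illustrative choices, not part of the original
discussion.

```python
import unicodedata


def to_nfkc(text: str) -> str:
    """Return text in Normalization Form KC (hypothetical helper name)."""
    return unicodedata.normalize("NFKC", text)


# Illustrative compatibility characters and their Unicode names.
samples = {
    "\uFB01": "LATIN SMALL LIGATURE FI",
    "\uFF21": "FULLWIDTH LATIN CAPITAL LETTER A",
    "\u00B2": "SUPERSCRIPT TWO",
    "\uFF06": "FULLWIDTH AMPERSAND",
}

for ch, name in samples.items():
    # Show the code point, its name, and the NFKC replacement.
    print(f"U+{ord(ch):04X} {name} -> {to_nfkc(ch)!r}")
```

Each compatibility character is replaced by its plain counterpart rather than
being dropped, so the information it carried stays in the string.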

You might determine that some characters are invalid for your purposes and
recommend that they simply be removed; the spec for domain names does that as
well. However, simply applying NFKC and allowing substitution of
representative characters will keep more IRIs distinct and mnemonic.
For example, both AB and A&B map to AB if you remove ampersands.
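
The difference between removing and normalizing can be sketched as follows,
again assuming Python's standard unicodedata module. The function names
strip_ampersands and normalize_fragment are hypothetical, and the second name
in the list is written with the fullwidth ampersand U+FF06 as an assumed
illustration. The last lines apply the same normalization to the fragment
from the quoted message below.

```python
import unicodedata

FULLWIDTH_AMPERSAND = "\uFF06"


def strip_ampersands(fragment: str) -> str:
    """Delete both the fullwidth and the plain ampersand (lossy)."""
    return fragment.replace(FULLWIDTH_AMPERSAND, "").replace("&", "")


def normalize_fragment(fragment: str) -> str:
    """Apply NFKC so compatibility characters become plain equivalents."""
    return unicodedata.normalize("NFKC", fragment)


names = ["AB", "A" + FULLWIDTH_AMPERSAND + "B"]

# Removal collapses the two names into one; NFKC keeps them apart.
stripped = {strip_ampersands(n) for n in names}
normalized = {normalize_fragment(n) for n in names}
print(len(stripped), len(normalized))

# The fragment from the quoted message, with the \u escapes expanded.
fragment = "\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC"
print(normalize_fragment(fragment))
```

Only the fullwidth ampersand changes in that fragment; the hiragana and
katakana are not compatibility characters and pass through NFKC untouched.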

tex

Jeremy Carroll wrote:
> 
> Hello
> 
> I had a support question for the Jena Semantic Web software, concerning
> the following RDF URI Reference:
> 
> http://ontology.tos.co.jp/#\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC
> 
> where the \u escapes denote the unicode characters.
> 
> The initial problem was that this was input with the rdf:ID syntax, and
> that "\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC" is not an XML
> Name because of the half-width ampersand "\uFF06", which I note is a
> compatibility character.
> 
> The XML recommendation says:
> [[
> Characters in the compatibility area (i.e. with character code greater
> than #xF900 and less than #xFFFE) are not allowed in XML names.
> ]]
> 
> On further reading, I saw in RFC 3987 that:
> 
> http://www.ietf.org/rfc/rfc3987.txt
> [[
> On the other hand, in some cases, the UCS contains
>     variants for compatibility reasons; for example, for typographic
>     purposes.  These should be avoided wherever possible.  Although there
>     may be exceptions, newly created resource names should generally be
>     in NFKC [UTR15]
> ]]
> While not being familiar with the concept of NFKC, I believe this means
> that compatibility characters should be avoided when creating a new IRI.
> Since the document was creating this IRI, I advised that it should be
> changed (e.g. by deleting the half-width ampersand)
> 
> Presumably a different change would be to use a normal ampersand "&",
> which is legal in an IRI fragment, and not one to avoid when creating a
> new IRI. (Although illegal in an XML Name, for which there is a work-around)
> 
> Have I understood correctly?
> 
> Jeremy

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com

XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------

Received on Monday, 3 October 2005 23:07:16 UTC