- From: Tex Texin <tex@xencraft.com>
- Date: Mon, 03 Oct 2005 16:07:04 -0700
- To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- CC: www-international@w3.org
Jeremy,

You are correct that compatibility characters should be avoided in IRIs. They are outright invalid in domain names.

However, the technique for avoiding compatibility characters by applying Normalization Form KC (NFKC) is to substitute a Unicode character or characters that represent the compatibility character without its typographic or other stylistic differences. So the ligature for "fi" becomes the two characters "f" and "i", fullwidth characters are represented by halfwidth equivalents, superscripted characters are represented by the plain characters, etc.

You might determine that some characters are invalid for your purposes and recommend that they simply be removed; the spec for domain names does that as well. However, simply applying NFKC and allowing the substitution of representative characters will keep more IRIs distinct and mnemonic. For example, if you simply remove ampersands, both AB and A&B map to AB.

tex

Jeremy Carroll wrote:
>
> Hello
>
> I had a support question for the Jena Semantic Web software, concerning
> the following RDF URI Reference:
>
> http://ontology.tos.co.jp/#\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC
>
> where the \u escapes denote the unicode characters.
>
> The initial problem was that this was input with the rdf:ID syntax, and
> that "\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC" is not an XML
> Name because of the half-width ampersand "\uFF06", which I note is a
> compatibility character.
>
> The XML recommendation says:
> [[
> Characters in the compatibility area (i.e. with character code greater
> than #xF900 and less than #xFFFE) are not allowed in XML names.
> ]]
>
> On further reading, I saw in RFC 3987 that:
>
> http://www.ietf.org/rfc/rfc3987.txt
> [[
> On the other hand, in some cases, the UCS contains
> variants for compatibility reasons; for example, for typographic
> purposes. These should be avoided wherever possible. Although there
> may be exceptions, newly created resource names should generally be
> in NFKC [UTR15]
> ]]
> While not being familiar with the concept of NFKC, I believe this means
> that compatibility characters should be avoided when creating a new IRI.
> Since the document was creating this IRI, I advised that it should be
> changed (e.g. by deleting the half-width ampersand)
>
> Presumably a different change would be to use a normal ampersand "&",
> which is legal in an IRI fragment, and not one to avoid when creating a
> new IRI. (Although illegal in an XML Name, for which there is a work-around)
>
> Have I understood correctly?
>
> Jeremy

--
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
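
As a rough illustration of the NFKC substitutions described in the reply above, the following sketch uses Python's standard unicodedata module. It is not part of the original exchange and has nothing to do with how Jena handles IRIs; the helper names nfkc and compatibility_chars are hypothetical, and the fragment is the one from the quoted RDF URI Reference. It shows that normalizing replaces U+FF06 with a plain "&" rather than deleting it.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Return text in Normalization Form KC (hypothetical helper name)."""
    return unicodedata.normalize("NFKC", text)


def compatibility_chars(text: str) -> list[str]:
    """List the characters in text that have a compatibility decomposition.

    In the Unicode Character Database, compatibility decompositions are
    tagged with a formatting tag in angle brackets, e.g. "<wide> 0026".
    """
    return [ch for ch in text if unicodedata.decomposition(ch).startswith("<")]


# Fragment identifier from the quoted message, written with \u escapes.
fragment = "\u304A\u3082\u3061\u3083\uFF06\u30DB\u30D3\u30FC"

if __name__ == "__main__":
    # Print with escapes so the output is safe on any terminal encoding.
    offenders = compatibility_chars(fragment)
    print([f"U+{ord(ch):04X}" for ch in offenders])
    print(nfkc(fragment).encode("unicode_escape").decode("ascii"))
```

With this approach the kana in the fragment are left untouched and only the compatibility character is substituted, so two names that differ only by an ampersand remain distinct after normalization. Note that a plain "&" is still not allowed in an XML Name, so the rdf:ID issue raised in the quoted message would need its own work-around.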
Received on Monday, 3 October 2005 23:07:16 UTC