question about IRI spec

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Wed, 04 Jan 2006 11:53:30 +0000
Message-ID: <43BBB73A.60705@hpl.hp.com>
To: "www-international@w3.org" <www-international@w3.org>


I am implementing RFC 3987.
(BSD-style license, Java, part of the Jena semantic web framework, 
motivated by SPARQL's dependency on RFC 3987.)

Two parts I would like advice with are the following:

http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5
A: single script
[[
To avoid such cases, only IRIs should be created where all the 
characters in a single component are used together in a given language. 
This usually means that all of these characters will be from the same 
script, but there are languages that mix characters from different 
scripts (such as Japanese).
]]

B: NFKC
[[
Although there may be exceptions, newly created resource names should 
generally be in NFKC [UTR15] (which means that they are also in NFC).
]]

On A, could someone please articulate that in an automatable fashion?
For example: use such-and-such a table from unicode.org, map each 
character of each IRI component to its script code, and treat the 
component as OK if the set of script codes used is either a singleton 
set, or the set { hiragana, kanji, katakana } or { ... }.
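
To make the shape of that check concrete, here is a minimal sketch, not 
a proposal for the normative rule. It assumes Java's built-in 
Character.UnicodeScript as the script table, treats "kanji" as the HAN 
script, and ignores the COMMON and INHERITED pseudo-scripts (digits, 
punctuation, combining marks). The class and method names are 
hypothetical, and the only whitelisted mix is the Japanese one mentioned 
in the RFC text.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

final class ComponentScripts {
    // Assumption: the only permitted multi-script mix is the Japanese one.
    // Other languages that mix scripts would need their own entries.
    private static final Set<UnicodeScript> JAPANESE =
        EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    // Collect the scripts used in one IRI component, skipping characters
    // that are shared between scripts (COMMON) or take their script from
    // the preceding character (INHERITED).
    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        component.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                scripts.add(s);
            }
        });
        return scripts;
    }

    // OK if at most one script is used, or if every script used belongs
    // to the whitelisted Japanese mix.
    static boolean looksSingleLanguage(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

A component that fails looksSingleLanguage would only produce a warning, 
since a script-level test is just an approximation of "used together in 
a given language".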

For B, my code does an initial pass over the characters in each 
component, looking for problematic sequences, e.g. "--" in the host or 
"/./" in the path. If it finds any, it may trigger more expensive 
processing (e.g. IDNA syntax checking). What characters should I be 
looking for in the component? That is, please suggest a set of 
characters such that if none of them occurs in the IRI, then the IRI is 
necessarily in NFKC. An example would be the set [^\x20-\x7F], which 
would at least allow me to avoid NFKC checking for URIs. Again, I am 
expecting an answer in terms of some table from unicode.org, e.g. if 
each character is neither a compatibility character nor a combining 
character then the component is in NFKC.
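
The two-stage approach described above can be sketched as follows. This 
is only an illustration under stated assumptions: the cheap first pass 
is the ASCII test from the example set, and the expensive second pass 
uses java.text.Normalizer from the standard library. The class and 
method names are hypothetical.

```java
import java.text.Normalizer;

final class NfkcCheck {
    // Cheap first pass: true if every UTF-16 code unit is ASCII.
    // A pure-ASCII string is unchanged by NFKC normalization.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Full check, with the expensive normalization test run only
    // when the cheap pass finds a non-ASCII character.
    static boolean isNfkc(String component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A tighter trigger set than "any non-ASCII character" would let more 
components skip the second pass; the sketch does not attempt that.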

Given the weak language in both of these statements, violations would 
by default produce warnings.

(I suspect I will send further messages about bidi)

thanks in advance

Jeremy
Received on Wednesday, 4 January 2006 12:06:49 GMT