question about IRI spec

I am implementing RFC 3987.
(BSD-style license, Java, part of the Jena Semantic Web framework, 
motivated by SPARQL's dependency on RFC 3987.)

I would like advice on the following two parts:

http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5
A: single script
[[
To avoid such cases, only IRIs should be created where all the 
characters in a single component are used together in a given language. 
This usually means that all of these characters will be from the same 
script, but there are languages that mix characters from different 
scripts (such as Japanese).
]]

B: NFKC
[[
Although there may be exceptions, newly created resource names should 
generally be in NFKC [UTR15] (which means that they are also in NFC).
]]

On A, could someone please articulate that in an automatable fashion?
E.g. use such and such a table from unicode.org, and for each IRI 
component map each character to its script code; the component is then 
OK if the set of script codes used is either a singleton set, or the 
set { hiragana, kanji (Han), katakana } or { ... }.
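
To make the shape of the check I have in mind concrete, here is a 
minimal sketch. It assumes java.lang.Character.UnicodeScript (Java 7 
and later, derived from the Scripts.txt data on unicode.org) is an 
acceptable source of script codes. The class and method names are 
made up for illustration, the allow-list of mixed scripts contains only 
the Japanese case, and ignoring the COMMON, INHERITED and UNKNOWN 
values is my assumption, not something the RFC says.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of Jena or any existing library.
final class ScriptCheck {
    // Assumed allow-list of one mixed-script set; which other sets belong
    // here is exactly the open question.
    private static final Set<Character.UnicodeScript> JAPANESE = EnumSet.of(
            Character.UnicodeScript.HAN,
            Character.UnicodeScript.HIRAGANA,
            Character.UnicodeScript.KATAKANA);

    // Collect the scripts used in one IRI component, skipping characters
    // (digits, punctuation, combining marks) that do not belong to a
    // specific script.
    static Set<Character.UnicodeScript> scriptsOf(String component) {
        Set<Character.UnicodeScript> scripts =
                EnumSet.noneOf(Character.UnicodeScript.class);
        for (int cp : component.codePoints().toArray()) {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            if (s != Character.UnicodeScript.COMMON
                    && s != Character.UnicodeScript.INHERITED
                    && s != Character.UnicodeScript.UNKNOWN) {
                scripts.add(s);
            }
        }
        return scripts;
    }

    // OK if at most one script is used, or all scripts fall within the
    // allowed mixed set.
    static boolean isAcceptable(String component) {
        Set<Character.UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

With this sketch, a component mixing Latin and Cyrillic letters would 
be flagged, while one mixing Han, hiragana and katakana would pass.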

For B, my code does an initial pass over the characters in each 
component, looking for problematic sequences, e.g. "--" in the host or 
"/./" in the path. If it finds any, it may trigger more expensive 
processing (e.g. IDNA syntax checking). Which characters should I be 
looking for in the component? That is, please suggest a set of 
characters such that if none of them appears in the IRI, then the IRI 
is necessarily in NFKC. An example would be the set [^\x20-\x7F], 
which would at least allow me to avoid NFKC checking for URIs. Again I 
am expecting an answer in terms of some table from unicode.org, e.g. 
if each character is neither a compatibility character nor a composing 
character, then the component is in NFKC.
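
The two-stage approach described above can be sketched roughly as 
follows. This is only an illustration: the cheap first pass is the 
ASCII-only test from my example (every pure-ASCII string is already in 
NFKC), and the expensive fallback uses java.text.Normalizer from 
Java 6 and later rather than a table lookup. The NFKC_Quick_Check 
property in DerivedNormalizationProps.txt on unicode.org looks like a 
candidate for a finer first pass, but this sketch does not use it. The 
class and method names are made up.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Jena or any existing library.
final class NfkcCheck {
    // Cheap pass: true if every char is in the ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // ASCII-only components are necessarily in NFKC, so the full
    // normalization check runs only when a non-ASCII char is present.
    static boolean isNfkc(String component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A narrower trigger set than "any non-ASCII character" would let more 
components skip the call to Normalizer, which is what I am asking for.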

Given the weak language in both of these statements, violations would 
by default produce warnings.

(I suspect I will send further messages about bidi)

thanks in advance

Jeremy

Received on Wednesday, 4 January 2006 12:06:49 UTC