
question about IRI spec

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Wed, 04 Jan 2006 11:53:30 +0000
Message-ID: <43BBB73A.60705@hpl.hp.com>
To: "www-international@w3.org" <www-international@w3.org>

I am implementing RFC 3987.
(BSD style license, Java, part of Jena semantic web framework, motivated 
by SPARQL dependency on RFC 3987)

Two parts I would like advice with are the following:

A: single script
To avoid such cases, only IRIs should be created where all the 
characters in a single component are used together in a given language. 
This usually means that all of these characters will be from the same 
script, but there are languages that mix characters from different 
scripts (such as Japanese).

B: NFKC
Although there may be exceptions, newly created resource names should 
generally be in NFKC [UTR15] (which means that they are also in NFC).

On A, could someone please articulate that in an automatable fashion?
E.g. use such and such a table from unicode.org, and for each IRI 
component map each character to its script code, and then the component 
is OK if the set of script codes used is either a singleton set, or the 
set { hiragana, kanji, katakana } or { ... }.
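
The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not an answer from the IRI spec: it uses Java's built-in Character.UnicodeScript (which reflects the Unicode Script property, where kanji are reported as HAN), it ignores the COMMON and INHERITED pseudo-scripts so that digits, punctuation and combining marks do not count as a second script, and the class name ScriptMixCheck and the ALLOWED_MIXES whitelist are hypothetical. Only the Japanese combination from the example above is listed; any further combinations would have to come from whoever can answer the question.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class ScriptMixCheck {
    // Assumption: whitelist of script combinations that one language uses together.
    // Only the Japanese case mentioned in the text is included here.
    private static final List<Set<Character.UnicodeScript>> ALLOWED_MIXES = List.of(
        EnumSet.of(Character.UnicodeScript.HAN,
                   Character.UnicodeScript.HIRAGANA,
                   Character.UnicodeScript.KATAKANA));

    // Collect the scripts used in one IRI component, skipping COMMON and INHERITED.
    static Set<Character.UnicodeScript> scriptsOf(String component) {
        EnumSet<Character.UnicodeScript> scripts =
            EnumSet.noneOf(Character.UnicodeScript.class);
        component.codePoints().forEach(cp -> {
            Character.UnicodeScript s = Character.UnicodeScript.of(cp);
            if (s != Character.UnicodeScript.COMMON
                    && s != Character.UnicodeScript.INHERITED) {
                scripts.add(s);
            }
        });
        return scripts;
    }

    // OK if at most one script is used, or the scripts fit inside one allowed mix.
    static boolean isAcceptable(String component) {
        Set<Character.UnicodeScript> scripts = scriptsOf(component);
        if (scripts.size() <= 1) {
            return true;
        }
        for (Set<Character.UnicodeScript> mix : ALLOWED_MIXES) {
            if (mix.containsAll(scripts)) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch a component that mixes Latin and Cyrillic letters would be flagged, which fits the warning-level treatment proposed further down.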

For B, my code does an initial pass of the characters in each component, 
looking for problematic characters e.g. "--" in the host, or "/./" in 
the path. If it finds such problematic characters it may trigger more 
expensive processing (e.g. IDNA syntax checking). What are the 
characters I should be looking for in the component? i.e. please suggest 
a set of characters such that if none of these characters is in the 
IRI then it is necessarily in NFKC. An example would be the set 
[^\x20-\x7F] which would at least allow me to avoid NFKC checking for 
URIs. Again I am expecting an answer in terms of some table from 
unicode.org. e.g. if each character is neither a compatibility character 
nor a composing character then the component is in NFKC.
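
The two-stage approach for B could look something like the sketch below. It assumes the ASCII-only fast path from the example above (ASCII text is unchanged by every Unicode normalization form) and falls back to java.text.Normalizer for anything else. The class and method names (NfkcPrefilter, isAscii, isNfkc) are hypothetical. A finer-grained trigger set could presumably be built from the NFKC_Quick_Check property that unicode.org publishes in DerivedNormalizationProps.txt, but that table is not used here.

```java
import java.text.Normalizer;

final class NfkcPrefilter {
    // Cheap first pass: true if every UTF-16 code unit is in the ASCII range.
    static boolean isAscii(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // ASCII-only components are always in NFKC, so the expensive check
    // is only triggered when a non-ASCII character is present.
    static boolean isNfkc(CharSequence component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A component for which isNfkc returns false would then produce a warning rather than an error.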

Given the weak language in both these assertions, violations would by 
default produce warnings.

(I suspect I will send further messages about bidi)

thanks in advance

Received on Wednesday, 4 January 2006 12:06:49 UTC
