Re: question about IRI spec from Jeremy Carroll on 2006-01-04 (www-international@w3.org from January to March 2006)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Wed, 04 Jan 2006 13:59:46 +0000
To: Bjoern Hoehrmann <derhoermi@gmx.net>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <43BBD4D2.2030108@hpl.hp.com>

Bjoern Hoehrmann wrote:
> * Jeremy Carroll wrote:
>> For B, my code does an initial pass of the characters in each component, 
>> looking for problematic characters e.g. "--" in the host, or "/./" in 
>> the path. If it finds such problematic characters it may trigger more 
>> expensive processing (e.g. IDNA syntax checking). What are the 
>> characters I should be looking for in the component? i.e. please suggest 
>> a set of characters such that if none of them occurs in the IRI, then 
>> the IRI is necessarily in NFKC. An example would be the set 
>> [^\x20-\x7F], which would at least allow me to avoid NFKC checking for 
>> URIs. Again I am expecting an answer in terms of some table from 
>> unicode.org, e.g. if each character is neither a compatibility character 
>> nor a combining character then the component is in NFKC.
> 
> http://www.unicode.org/unicode/reports/tr15/ has a quickCheck function
> for that. I guess libraries like ICU already offer something like it.
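
As an illustration of the library route mentioned above, here is a minimal Python sketch. It assumes Python 3.8 or later, where the standard `unicodedata` module provides `is_normalized`; the function name `component_is_nfkc` is hypothetical and not taken from any code discussed in this thread. The ASCII shortcut mirrors the cheap first pass described in the question: a component made only of ASCII characters is already in NFKC.

```python
import unicodedata


def component_is_nfkc(component: str) -> bool:
    """Return True if the IRI component is already in NFKC.

    Hypothetical helper: ASCII-only input is accepted without further
    work, everything else is handed to the standard library.
    """
    # Cheap pre-filter: pure ASCII text is unchanged by NFKC.
    if component.isascii():
        return True
    # Full check for anything containing non-ASCII characters.
    return unicodedata.is_normalized("NFKC", component)


if __name__ == "__main__":
    for sample in ["example.org", "caf\u00e9", "cafe\u0301", "\ufb01le"]:
        print(repr(sample), component_is_nfkc(sample))
```

The result depends on the Unicode version bundled with the interpreter, so it may differ slightly from a check built on a specific release of the Unicode data files.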

Wonderful:

I got to:
http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
with entries like:
0BBE          ; NFKC_QC; M # Mc       TAMIL VOWEL SIGN AA

which is just what I need.
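
A rough Python sketch of how entries in that format could drive a quick check, in the spirit of the quickCheck function described in TR15. The names `parse_qc` and `quick_check` are hypothetical, the sample data contains only the entry quoted above, and the handling of `XXXX..YYYY` ranges assumes the usual layout of the Unicode data files. Note that the table alone is not quite enough: the check also has to reject combining marks that are out of canonical order, which is done here with `unicodedata.combining`.

```python
import unicodedata

# Only the entry quoted in this message; the real file has many more.
SAMPLE = """\
0BBE          ; NFKC_QC; M # Mc       TAMIL VOWEL SIGN AA
"""


def parse_qc(lines, prop="NFKC_QC"):
    """Map code points to 'N' or 'M' for the given quick-check property."""
    table = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) != 3 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")  # single value or range
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = fields[2]
    return table


def quick_check(text, table):
    """Return 'Y' (normalized), 'N' (not normalized) or 'M' (maybe)."""
    result = "Y"
    last_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and last_ccc > ccc:
            return "N"  # combining marks out of canonical order
        value = table.get(ord(ch), "Y")  # unlisted code points are 'Y'
        if value == "N":
            return "N"
        if value == "M":
            result = "M"
        last_ccc = ccc
    return result


if __name__ == "__main__":
    table = parse_qc(SAMPLE.splitlines())
    print(quick_check("abc", table), quick_check("\u0bbe", table))
```

Only an "M" result would need the more expensive step of actually normalizing the component and comparing it with the input.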

Thanks muchly

Jeremy

Received on Wednesday, 4 January 2006 14:01:01 UTC