Re: question about IRI spec

Bjoern Hoehrmann wrote:
> * Jeremy Carroll wrote:
>> For B, my code does an initial pass of the characters in each component, 
>> looking for problematic characters e.g. "--" in the host, or "/./" in 
>> the path. If it finds such problematic characters it may trigger more 
>> expensive processing (e.g. IDNA syntax checking). What are the 
>> characters I should be looking for in the component? i.e. please suggest 
>> a set of characters such that if none of these characters is in the 
>> IRI then it is necessarily in NFKC? An example would be the set 
>> [^\x20-\x7F] which would at least allow me to avoid NFKC checking for 
>> URIs. Again I am expecting an answer in terms of some table from 
>> unicode.org. e.g. if each character is neither a compatibility character 
>> nor a combining character then the component is in NFKC.
> 
> http://www.unicode.org/unicode/reports/tr15/ has a quickCheck function
> for that. I guess libraries like ICU already offer something like it.

Wonderful.

Following the quickCheck reference, I got to:
http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
with entries like:
0BBE          ; NFKC_QC; M # Mc       TAMIL VOWEL SIGN AA

which is just what I need.
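
For the record, here is a minimal sketch of how I expect to use those entries, following the quickCheck outline in TR15. It is written in Python for brevity and is not my actual code. The names parse_nfkc_qc, quick_check and SAMPLE_LINES are made up for illustration, and only the first sample line is copied from the file; the second is written in the same format and is an assumption for the example (U+00A0 has a compatibility decomposition, so I expect it to be listed with N).

```python
import unicodedata

# First line is the entry quoted above; the second is an illustrative
# line in the same format (an assumption, not copied from the file).
SAMPLE_LINES = [
    "0BBE          ; NFKC_QC; M # Mc       TAMIL VOWEL SIGN AA",
    "00A0          ; NFKC_QC; N # Zs       NO-BREAK SPACE",
]

def parse_nfkc_qc(lines):
    """Map code point -> 'N' or 'M' for NFKC_QC entries; absent means 'Y'."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3 or fields[1] != "NFKC_QC":
            continue                           # some other property
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo     # single code point or range
        for cp in range(lo, hi + 1):
            table[cp] = fields[2]
    return table

def quick_check(text, table):
    """Return 'Y' (in NFKC), 'N' (not in NFKC) or 'M' (needs a full check)."""
    last_ccc = 0
    result = "Y"
    for ch in text:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order can never be normalized.
        if ccc != 0 and last_ccc > ccc:
            return "N"
        value = table.get(ord(ch), "Y")
        if value == "N":
            return "N"
        if value == "M":
            result = "M"
        last_ccc = ccc
    return result

table = parse_nfkc_qc(SAMPLE_LINES)
print(quick_check("http://example.org/a/b", table))  # Y
print(quick_check("\u0BBE", table))                  # M
print(quick_check("a\u00A0b", table))                # N
```

Only an M result needs the expensive path (normalize and compare); Y and N are final. With the full file loaded, a pure ASCII URI always comes back Y, which is the cheap case I was after.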

Thanks muchly

Jeremy

Received on Wednesday, 4 January 2006 14:01:01 UTC