Re: question about IRI spec from Bjoern Hoehrmann on 2006-01-04 (www-international@w3.org from January to March 2006)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Wed, 04 Jan 2006 13:18:06 +0100
To: Jeremy Carroll <jjc@hpl.hp.com>
Cc: "www-international@w3.org" <www-international@w3.org>
Message-ID: <92fnr11167kdnm2o6a47t8bpj9oakb2apj@hive.bjoern.hoehrmann.de>

* Jeremy Carroll wrote:
>For B, my code does an initial pass of the characters in each component, 
>looking for problematic characters e.g. "--" in the host, or "/./" in 
>the path. If it finds such problematic characters it may trigger more 
>expensive processing (e.g. IDNA syntax checking). What are the 
>characters I should be looking for in the component? i.e. please suggest 
>a set of characters such that if none of these characters is in the
>IRI then it is necessarily in NFKC? An example would be the set
>[^\x20-\x7F] which would at least allow me to avoid NFKC checking for
>URIs. Again I am expecting an answer in terms of some table from
>unicode.org. e.g. if each character is neither a compatibility character
>nor a composing character then the component is in NFKC.

http://www.unicode.org/unicode/reports/tr15/ has a quickCheck function
for that. I guess libraries like ICU already offer something like it.
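
As a rough illustration of the two-stage approach discussed above (a cheap
pre-pass, then a real check only when needed), here is a minimal sketch in
Python. It assumes Python 3.8 or later, where the standard library provides
unicodedata.is_normalized(). The function name is_nfkc is hypothetical and
chosen only for this example. This is not the quickCheck code from TR15
itself, only one way to get a comparable answer from an existing library.

```python
import unicodedata


def is_nfkc(component: str) -> bool:
    """Return True if the IRI component is already in NFKC.

    Hypothetical helper for illustration only.
    """
    # Cheap pre-pass: ASCII characters are unchanged by every Unicode
    # normalization form, so a pure-ASCII component (e.g. a plain URI
    # component) needs no further checking.
    if component.isascii():
        return True
    # Slower path: let the library decide whether the string is normalized.
    return unicodedata.is_normalized("NFKC", component)


if __name__ == "__main__":
    for sample in ["example.org", "\u00c5", "A\u030a", "\u212b", "\ufb01"]:
        print(ascii(sample), is_nfkc(sample))
```

The ASCII test plays the role of the character-set pre-pass from the
question; anything outside that range falls through to the library check.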
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Wednesday, 4 January 2006 15:04:24 UTC