- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Wed, 04 Jan 2006 13:59:46 +0000
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: "www-international@w3.org" <www-international@w3.org>
Bjoern Hoehrmann wrote:
> * Jeremy Carroll wrote:
>> For B, my code does an initial pass over the characters in each component,
>> looking for problematic characters, e.g. "--" in the host or "/./" in
>> the path. If it finds such problematic characters it may trigger more
>> expensive processing (e.g. IDNA syntax checking). What are the
>> characters I should be looking for in the component? I.e., please suggest
>> a set of characters such that, if none of these characters is in the
>> IRI, then it is necessarily in NFKC. An example would be the set
>> [^\x20-\x7F], which would at least allow me to avoid NFKC checking for
>> URIs. Again, I am expecting an answer in terms of some table from
>> unicode.org, e.g. if each character is neither a compatibility character
>> nor a composing character then the component is in NFKC.
>
> http://www.unicode.org/unicode/reports/tr15/ has a quickCheck function
> for that. I guess libraries like ICU already offer something like it.

Wonderful. I got to:

http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

with entries like:

0BBE ; NFKC_QC; M # Mc TAMIL VOWEL SIGN AA

which is just what I need.

Thanks muchly

Jeremy
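As a rough illustration of the approach discussed above, here is a minimal Python sketch, not the code from this thread. It assumes Python 3.8 or later for unicodedata.is_normalized; the function names parse_qc_line, needs_nfkc_check and is_nfkc are hypothetical. It shows how an entry in DerivedNormalizationProps.txt might be parsed, and how the cheap [^\x20-\x7F] pre-filter can be used to skip the more expensive NFKC check for plain URIs.

```python
import unicodedata


def parse_qc_line(line):
    """Parse one entry such as '0BBE ; NFKC_QC; M # Mc TAMIL VOWEL SIGN AA'.

    Returns (first_code_point, last_code_point, property, value), or None
    for comments, blank lines, and entries without a value field.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) < 3:
        return None
    # The first field is a single code point or a range "XXXX..YYYY".
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    return start, end, fields[1], fields[2]


def needs_nfkc_check(component):
    """Cheap first pass: True if any character is outside U+0020..U+007F."""
    return any(not (0x20 <= ord(ch) <= 0x7F) for ch in component)


def is_nfkc(component):
    """Return True if the component is already in NFKC."""
    if not needs_nfkc_check(component):
        return True  # ASCII-only text is unchanged by NFKC
    # Assumption: rely on the standard library (Python 3.8+) rather than
    # a hand-built table derived from the NFKC_QC entries.
    return unicodedata.is_normalized("NFKC", component)
```

A component such as "/a/b/c" never reaches the normalization check, while one containing U+FB01 (LATIN SMALL LIGATURE FI) does and is reported as not being in NFKC.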
Received on Wednesday, 4 January 2006 14:01:01 UTC