- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Wed, 04 Jan 2006 11:53:30 +0000
- To: "www-international@w3.org" <www-international@w3.org>
I am implementing RFC 3987. (BSD style license, Java, part of the Jena semantic web framework, motivated by the SPARQL dependency on RFC 3987.)

Two parts I would like advice on are the following, both from:
http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5

A: single script
[[
To avoid such cases, only IRIs should be created where all the characters in a single component are used together in a given language. This usually means that all of these characters will be from the same script, but there are languages that mix characters from different scripts (such as Japanese).
]]

B: NFKC
[[
Although there may be exceptions, newly created resource names should generally be in NFKC [UTR15] (which means that they are also in NFC).
]]

On A, could someone please articulate that in an automatable fashion? E.g. use such and such a table from unicode.org, and for each IRI component map each character to its script code; the component is then OK if the set of script codes used is either a singleton set, or the set { hiragana, kanji, katakana }, or { ... }.

For B, my code does an initial pass over the characters in each component, looking for problematic characters, e.g. "--" in the host, or "/./" in the path. If it finds such problematic characters it may trigger more expensive processing (e.g. IDNA syntax checking). What are the characters I should be looking for in the component? I.e. please suggest a set of characters such that, if none of these characters is in the IRI, then it is necessarily in NFKC. An example would be the set [^\x20-\x7F], which would at least allow me to avoid NFKC checking for URIs. Again, I am expecting an answer in terms of some table from unicode.org, e.g. if each character is neither a compatibility character nor a composing character then the component is in NFKC.

Given the weak language in both these assertions, violations would by default produce warnings.

(I suspect I will send further messages about bidi.)

Thanks in advance,

Jeremy
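As an illustration only, the two checks described above could be sketched in Java roughly as follows. This is a minimal sketch under stated assumptions, not something RFC 3987 prescribes: the class and method names are hypothetical, the script lookup relies on java.lang.Character.UnicodeScript (Java 7 and later), the only multi-script combination accepted is the Japanese one mentioned in the message, and characters whose script is COMMON or INHERITED (digits, punctuation, combining marks) are ignored. For B, the cheap pre-pass uses the fact that ASCII text is unchanged by NFKC, and anything else falls through to java.text.Normalizer.

```java
import java.lang.Character.UnicodeScript;
import java.text.Normalizer;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper class; the names are illustrative, not part of Jena or RFC 3987.
final class IriComponentChecks {

    // Assumption: Japanese is the only language whose script mix is whitelisted here.
    private static final Set<UnicodeScript> JAPANESE =
            EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    /** Collects the scripts used in a component, ignoring COMMON and INHERITED. */
    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        component.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** Check A: at most one script, or a subset of { Han, Hiragana, Katakana }. */
    static boolean hasPlausibleScriptMix(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }

    /** Cheap pre-pass for check B: ASCII-only text is always already in NFKC. */
    static boolean isAsciiOnly(String component) {
        for (int i = 0; i < component.length(); i++) {
            if (component.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** Check B: skip the expensive test for ASCII, otherwise ask the normalizer. */
    static boolean isNfkc(String component) {
        return isAsciiOnly(component)
                || Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A caller would run both checks per IRI component and emit a warning, rather than an error, when either returns false. A finer-grained pre-pass than the ASCII test would need a per-character table such as the NFKC quick-check property from unicode.org, which this sketch does not attempt.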
Received on Wednesday, 4 January 2006 12:06:49 UTC