- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Wed, 04 Jan 2006 11:53:30 +0000
- To: "www-international@w3.org" <www-international@w3.org>
I am implementing RFC 3987.
(BSD style license, Java, part of Jena semantic web framework, motivated
by SPARQL dependency on RFC 3987)
Two parts I would like advice with are the following:
http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5
A: single script
[[
To avoid such cases, only IRIs should be created where all the
characters in a single component are used together in a given language.
This usually means that all of these characters will be from the same
script, but there are languages that mix characters from different
scripts (such as Japanese).
]]
B: NFKC
[[
Although there may be exceptions, newly created resource names should
generally be in NFKC [UTR15] (which means that they are also in NFC).
]]
On A, could someone articulate that in an automatable fashion, please?
For example: use such and such a table from unicode.org, and for each IRI
component map each character to its script code, and then the component
is OK if the set of script codes used is either a singleton set, or the
set { hiragana, kanji, katakana } or { ... }.
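To make the proposal for A concrete, here is a minimal sketch of the kind of check I have in mind. It assumes a Java runtime that exposes the Unicode Script property as Character.UnicodeScript (Java 7 or later; component.codePoints() needs Java 8). The class and method names are hypothetical, ignoring the Common and Inherited scripts is my own assumption, and the only permitted mixed set is the Japanese one mentioned above; any further sets would have to come from whoever answers this question.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, not part of Jena or of RFC 3987.
final class ScriptCheck {

    // The one mixed-script set named in the text: { hiragana, kanji, katakana }.
    // Kanji are reported by the Unicode Script property as HAN.
    private static final Set<UnicodeScript> JAPANESE =
            EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    // Collects the scripts used in one IRI component.
    // Assumption: COMMON and INHERITED (digits, punctuation, combining marks)
    // are ignored, since they occur alongside every script.
    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        component.codePoints().forEach(cp -> {
            UnicodeScript s = UnicodeScript.of(cp);
            if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                scripts.add(s);
            }
        });
        return scripts;
    }

    // OK if at most one script is used, or if all scripts fall in the Japanese set.
    static boolean isAcceptable(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

With this sketch, a component mixing Latin and Cyrillic letters (for instance "p\u0430ypal", where \u0430 is the Cyrillic letter a) would be flagged, while a Japanese component mixing kanji, hiragana and katakana would pass.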
For B, my code does an initial pass of the characters in each component,
looking for problematic characters e.g. "--" in the host, or "/./" in
the path. If it finds such problematic characters it may trigger more
expensive processing (e.g. IDNA syntax checking). What are the
characters I should be looking for in the component? That is, please suggest
a set of characters such that if none of these characters is in the
IRI then it is necessarily in NFKC. An example would be the set
[^\x20-\x7F], which would at least allow me to avoid NFKC checking for
URIs. Again, I am expecting an answer in terms of some table from
unicode.org, e.g. if each character is neither a compatibility character
nor a combining character then the component is in NFKC.
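For B, the two-stage approach could look roughly like the following sketch. It assumes java.text.Normalizer (Java 6 or later) for the expensive step; the class and method names are hypothetical. The cheap pre-pass here is only the ASCII test from my example set, which is a placeholder for whatever better character set is suggested; I understand the NFKC_Quick_Check property in the Unicode data files may be the relevant table, but the JDK does not expose it directly as far as I know.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Jena or of RFC 3987.
final class NfkcCheck {

    // Cheap pre-pass: true if every char is in the ASCII range.
    // ASCII text is unchanged by NFKC, so no further work is needed.
    static boolean isAscii(String component) {
        for (int i = 0; i < component.length(); i++) {
            if (component.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Only falls back to the full normalization check when the
    // pre-pass finds a character outside the known-safe set.
    static boolean isNfkc(String component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

A component for which isNfkc returns false would, per the weak wording of the RFC, produce a warning rather than an error.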
Given the weak language in both these assertions, violations would by
default produce warnings.
(I suspect I will send further messages about bidi)
thanks in advance
Jeremy
Received on Wednesday, 4 January 2006 12:06:49 UTC