- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Thu, 05 Jan 2006 11:39:33 +0900
- To: Jeremy Carroll <jjc@hpl.hp.com>, "www-international@w3.org" <www-international@w3.org>
- Cc: public-iri@w3.org
Hello Jeremy,
[Cross-posting the IRI mailing list.]
At 20:53 06/01/04, Jeremy Carroll wrote:
>
>
>I am implementing RFC 3987.
>(BSD style license, Java, part of Jena semantic web framework, motivated
>by SPARQL dependency on RFC 3987)
>
>Two parts I would like advice with are the following:
>
>http://www.apps.ietf.org/rfc/rfc3987.html#sec-7.5
Please note that Section 7 as a whole is entitled
"URI/IRI Processing Guidelines (informative)". Although it contains words
such as 'should' and 'recommended', these are not upper-cased and are
therefore not RFC 2119 keywords. For the next version of the spec, we may
want to think about changing the language so that these words are no longer used.
Another caveat is that subsection 7.5 is about IRI creation.
It is important that, e.g. in a Java context, this isn't confused with
the creation of an object of an IRI class. Most of these objects will
be created from pre-existing IRIs, to which these considerations don't
apply directly (although rather similar security considerations may
still apply).
>A: single script
>[[
>To avoid such cases, only IRIs should be created where all the characters
>in a single component are used together in a given language. This usually
>means that all of these characters will be from the same script, but there
>are languages that mix characters from different scripts (such as Japanese).
>]]
>
>On A, could someone articulate that in an automatable fashion please.
>e.g. use such and such a table from unicode.org, and for each IRI
>component map each character to its script code, and then the component is
>OK if the set of script codes used is either a singleton set, or the set {
>hiragana, kanji, katakana } or { ... }.
John Cowan, in
http://lists.w3.org/Archives/Public/www-international/2006JanMar/0003.html,
already gave a pointer to http://www.unicode.org/Public/UNIDATA/Scripts.txt
as well as some more language examples. Please also have a look through
http://www.unicode.org/reports/tr36/ (Unicode Security Considerations) and
http://www.unicode.org/reports/tr39/tr39-1.html (Unicode Security Mechanisms,
proposed draft). Although creation (you try to create an IRI that survives
well) is different from security (others try to avoid IRIs that may trick them
into believing the IRIs are something other than what they are), the data and
algorithms needed for detection are very close.
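As a rough illustration of the kind of check asked about under A, here is a minimal Java sketch. It assumes a Java runtime that provides `Character.UnicodeScript` (whose data corresponds to Scripts.txt); the class name `ScriptCheck`, the method names, and the single whitelisted combination { Han, Hiragana, Katakana } are hypothetical choices for illustration, not something RFC 3987 or the Unicode reports prescribe. A real implementation would need a larger, reviewed list of acceptable script combinations.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper: not part of Jena or of any specification.
final class ScriptCheck {
    // Assumed whitelist: the one mixed-script combination named in the thread.
    private static final Set<UnicodeScript> JAPANESE =
        EnumSet.of(UnicodeScript.HAN, UnicodeScript.HIRAGANA, UnicodeScript.KATAKANA);

    // Collects the scripts used in one IRI component, skipping characters
    // (digits, punctuation, combining marks) that are shared across scripts.
    static Set<UnicodeScript> scriptsOf(String component) {
        Set<UnicodeScript> scripts = EnumSet.noneOf(UnicodeScript.class);
        for (int i = 0; i < component.length(); ) {
            int cp = component.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.COMMON && script != UnicodeScript.INHERITED) {
                scripts.add(script);
            }
            i += Character.charCount(cp);
        }
        return scripts;
    }

    // True if the component uses at most one script, or only scripts
    // from the assumed Japanese combination.
    static boolean looksSingleLanguage(String component) {
        Set<UnicodeScript> scripts = scriptsOf(component);
        return scripts.size() <= 1 || JAPANESE.containsAll(scripts);
    }
}
```

A component for which `looksSingleLanguage` returns false would be a candidate for a warning rather than an outright rejection, since Section 7.5 is informative.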
>B: NFKC
>[[
>Although there may be exceptions, newly created resource names should
>generally be in NFKC [UTR15] (which means that they are also in NFC).
>]]
>
>For B, my code does an initial pass over the characters in each component,
>looking for problematic characters, e.g. "--" in the host or "/./" in the
>path. If it finds such problematic characters it may trigger more expensive
>processing (e.g. IDNA syntax checking). What are the characters I should be
>looking for in the component? I.e., please suggest a set of characters
>such that if none of these characters is in the IRI then it is necessarily
>in NFKC. An example would be the set [^\x20-\x7F], which would at least
>allow me to avoid NFKC checking for URIs. Again I am expecting an answer in
>terms of some table from unicode.org, e.g. if each character is neither a
>compatibility character nor a composing character then the component is in NFKC.
Bjoern already pointed to
http://lists.w3.org/Archives/Public/www-international/2006JanMar/0002.html.
Why do you want to avoid NFKC checking? Ideally, you would use an NFKC
checking implementation that did a quick first pass internally, wouldn't you?
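To make the "quick first pass" idea concrete, here is a minimal Java sketch. It assumes `java.text.Normalizer` is available (Java 6 or later); the class name `NfkcCheck` and its methods are hypothetical. The ASCII shortcut relies on the fact that ASCII-only strings are unchanged by every Unicode normalization form. Whether the library call already performs an internal quick check is an implementation detail not verified here.

```java
import java.text.Normalizer;

// Hypothetical helper: not part of Jena or of any specification.
final class NfkcCheck {
    // Cheap first pass: ASCII-only strings are always in NFKC.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Falls back to the full library check only when a non-ASCII
    // character is present in the component.
    static boolean isNfkc(String component) {
        if (isAscii(component)) {
            return true;
        }
        return Normalizer.isNormalized(component, Normalizer.Form.NFKC);
    }
}
```

For example, `NfkcCheck.isNfkc("\uFB01le")` is false because U+FB01 (the "fi" ligature) is a compatibility character, while a plain URI such as `"example/path"` never reaches the normalizer.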
>Given the weak language in both these assertions, violations would by
>default produce warnings.
I think getting your API right in this respect is probably the most
challenging part. Unlike end-user software, you have to rely on the
API user to do the right thing with the warning. That takes quite
a bit of knowledge; without it, the API user may make wrong decisions
(e.g. just turning all warnings into errors, or just ignoring all
warnings) that will hurt users.
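One possible shape for such an API, given only as a hedged sketch: every name below (`IriCheckResult`, `Severity`, `Violation`) is hypothetical and does not describe Jena's actual interface. The idea is to keep warnings and errors distinct and to attach the rule that was violated, so that the API user can decide per rule instead of treating all warnings alike.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result type: collects violations instead of throwing.
final class IriCheckResult {
    enum Severity { WARNING, ERROR }

    static final class Violation {
        final Severity severity;
        final String rule;     // e.g. "RFC 3987 7.5: NFKC"
        final String message;

        Violation(Severity severity, String rule, String message) {
            this.severity = severity;
            this.rule = rule;
            this.message = message;
        }
    }

    private final List<Violation> violations = new ArrayList<>();

    void add(Severity severity, String rule, String message) {
        violations.add(new Violation(severity, rule, message));
    }

    // True only if at least one violation is a hard error.
    boolean hasErrors() {
        for (Violation v : violations) {
            if (v.severity == Severity.ERROR) {
                return true;
            }
        }
        return false;
    }

    // Warnings are returned separately so callers must handle them explicitly.
    List<Violation> warnings() {
        List<Violation> result = new ArrayList<>();
        for (Violation v : violations) {
            if (v.severity == Severity.WARNING) {
                result.add(v);
            }
        }
        return result;
    }
}
```

Under this design, violations of the informative guidance in Section 7.5 would be recorded as `WARNING`, and the caller decides whether to show them, log them, or ignore them for a given rule.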
>(I suspect I will send further messages about bidi)
Please do! Regards, Martin.
Received on Thursday, 5 January 2006 04:53:14 UTC