- From: Jeremy Carroll <jjc@hpl.hp.com>
- Date: Tue, 10 Jan 2006 13:58:40 -0500
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: "www-international@w3.org" <www-international@w3.org>, public-iri@w3.org
Martin Duerst wrote:
> Please note that Section 7 as a whole is entitled
> URI/IRI Processing Guidelines (informative), and although it contains words
> such as 'should' and 'recommended', these are not upper-cased, and therefore
> not RFC 2119. For the next version of the spec, we may want to think about
> changing the language so that these words aren't used anymore.

I think in both RFC 3987 and RFC 3986 there are three effective levels of instruction: MUST force, SHOULD force, and something I am calling minting force (i.e. applied on generation, not on receipt). There is also a separate set of issues that are security related.

There is a further issue as to whether DNS-style names are expected or not. (As I delve into it, this seems largely scheme driven rather than anything else. Most of the common schemes seem either not to use the generic syntax (mailto, urn) or to explicitly use the productions hostport or host, and hence commit to DNS syntax for names, and hence to IDNA for international names, since percent encoding is prohibited.)

In the current sketch API (see http://jena.sourceforge.net/tmp/javadoc), the first step is to configure an IRIFactory to treat the various forces as errors or warnings depending on the application.

Thanks for the other pointers ... a further comment:

> Why do you want to avoid NFKC checking? Ideally, you would use an NFKC
> checking implementation that did a quick first pass internally, wouldn't
> you?

I want the cost of checking a typical IRI to be minimal. Most will be plain ASCII. In a typical application using Jena there are many thousands of IRIs, many of which we don't check at the moment. If we switch IRI checking on, we don't want a noticeable slowdown. The question is how few passes over the string can achieve the goal. I currently have quite a few:

1) split the string according to the non-validating regex from the appendices of RFC 3986
2) parse each component according to an error-checking grammar based on RFC 3986 (implemented as an FSM, in one pass over the string)
3) if an illegal-character error occurred (where a percent-encoded character would have been acceptable), do another pass looking at the actual non-ASCII characters used

> I think getting your API right in this respect is probably the most
> challenging. Contrary to end-user software, you have to rely on the
> API user to do the right thing with the warning. This requires quite
> a bit of knowledge to not make wrong decisions (e.g. just turning
> all warnings into errors, or just ignoring all warnings)
> that will hurt users.

Hmmmm, the API design is hard. I try to make the error/warning distinction configurable according to the force and intent behind the violated text.

Jeremy
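
The pass structure and the configurable error/warning policy described in the message could be sketched roughly as below. This is only an illustration in Python, not the Jena API: the names `Force`, `Severity`, `DEFAULT_POLICY`, `split_components` and `check_iri` are hypothetical, the per-component check is a simple character scan standing in for the FSM-based grammar, and the assignment of each check to a force is an assumption made for the example. The only part taken directly from a specification is the splitting regex, which is the one given in RFC 3986, Appendix B.

```python
import re
import string
import unicodedata
from enum import Enum

# Non-validating splitter from RFC 3986, Appendix B.
SPLIT_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

class Force(Enum):        # hypothetical names for the three levels in the message
    MUST = 1
    SHOULD = 2            # listed for completeness; no SHOULD check in this sketch
    MINTING = 3

class Severity(Enum):
    IGNORE = 0
    WARNING = 1
    ERROR = 2

# Assumed default: the application decides how each force is reported.
DEFAULT_POLICY = {
    Force.MUST: Severity.ERROR,
    Force.SHOULD: Severity.WARNING,
    Force.MINTING: Severity.IGNORE,
}

# ASCII characters that may appear somewhere in a URI (unreserved, reserved, "%").
ALLOWED_ASCII = set(string.ascii_letters + string.digits + "-._~:/?#[]@!$&'()*+,;=%")

def split_components(iri):
    """Pass 1: split into the five generic components without validating."""
    m = SPLIT_RE.match(iri)  # every group is optional, so this always matches
    return {"scheme": m.group(2), "authority": m.group(4), "path": m.group(5),
            "query": m.group(7), "fragment": m.group(9)}

def check_iri(iri, policy=DEFAULT_POLICY):
    """Return a list of (severity, message) pairs for one IRI."""
    findings = []

    def report(force, message):
        severity = policy[force]
        if severity is not Severity.IGNORE:
            findings.append((severity, message))

    # Pass 2 (simplified): scan each component for characters outside the
    # ASCII URI repertoire. A real checker would run a per-component grammar.
    parts = split_components(iri)
    suspicious = any(value and any(ch not in ALLOWED_ASCII for ch in value)
                     for value in parts.values())
    if not suspicious:
        return findings  # cheap exit for the common plain-ASCII case

    # Pass 3: only reached when pass 2 saw something unusual.
    for ch in iri:
        if ord(ch) < 128 and ch not in ALLOWED_ASCII:
            report(Force.MUST, f"illegal character {ch!r}")
    if unicodedata.normalize("NFC", iri) != iri:
        report(Force.MINTING, "not in NFC")  # force assignment is an assumption
    return findings
```

With the default policy, a plain-ASCII IRI returns after the second pass with no findings, so the more expensive normalization check only runs for the minority of IRIs that contain unusual characters. Passing a stricter policy (for example, mapping `Force.MINTING` to `Severity.WARNING`) changes what is reported without changing the checks themselves.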
Received on Tuesday, 10 January 2006 18:58:51 UTC