Re: question about IRI spec from Jeremy Carroll on 2006-01-10 (public-iri@w3.org from January 2006)

From: Jeremy Carroll <jjc@hpl.hp.com>
Date: Tue, 10 Jan 2006 13:58:40 -0500
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: "www-international@w3.org" <www-international@w3.org>, public-iri@w3.org
Message-ID: <43C3CF3E.7090209@hpl.hp.com>

Martin Duerst wrote:

> Please note that Section 7 as a whole is entitled
> URI/IRI Processing Guidelines (informative), and although it contains words
> such as 'should' and 'recommended', these are not upper-cased, and therefore
> not RFC 2119. For the next version of the spec, we may want to think about
> changing the language so that these words aren't used anymore.

I think in both RFC 3987 and RFC 3986 there are three effective levels 
of instruction: MUST force, SHOULD force, and something I am calling 
minting force (i.e. applied on generation not receipt).

There is also a separate set of issues that are security related.

There is a further issue as to whether DNS style names are expected or
not. (As I delve into it, this seems largely scheme driven rather than
anything else. Most of the common schemes seem either not to use the
generic syntax (mailto, urn) or to explicitly use the productions hostport
or host, and hence commit to DNS syntax for names, and hence to IDNA for
international names, since percent encoding is prohibited.)

The current sketch of the API is online
(see http://jena.sourceforge.net/tmp/javadoc).

In it, the first step is to configure an IRIFactory to treat the various 
forces as errors or warnings depending on the application.
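
As an illustration only, the kind of configuration described here might look 
like the sketch below. All names in it (ForceSketch, Force, Severity, Policy, 
treat, severityOf) are hypothetical and are not taken from the javadoc above; 
the default of MUST as error and everything else as warning is also an 
assumption.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Jena javadoc.
class ForceSketch {
    // The three levels of instruction named in this message.
    enum Force { MUST, SHOULD, MINTING }

    enum Severity { IGNORE, WARNING, ERROR }

    // Maps each force to the severity an application wants reported.
    static final class Policy {
        private final Map<Force, Severity> levels = new EnumMap<>(Force.class);

        Policy() {
            // Assumed default: MUST violations are errors, the rest warnings.
            for (Force f : Force.values()) {
                levels.put(f, Severity.WARNING);
            }
            levels.put(Force.MUST, Severity.ERROR);
        }

        // Returns this so calls can be chained.
        Policy treat(Force force, Severity severity) {
            levels.put(force, severity);
            return this;
        }

        Severity severityOf(Force force) {
            return levels.get(force);
        }
    }

    public static void main(String[] args) {
        // An application that generates IRIs might be strict about minting.
        Policy generating = new Policy().treat(Force.MINTING, Severity.ERROR);
        // An application that only receives IRIs might not care at all.
        Policy receiving = new Policy().treat(Force.MINTING, Severity.IGNORE);
        System.out.println(generating.severityOf(Force.MINTING));
        System.out.println(receiving.severityOf(Force.MINTING));
    }
}
```

The point of the design is that the same violation can be reported 
differently by two applications without changing the checker itself.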

Thanks for the other pointers ... a further comment:
> Why do you want to avoid NFKC checking? Ideally, you would use an NFKC
> checking implementation that did a quick first pass internally, wouldn't you?

I want the cost of checking a typical IRI to be minimal.
Most will be plain ASCII.

In a typical application using Jena there are many, many thousands of
IRIs, many of which we don't check at the moment. If we switch IRI
checking on, we don't want to have a noticeable slowdown.

The question is how few passes over the string can achieve the goal.

I currently have quite a few:

1) split the string according to the non-validating regex from the 
appendices of RFC 3986

2) parse each component according to an error-checking grammar based on
RFC 3986. (Implemented as an FSM, so one pass over the string.)

3) If an illegal-character error occurred (where a percent-encoded
character would have been acceptable), then do another pass looking at
the actual non-ASCII characters used
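
Steps 1 and 3 above can be sketched roughly as follows. The regular 
expression is the one given in Appendix B of RFC 3986; the class and method 
names (IriPassSketch, split, nonAsciiCodePoints) are made up for this 
illustration, the DOTALL flag is an addition so that the match cannot fail 
on line terminators, and the grammar FSM of step 2 is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of passes 1 and 3; pass 2 (the grammar FSM) is omitted.
class IriPassSketch {
    // Non-validating regex from RFC 3986, Appendix B.
    // DOTALL is an addition here so that '.' also matches line terminators.
    static final Pattern SPLIT = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?",
        Pattern.DOTALL);

    // Pass 1: returns {scheme, authority, path, query, fragment}.
    // Components that are absent come back as null; the path may be empty.
    static String[] split(String iri) {
        Matcher m = SPLIT.matcher(iri);
        if (!m.matches()) {
            throw new IllegalArgumentException("cannot split: " + iri);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    // Pass 3: only run after an illegal-character error was reported.
    // Collects the non-ASCII code points so they can be examined further.
    static List<Integer> nonAsciiCodePoints(String s) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp > 0x7F) {
                found.add(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs
        }
        return found;
    }

    public static void main(String[] args) {
        String[] parts = split("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        System.out.println("scheme    = " + parts[0]);
        System.out.println("authority = " + parts[1]);
        System.out.println("path      = " + parts[2]);
        System.out.println("query     = " + parts[3]);
        System.out.println("fragment  = " + parts[4]);
        System.out.println(nonAsciiCodePoints("http://example.org/r\u00e9sum\u00e9"));
    }
}
```

For a plain ASCII IRI the third pass never runs, which is what keeps the 
typical case cheap.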

> 
> I think getting your API right in this respect is probably the most
> challenging. Contrary to end-user software, you have to rely on the
> API user to do the right thing with the warning. This requires quite
> a bit of knowledge to not make wrong decisions (e.g. just turning
> all warnings into errors, or just ignoring all warnings)
> that will hurt users.

Hmmmm, the API design is hard. I try to make the error/warning 
distinction configurable according to the force and intent behind the 
violated text.

Jeremy

Received on Tuesday, 10 January 2006 18:58:51 UTC