Re: 2.1.3 IRI/URI Canonicalization does not address IRIs with IDNs

Thanks Martin, answers inline.

> At 17:14 08/04/10, Phil Archer wrote:
>
>>Our basic need is that we must be able to be certain whether a given IRI
>> does or does not match a small data set. Typically, something like
>
> A very, very basic question: Just for a moment assuming that a complete
> solution isn't possible, what's more of a problem for your application:
>
> False positives (IRIs/URIs match when they shouldn't) or
> false negatives (IRIs/URIs don't match when they should)?
>
> If you can't decide on one or the other, can you at least describe
> potential consequences in each case?

That's a very interesting question that, as far as I know, we've not
considered, believing (naively) that a complete solution was possible. If
that is not the case then it actually makes life easier. Unless someone
can think of a counter-argument, I think we'd always want to err on the
side of caution, i.e. false negatives are always preferable to false
positives.

POWDER is about describing lots of resources at once: "everything on
example.org is red and square" is our generic example. If I make such an
assertion, it's better that there are some cases where my claim that
things are red and square is not recognised than that my claim be applied
to resources I may know nothing about.

Also, in my conversation with Eric P the other day I was a little
concerned by his saying that the kind of canonicalisation you carry out
really depends on where you are in the chain: UI level, network level, etc.

Now... if we can _legitimately_ say that there are circumstances where
canonicalisation is not possible, that allows us to change the
tenor of the text to say that applications should make a _best effort_ to
canonicalise and then give a series of possible steps to take. The steps
that are concrete, OK, do them; the ones that are less prescriptive may lead
to a false negative or positive, and POWDER publishers should be aware of
this and create data accordingly.
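
To make the "best effort, but prefer a false negative" idea concrete, here
is a minimal Python sketch. It is only an illustration under my own
assumptions: the names canonical_host and host_matches are hypothetical,
the listed hosts are assumed to be in ASCII form already, and the
"same host or a subdomain of it" rule is my simplification, not text from
the POWDER draft.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def canonical_host(iri: str) -> Optional[str]:
    """Best-effort host canonicalisation (hypothetical helper).

    Returns the lower-cased ASCII form of the host, or None when the
    host cannot be canonicalised with confidence.
    """
    try:
        host = urlsplit(iri).hostname  # urlsplit lower-cases the host
        if not host:
            return None
        # Python's built-in codec applies IDNA 2003 (ToASCII) label by label.
        return host.encode("idna").decode("ascii")
    except (ValueError, UnicodeError):
        # Malformed authority, empty label, over-long label, and so on.
        return None


def host_matches(iri: str, include_hosts: Iterable[str]) -> bool:
    """True only if the IRI's host is certainly covered by include_hosts."""
    host = canonical_host(iri)
    if host is None:
        return False  # cannot canonicalise: accept a false negative
    # Assumption: a listed host covers itself and its subdomains.
    return any(host == h or host.endswith("." + h) for h in include_hosts)


if __name__ == "__main__":
    print(host_matches("http://www.example.org/page", ["example.org"]))
    print(host_matches("http://a..example.org/", ["example.org"]))
```

The point of the design is that every failure path returns "no match", so
an IRI we cannot canonicalise never picks up a description by accident.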

This, for example:

<iriset>
  <includehosts>xn--exmpless-jua.org exåmpless.org</includehosts>
</iriset>

means anything on exåmpless.org OR xn--exmpless-jua.org. If I understand
correctly, this might lead to a false positive since we can't be sure
whether the double s is just that or an Eszett. If I'm right then we'd
caution _against_ doing this and say just quote exåmpless.org (and make
sure that the XML file really is in UTF-8 and served with the correct
HTTP headers and so on).
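
As a rough illustration of the Eszett worry, the sketch below uses Python's
built-in "idna" codec, which follows IDNA 2003 (RFC 3490); other processors
may behave differently. The helper name to_ascii_host and the
straße/strasse host names are my own examples, not taken from the POWDER
documents.

```python
def to_ascii_host(host: str) -> str:
    """Return the ASCII form of a host name under IDNA 2003 rules."""
    return host.encode("idna").decode("ascii")


# Nameprep, the preparation step of IDNA 2003, maps the Eszett (U+00DF)
# to "ss" before the label is encoded.
with_eszett = to_ascii_host("stra\u00dfe.example")
with_double_s = to_ascii_host("strasse.example")

if __name__ == "__main__":
    # Once both are in ASCII form, the two spellings are indistinguishable.
    print(with_eszett == with_double_s)
```

So a matcher that only ever sees the ASCII form has no way of telling
whether the original IRI had a double s or an Eszett.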

>
> Also, do you describe/talk about/work with actual retrievable
> resources, or also others?

Hmmm... this has been a tricky one. In the end we actually talk about IRI
sets and say that the descriptions may be applied to all resources that
are dereferenced from any IRI that is a member of the set. We say that we
don't limit what kind of IRI is used but we make it easy to use http-style
ones which is what our use cases are about.

Phil.

Received on Thursday, 10 April 2008 13:01:17 UTC