Re: IRI guidance

Eric Prud'hommeaux wrote:
> * Alex Hall <alexhall@revelytix.com> [2011-04-28 16:16-0400]
>> On Wed, Apr 27, 2011 at 3:52 PM, Nathan <nathan@webr3.org> wrote:
>>
>>> just noticed a nice bit of text in the activity streams spec:
>>>
>>> [[
>>> This specification allows the use of IRIs [RFC3987]. Every URI [RFC3986] is
>>> also an IRI, so a URI MAY be used wherever an IRI is named. When an IRI that
>>> is not also a URI is given for dereferencing, it MUST be mapped to a URI
>>> using the steps in Section 3.1 of [RFC3987]. When an IRI is serving as an
>>> identifier, it MUST NOT be so mapped.
>>> ]]
>>>
>>>
>> This corresponds nicely with how I think IRIs should work in RDF.  When used
>> as an identifier, an IRI is simply a sequence of Unicode characters.  That
>> character sequence conforms with the grammar defined in RFC3987, but many
>> applications don't care about that; they're only interested in knowing that
>> an RDF term is an IRI, and whether two RDF terms which are IRIs are the
>> same.  A simple string comparison of the Unicode characters should be
>> sufficient to determine equivalence of resources identified by IRIs.
>>
>> If an IRI happens to be dereferenceable, and an application chooses to
>> dereference it, then they map it as a URI.  If, as part of this mapping, the
>> application encodes an IDN and finds that the encoded URI is the same as
>> another resource IRI, then it might conclude that those identify the same
>> resource.  But it should do so with the understanding that this is an
>> extension of the semantics of RDF, assuming we define IRI equivalence as a
>> simple string comparison.
>>
>> Unfortunately this can lead to unexpected consequences, such as an
>> application dereferencing the IRI http://xn--rsum-bpad.example.org (not sure
>> how GMail will escape that -- that's the punycode version) and getting a
>> document with a description of some resource with IRI
>> http://résumé.example.org (Unicode
>> version).  To help prevent this, we could discourage the use of IRIs with
>> encoded IDNs in RDF, similar to how the existing spec discourages the use of
>> URI Refs with percent-escaped characters.
> 
> I think this leads down the path of not using IRIs. When dereferencing
> an HTTP IRI, one has to punyify the domain name and percentulate the
> path, mapping http://伝言.example/?user=أكرم to
> http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85 . Any IRI
> with characters outside of the legal URI characters will map to a
> differently spelled URI, necessitating some typing of these respective
> strings. If we're taking away the sharp knives, we'll have to take
> away non-ascii characters and díäcrìtïcâl markç.

I wonder if it's safe to treat dereferencing as a black box that is
none of our concern?

Personally I have to confess that I'd much rather encounter
http://伝言.example/?user=أكرم in some RDF than
http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85
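
To make the two views concrete, here is a rough Python sketch of the
distinction discussed above: plain string comparison when the IRI is an
identifier, and mapping to a URI only when it is dereferenced. It is not
a complete implementation of RFC 3987 Section 3.1. The function names
iri_to_uri and same_identifier are made up for illustration, it assumes
an http-style IRI with no userinfo, it skips any Unicode normalization
step, and it relies on Python's built-in "idna" codec (which implements
IDNA 2003).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding: URI reserved characters,
# plus "%" so existing escapes are not double-encoded (an assumption of
# this sketch, not something the spec text above spells out).
_SAFE = "%/:@!$&'()*+,;=?"


def same_identifier(a: str, b: str) -> bool:
    """Identifier view: compare the Unicode strings as-is, with no mapping."""
    return a == b


def iri_to_uri(iri: str) -> str:
    """Dereferencing view: punycode the host, percent-encode the rest as UTF-8."""
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        host,
        quote(parts.path, safe=_SAFE),
        quote(parts.query, safe=_SAFE),
        quote(parts.fragment, safe=_SAFE),
    ))


iri = "http://伝言.example/?user=أكرم"
uri = "http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85"

print(same_identifier(iri, uri))  # False: different strings, different RDF terms
print(iri_to_uri(iri) == uri)     # True: same target once mapped for dereferencing
```

With something like this, the mapping stays inside whatever does the
dereferencing, and the RDF itself only ever sees the Unicode spelling.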

best, nathan

Received on Thursday, 28 April 2011 22:10:01 UTC