Re: Absolute IRIs (Was: Re: IRI guidance)

Eric Prud'hommeaux wrote:
> * Alex Hall <alexhall@revelytix.com> [2011-04-29 09:42-0400]
>> On Fri, Apr 29, 2011 at 8:14 AM, Eric Prud'hommeaux <eric@w3.org> wrote:
>>
>>> * Ivan Herman <ivan@w3.org> [2011-04-29 08:24+0200]
>>>> On Apr 28, 2011, at 23:59 , Eric Prud'hommeaux wrote:
>>>> <snip/>
>>>>>> Unfortunately this can lead to unexpected consequences, such as an
>>>>>> application dereferencing the IRI http://xn--rsum-bpad.example.org
>>>>>> (not sure how GMail will escape that -- that's the punycode version)
>>>>>> and getting a document with a description of some resource with IRI
>>>>>> http://résumé.example.org (Unicode version).  To help prevent this,
>>>>>> we could discourage the use of IRIs with encoded IDNs in RDF, similar
>>>>>> to how the existing spec discourages the use of URI Refs with
>>>>>> percent-escaped characters.
>>>>> I think this leads down the path of not using IRIs. When dereferencing
>>>>> an HTTP IRI, one has to punyify the domain name and percentulate the
>>>>> path, mapping http://伝言.example/?user=أكرم to
>>>>> http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85 . Any IRI
>>>>> with characters outside of the legal URI characters will map to a
>>>>> differently spelled URI, necessitating some typing of these respective
>>>>> strings. If we're taking away the sharp knives, we'll have to take
>>>>> away non-ascii characters and díäcrìtïcâl markç.
>>>> Eric, I am not sure I understand that. The proposal is to say that, in
>>>> RDF, there should be a preference for the UTF version of the URI-s, ie, I
>>>> should, if possible, opt for http://伝言.example/?user=أكرم rather than
>>>> the other version. What happens underneath if I dereference that URI
>>>> and send it to tools for an HTTP get or anything similar is a separate
>>>> issue. Indeed, on an English keyboard typing something even as simple
>>>> as http://iván.herman.net is a pain for a user, but that is a practical
>>>> problem which is again outside the realm of RDF.
>>>
>>> Ahh, I interpreted "discourage … encoded IDNs" as discouraging
>>> UTF-8-encoded IRIs while the intent was discouraging punycode-encoded.
>>> Sorry.
>>>
>>>
>> No worries -- "encoded" is too vague a term, I should have been more
>> specific.
>>
>>
>>>> Ie: saying that we keep to the current version of RDF, ie, equality of
>>>> IRI-s is based on a character-by-character comparison (like now) but
>>>> giving advice to, if possible, use the IRI without the punycode seems
>>>> to be a reasonable way of handling this... What else would you propose
>>>> instead?
>>>
>>> I'm all for character-by-character comparison. I think the emphasis should
>>> be on keeping track of the type. Here's a draft of a minimal change to the
>>> Concepts document:
>>> [[
>>> 6.2 RDF Graph
>>> An RDF triple contains three components:
>>>
>>>    * the subject, which is an IRI or a blank node
>>>    * the predicate, which is an IRI
>>>    * the object, which is an IRI, a literal or a blank node
>>> …
>>> 6.4 IRI
>>>
>>> An IRI within an RDF graph (an RDF URI reference) is a Unicode string
>                                ^^^^^^^^^^^^^^^^^^^^^^
>>> [UNICODE] that conforms to the definition of an IRI in RFC3987 [IRI].
>>> Implementations may issue warnings concerning the use of RDF terms
>>> designated to be IRIs but which are not conformant to the IRI
>>> definition.
>>>
>> I wonder if it's too confusing to mention IRI and RDF URI reference in the
>> same breath, in the very first sentence no less?  I'd prefer to keep URIs
>> out of the discussion as much as possible.
> 
> oops, pasto. intended just "An IRI within an RDF graph is a Unicode
> string".
> 
> 
>>> Note: RFC3987 Section 3.1. "Mapping of IRIs to URIs" specifies the
>>> mapping to URIs, which must be done, for instance, when constructing
>>> an HTTP GET request. This specification does not define a relationship
>>> between an IRI and the URI to which it is mapped.
>>>
>>> Note: RFC3987 Section 5.3.1. "Simple String Comparison" specifies
>>> equivalence for IRIs used as identity tokens, as they are in RDF
>>> graphs.
>>>
>>> Note: IRIs are compatible with the anyURI datatype as defined by XML
>>> schema datatypes [XML-SCHEMA2], constrained to be an absolute rather
>>> than a relative URI reference.
>>>
>>> Note: IRIs are compatible with International Resource Identifiers as
>>> defined by [XML Namespaces 1.1].
>>>
>>> Note: The restriction to absolute IRIs is found in this abstract
>>> syntax. When there is a well-defined base, concrete syntaxes, such as
>>> RDF/XML, may permit relative IRIs as a shorthand for such absolute IRIs.
>>> ]]
>>>
>> I think this part could use some clarification.  An IRI is, by definition,
>> absolute per section 2.2 of RFC3987.  IRI references may be absolute or
>> relative, but resolve to an absolute IRI (as described in section 1.3).
>>
>> To muddy the waters even further, the "absolute-IRI" grammar construct in
>> section 2.2 omits the fragment identifier, but I cannot find any references
>> to this either internal or external to the RFC.
>>
>> So I think we should (a) specifically call out the definition in section
>> 2.2; and (b) avoid any mention of the terms "IRI reference" or "absolute
>> IRI" except in an informative context.
> 
> I'm not personally keen on this absolute IRI restriction. I included
> it in this proposal in order to minimize the permutations being
> examined at once ("minimal change"). For usability, I find
>   Data:
>     <s> <p> <o> .
>   Query:
>     ASK { ?s <p> ?o }
> 
> very intuitive when you don't have to specifically call out a base
> URI. Using IRI references instead of IRIs would permit the above query
> to work in e.g. Jena (which currently presumes absolute IRIs).

Ahh my favourite topic, it's "IRI" that we need (not absolute-IRI, since 
that production allows no fragment).

   IRI           = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]

So we just say the value space is "IRI", and the lexical space can be 
"IRI-reference" (when coupled to a known base via serialization or a 
base pre-known to the environment you're currently working in).
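
A rough sketch of that split, in Python, in case it helps. The function
names and the base http://example.org/data/ are made up for illustration,
and urllib.parse.urljoin is used only as an approximation of reference
resolution (it targets URIs and is not an RFC3987 implementation):

```python
from urllib.parse import urljoin

# Hypothetical base, standing in for whatever the serialization or the
# environment supplies.
BASE = "http://example.org/data/"


def to_value_space(iri_reference: str, base: str = BASE) -> str:
    """Resolve an IRI-reference (lexical space) to an IRI (value space)."""
    return urljoin(base, iri_reference)


def same_iri(a: str, b: str) -> bool:
    """Simple string comparison: character by character, no normalization."""
    return a == b


# Data "<s> <p> <o> ." and query "ASK { ?s <p> ?o }" sharing one base:
p_in_data = to_value_space("p")
p_in_query = to_value_space("p")
print(p_in_data, same_iri(p_in_data, p_in_query))

# A fragment survives resolution, which is why the value space is "IRI"
# and not "absolute-IRI".
with_fragment = to_value_space("#me")

# The Unicode and punycode spellings stay distinct under this comparison.
unicode_form = to_value_space("http://résumé.example.org/")
punycode_form = to_value_space("http://xn--rsum-bpad.example.org/")
```

With a shared base the two <p> terms resolve to the same string and
match; the Unicode and punycode spellings of the same host do not.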

Best,

Nathan

Received on Friday, 29 April 2011 17:58:30 UTC