Re: IRI guidance

* Ivan Herman <ivan@w3.org> [2011-04-29 08:24+0200]
> 
> On Apr 28, 2011, at 23:59 , Eric Prud'hommeaux wrote:
> <snip/>
> >> 
> >> Unfortunately this can lead to unexpected consequences, such as an
> >> application dereferencing the IRI http://xn--rsum-bpad.example.org (not sure
> >> how GMail will escape that -- that's the punycode version) and getting a
> >> document with a description of some resource with IRI
> >> http://résumé.example.org <http://xn--rsum-bpad.example.org> (Unicode
> >> version).  To help prevent this, we could discourage the use of IRIs with
> >> encoded IDNs in RDF, similar to how the existing spec discourages the use of
> >> URI Refs with percent-escaped characters.
> > 
> > I think this leads down the path of not using IRIs. When dereferencing
> > an HTTP IRI, one has to punyify the domain name and percentulate the
> > path, mapping http://伝言.example/?user=أكرم to
> > http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85 . Any IRI
> > with characters outside of the legal URI characters will map to a
> > differently spelled URI, necessitating some typing of these respective
> > strings. If we're taking away the sharp knives, we'll have to take
> > away non-ascii characters and díäcrìtïcâl markç.
> 
> Eric, I am not sure I understand that. The proposal is to say that, in RDF, there should be a preference for the UTF version of the URI-s, ie, I should, if possible, opt for http://伝言.example/?user=أكرم rather than the other version. What happens underneath if I dereference that URI and send it to tools for an HTTP get or anything similar is a separate issue. Indeed, on an English keyboard typing something even as simple as http://iván.herman.net is a pain for a user, but that is a practical problem which is again outside the realm of RDF.

Ahh, I interpreted "discourage … encoded IDNs" as discouraging UTF-8-encoded IRIs, while the intent was to discourage punycode-encoded ones. Sorry.


> Ie: saying that we keep to the current version of RDF, ie, equality of IRI-s is based on a character-by-character comparison (like now) but giving an advice to, if possible, use the IRI without the punycode seems to be a reasonable way of handling this... What else would you propose instead?

I'm all for character-by-character comparison. I think the emphasis should be on keeping track of the type. Here's a draft of a minimal change to the Concepts document:
[[
6.2 RDF Graph
An RDF triple contains three components:

    * the subject, which is an IRI or a blank node 
    * the predicate, which is an IRI
    * the object, which is an IRI, a literal or a blank node 
…
6.4 IRI

An IRI within an RDF graph (an RDF URI reference) is a Unicode string
[UNICODE] that conforms to the definition of an IRI in RFC3987 [IRI].
Implementations may issue warnings concerning the use of RDF terms
designated to be IRIs but which are not conformant to the IRI
definition.

Note: RFC3987 Section 3.1. "Mapping of IRIs to URIs" specifies the
mapping to URIs, which must be done, for instance, when constructing
an HTTP GET request. This specification does not define a relationship
between an IRI and the URI to which it is mapped.

Note: RFC3987 Section 5.3.1. "Simple String Comparison" specifies
equivalence for IRIs used as identity tokens, as they are in RDF
graphs.

Note: IRIs are compatible with the anyURI datatype as defined by XML
schema datatypes [XML-SCHEMA2], constrained to be an absolute rather
than a relative URI reference.

Note: IRIs are compatible with International Resource Identifiers as
defined by [XML Namespaces 1.1].

Note: The restriction to absolute IRIs is found in this abstract
syntax. When there is a well-defined base, concrete syntaxes, such as
RDF/XML, may permit relative IRIs as a shorthand for such absolute IRIs.
]]
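
To make "keeping track of the type" concrete, here is a minimal Python sketch, not a definitive implementation. The class names IRI and URI and the method to_uri are hypothetical, chosen only for illustration. It assumes Python's built-in "idna" codec (IDNA 2003) is an acceptable stand-in for the domain-name step, ignores userinfo in the authority, and simplifies the RFC 3987 section 3.1 mapping to "punycode the host, percent-escape non-ASCII characters as UTF-8, leave existing %-escapes alone".

```python
from dataclasses import dataclass
from urllib.parse import quote, urlsplit, urlunsplit

# ASCII punctuation left untouched when escaping. "%" is included so that
# escapes already present in the IRI (for example "%25") are not altered.
_SAFE = "/?:@!$&'()*+,;=%~-._"


@dataclass(frozen=True)
class URI:
    """An ASCII-only identifier, e.g. what goes on the wire in an HTTP GET."""
    value: str


@dataclass(frozen=True)
class IRI:
    """An identity token; equality is plain character-by-character comparison."""
    value: str

    def to_uri(self) -> URI:
        """Simplified IRI-to-URI mapping (an assumption, not the full RFC text)."""
        parts = urlsplit(self.value)
        netloc = parts.netloc
        if parts.hostname:
            # Domain name: Unicode labels become punycode ("xn--...") labels.
            netloc = parts.hostname.encode("idna").decode("ascii")
            if parts.port is not None:
                netloc += f":{parts.port}"
        # Path, query and fragment: non-ASCII characters become %-escaped UTF-8.
        return URI(urlunsplit((
            parts.scheme,
            netloc,
            quote(parts.path, safe=_SAFE),
            quote(parts.query, safe=_SAFE),
            quote(parts.fragment, safe=_SAFE),
        )))
```

With these types, IRI("http://résumé.example.org") and IRI("http://xn--rsum-bpad.example.org") compare unequal, while their to_uri() results compare equal. An IRI never compares equal to a URI, even when the two strings are spelled identically, which is the distinction the draft text tries to preserve.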

Note, I changed "RDF URI reference" to "IRI" instead of "RDF IRI" as I'm not convinced that an IRI which appears in an RDF document is of a different type than an IRI which appears in an email or in the location bar of my browser.

Here I propose saying that IRIs and the URIs they map to are simply different things, dropping the syntactic hint:
x [[
x Note: Because of the risk of confusion between RDF URI references that
x would be equivalent if dereferenced, the use of %-escaped characters in
x RDF URI references is strongly discouraged. See also the URI
x equivalence issue of the Technical Architecture Group [TAG].
x ]]

I agree with Alex that punycoded domain names and %-escaped characters should be mentioned in the same breath. From a human-engineering perspective, I think any text specifying syntactic hints to help observers visually discriminate them discourages programmers from being conscientious about the distinction. However, if we want to encourage the world to mint IRIs which we can procedurally calculate from URIs (motivated perhaps by associating HTTP traffic with assertions about resources), we could add some text encouraging an unambiguous transformation:

[[
Note: RFC3987's mapping of IRIs to URIs does not alter "%25" or
punycoded domain names, which means that the IRIs
<http://伝言.example/R%25D> and <http://xn--9oqp94l.example/R%25D> will
both be transformed to the URI <http://xn--9oqp94l.example/R%25D>.
RFC3987 section 3.2. "Converting URIs to IRIs" defines a function
which produces a single IRI for any URI. When minting IRIs for RDF,
it is encouraged to mint forms which can round trip to a URI form
and back.
]]
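
As a rough illustration of what "round trip to a URI form and back" could mean in practice, the following Python sketch maps an IRI to a URI, converts that URI back to an IRI, and checks whether the original spelling is recovered. All function names here are hypothetical. Both directions are simplified assumptions rather than the full RFC 3987 procedures: the host goes through Python's "idna" codec (IDNA 2003), userinfo is ignored, and the URI-to-IRI step decodes every %-escaped run that is valid non-ASCII UTF-8 without the extra exclusions the RFC lists.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# "%" is kept safe so existing escapes such as "%25" pass through unchanged.
_SAFE = "/?:@!$&'()*+,;=%~-._"
# One or more consecutive %-escapes whose byte value is 0x80 or above.
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _rebuild(text, host_fn, part_fn):
    """Apply host_fn to the host and part_fn to path, query and fragment."""
    parts = urlsplit(text)
    netloc = parts.netloc
    if parts.hostname:
        netloc = host_fn(parts.hostname)
        if parts.port is not None:
            netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, part_fn(parts.path),
                       part_fn(parts.query), part_fn(parts.fragment)))


def iri_to_uri(iri):
    """Punycode the host and %-escape non-ASCII characters as UTF-8."""
    return _rebuild(iri,
                    lambda host: host.encode("idna").decode("ascii"),
                    lambda part: quote(part, safe=_SAFE))


def _unescape_utf8(part):
    """Decode escaped runs that form valid UTF-8; leave anything else as-is."""
    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return match.group(0)
    return _NON_ASCII_ESCAPES.sub(decode_run, part)


def uri_to_iri(uri):
    """Turn punycode labels back into Unicode and unescape UTF-8 sequences."""
    return _rebuild(uri,
                    lambda host: host.encode("ascii").decode("idna"),
                    _unescape_utf8)


def round_trips(iri):
    """True if IRI -> URI -> IRI gives back exactly the same string."""
    return uri_to_iri(iri_to_uri(iri)) == iri
```

Under these assumptions, round_trips("http://伝言.example/R%25D") is true, while the punycoded spelling "http://xn--9oqp94l.example/R%25D" and an IRI that %-escapes a non-ASCII character (such as "?user=%D8%A3") both come back spelled differently, so they would not be the forms to mint.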


> Cheers
> 
> Ivan
> 
> 
> > 
> > 
> >> -Alex
> > 
> > -- 
> > -ericP
> > 

-- 
-ericP

Received on Friday, 29 April 2011 12:15:32 UTC