Re: IRI guidance from Ivan Herman on 2011-04-29 (public-rdf-wg@w3.org from April 2011)

From: Ivan Herman <ivan@w3.org>
Date: Fri, 29 Apr 2011 14:30:10 +0200
To: Eric Prud'hommeaux <eric@w3.org>
Cc: Alex Hall <alexhall@revelytix.com>, Nathan Rixham <nathan@webr3.org>, RDF WG <public-rdf-wg@w3.org>, Dan Brickley <danbri@danbri.org>
Message-Id: <62044C3A-0E0C-4B66-882E-469B2EC99BD7@w3.org>
CC'ing DanBri explicitly, because he will be the editor of the Concepts document...


On Apr 29, 2011, at 14:14 , Eric Prud'hommeaux wrote:

> * Ivan Herman <ivan@w3.org> [2011-04-29 08:24+0200]
>> 
>> On Apr 28, 2011, at 23:59 , Eric Prud'hommeaux wrote:
>> <snip/>
>>>> 
>>>> Unfortunately this can lead to unexpected consequences, such as an
>>>> application dereferencing the IRI http://xn--rsum-bpad.example.org (not sure
>>>> how GMail will escape that -- that's the punycode version) and getting a
>>>> document with a description of some resource with IRI
>>>> http://résumé.example.org <http://xn--rsum-bpad.example.org> (Unicode
>>>> version).  To help prevent this, we could discourage the use of IRIs with
>>>> encoded IDNs in RDF, similar to how the existing spec discourages the use of
>>>> URI Refs with percent-escaped characters.
>>> 
>>> I think this leads down the path of not using IRIs. When dereferencing
>>> an HTTP IRI, one has to punyify the domain name and percentulate the
>>> path, mapping http://伝言.example/?user=أكرم to
>>> http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85 . Any IRI
>>> with characters outside of the legal URI characters will map to a
>>> differently spelled URI, necessitating some typing of these respective
>>> strings. If we're taking away the sharp knives, we'll have to take
>>> away non-ascii characters and díäcrìtïcâl markç.
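
A minimal Python sketch of the mapping described above (punycode the host, percent-encode everything else), assuming the standard library's "idna" codec and urllib.parse are an acceptable approximation of the IRI-to-URI rules. The helper name iri_to_uri and the _SAFE character set are hypothetical choices for illustration, and userinfo handling is deliberately left out.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in path, query and fragment (an assumption made
# for this sketch). "%" is included so that existing escapes such as "%25"
# are not escaped a second time.
_SAFE = "/:@!$&'()*+,;=%?"


def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: map an HTTP IRI to the URI used on the wire."""
    parts = urlsplit(iri)
    # Punycode every non-ASCII label of the host name.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    # Simplification: keep the port, drop any userinfo.
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Percent-encode the UTF-8 bytes of characters not listed as safe.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://伝言.example/?user=أكرم"))
```

For the example IRI this should print the xn--9oqp94l.example URI quoted above: the two strings are spelled differently, which is why keeping track of which form one is holding matters.
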
>> 
>> Eric, I am not sure I understand that. The proposal is to say that, in RDF, there should be a preference for the UTF version of the URI-s, ie, I should, if possible, opt for http://伝言.example/?user=أكرم rather than the other version. What happens underneath if I dereference that URI and send it to tools for an HTTP GET or anything similar is a separate issue. Indeed, on an English keyboard typing something even as simple as http://iván.herman.net is a pain for a user, but that is a practical problem which is again outside the realm of RDF.
> 
> Ahh, I interpreted "discourage … encoded IDNs" as discouraging UTF-8-encoded IRIs while the intent was discouraging punycode-encoded. Sorry.
> 

:-)


> 
>> Ie: saying that we keep to the current version of RDF, ie, equality of IRI-s is based on a character-by-character comparison (like now), but advising people to use, if possible, the IRI without the punycode seems to be a reasonable way of handling this... What else would you propose instead?
> 
> I'm all for character-by-character comparison. I think the emphasis should be on keeping track of the type. Here's a draft of a minimal change to the Concepts document:
> [[
> 6.2 RDF Graph
> An RDF triple contains three components:
> 
>    * the subject, which is an IRI or a blank node 
>    * the predicate, which is an IRI
>    * the object, which is an IRI, a literal or a blank node 
> …
> 6.4 IRI
> 
> An IRI within an RDF graph (an RDF URI reference) is a Unicode string
> [UNICODE] that conforms to the definition of an IRI in RFC3987 [IRI].
> Implementations may issue warnings concerning the use of RDF terms
> designated to be IRIs but which are not conformant to the IRI
> definition.
> 
> Note: RFC3987 Section 3.1. "Mapping of IRIs to URIs" specifies the
> mapping to URIs, which must be done, for instance, when constructing
> an HTTP GET request. This specification does not define a relationship
> between an IRI and the URI to which it is mapped.
> 
> Note: RFC3987 Section 5.3.1. "Simple String Comparison" specifies
> equivalence for IRIs used as identity tokens, as they are in RDF
> graphs.
> 
> Note: IRIs are compatible with the anyURI datatype as defined by XML
> schema datatypes [XML-SCHEMA2], constrained to be an absolute rather
> than a relative URI reference.
> 
> Note: IRIs are compatible with International Resource Identifiers as
> defined by [XML Namespaces 1.1].
> 
> Note: The restriction to absolute IRIs is found in this abstract
> syntax. When there is a well-defined base, concrete syntaxes, such as
> RDF/XML, may permit relative IRIs as a shorthand for such absolute IRIs.
> ]]
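
To make the character-by-character comparison in the draft concrete, here is a small Python sketch. It assumes that plain string equality on code points is what "Simple String Comparison" amounts to for this purpose; the function name same_iri and the example IRIs are illustrative only.

```python
import unicodedata


def same_iri(a: str, b: str) -> bool:
    """Hypothetical helper: compare code point by code point.

    No normalization, no punycode decoding, no %-unescaping.
    """
    return a == b


unicode_form = "http://r\u00e9sum\u00e9.example.org/"
puny_form = "http://xn--rsum-bpad.example.org/"
# Same visible text, but with "e" followed by a combining acute accent.
decomposed = unicodedata.normalize("NFD", unicode_form)

print(same_iri(unicode_form, puny_form))   # False: different spellings
print(same_iri(unicode_form, decomposed))  # False: different code points

# The two host spellings are related only through an explicit conversion step.
host = "r\u00e9sum\u00e9.example.org".encode("idna")
print(host == b"xn--rsum-bpad.example.org")  # True
```

Under this reading, the punycode and Unicode spellings name two distinct nodes in a graph, even though dereferencing them reaches the same server.
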
> 
> Note, I changed "RDF URI reference" to "IRI" instead of "RDF IRI" as I'm not convinced that an IRI which appears in an RDF document is of a different type than an IRI which appears in an email or in the location bar of my browser.
> 
> Here I proposed saying that IRIs and their URIs are simply different things, eliding the syntactic hint
> x [[
> x Note: Because of the risk of confusion between RDF URI references that
> x would be equivalent if dereferenced, the use of %-escaped characters in
> x RDF URI references is strongly discouraged. See also the URI
> x equivalence issue of the Technical Architecture Group [TAG].
> x ]]
> 
> I agree with Alex that punycoded domain names and %-escaped characters should be mentioned in the same breath. From a human-engineering perspective, I think any text specifying syntactic hints to help observers visually discriminate them discourages programmers from being conscientious about the distinction. However, if we want to encourage the world to mint IRIs which we can procedurally calculate from URIs (motivated perhaps by associating HTTP traffic with assertions about resources), we could add some text encouraging an unambiguous transformation:
> 
> [[
> Note: RFC3987's mapping of IRIs to URIs does not alter "%25" or
> punycoded domain names, which means that the IRIs
> <http://伝言.example/R&D> and <http://xn--9oqp94l.example/R%25D> will
> both be transformed to the URI <http://xn--9oqp94l.example/R%25D>.
> RFC3987 section 3.2. "Converting URIs to IRIs" defines a function
> which produces a single IRI for any URI. When minting IRIs for RDF,
> it is encouraged to mint forms which can round trip to a URI form
> and back.
> ]]
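
A rough Python sketch of the round-trip idea, loosely modelled on the URI-to-IRI direction. It is a simplification under stated assumptions: only %-escapes for bytes 0x80 and above are decoded (so "%25" and other ASCII escapes stay as they are), the host is decoded with the standard "idna" codec, and userinfo is ignored. The name uri_to_iri is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Runs of %-escapes whose bytes are all >= 0x80, i.e. candidates for UTF-8.
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match: "re.Match[str]") -> str:
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the escapes exactly as they were.
        return match.group(0)


def uri_to_iri(uri: str) -> str:
    """Hypothetical helper: produce a single Unicode spelling for a URI."""
    parts = urlsplit(uri)
    host = parts.hostname.encode("ascii").decode("idna") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path, query, fragment = (
        _NON_ASCII_ESCAPES.sub(_decode_run, part)
        for part in (parts.path, parts.query, parts.fragment)
    )
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


uri = "http://xn--9oqp94l.example/?user=%D8%A3%D9%83%D8%B1%D9%85"
print(uri_to_iri(uri))
```

With this sketch, an IRI "round trips" when uri_to_iri applied to its URI gives back the same string. The Unicode spelling does; an IRI minted with the punycode host and %-escapes does not, because the conversion returns the Unicode spelling instead.
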

I think that the round-trip issue may not be clear (it is not 100% clear to me either :-). Why not add something like

'In other words, the use of %-escaped characters or punycode-encoded IDNs is strongly discouraged.'
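
A tool could turn that advice into a warning. The following Python sketch is only an illustration of such a check, not proposed spec text; the function name discouraged_forms and the wording of the findings are made up for the example.

```python
import re
from typing import List
from urllib.parse import urlsplit

_PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def discouraged_forms(iri: str) -> List[str]:
    """Hypothetical lint: list the discouraged spellings found in an IRI."""
    findings = []
    host = urlsplit(iri).hostname or ""
    # A punycode-encoded label always starts with the "xn--" prefix.
    if any(label.startswith("xn--") for label in host.split(".")):
        findings.append("punycode-encoded IDN")
    if _PERCENT_ESCAPE.search(iri):
        findings.append("%-escaped characters")
    return findings


print(discouraged_forms("http://xn--9oqp94l.example/R%25D"))
print(discouraged_forms("http://伝言.example/?user=أكرم"))
```

The first call reports both findings and the second reports none, which matches the intent of the sentence above: warn, but do not change how equality is computed.
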

Ivan



> 
> 
>> Cheers
>> 
>> Ivan
>> 
>> 
>>> 
>>> 
>>>> -Alex
>>> 
>>> -- 
>>> -ericP
>>> 
>> 
>> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>> 
>> 
>> 
>> 
>> 
> 
> -- 
> -ericP


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Friday, 29 April 2011 12:29:06 UTC