RE: URL/URI/IRI resolution in HTML5/RDF/RDFa from Larry Masinter on 2011-11-02 (www-tag@w3.org from November 2011)

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 2 Nov 2011 10:30:07 -0700
To: Jeni Tennison <jeni@jenitennison.com>, "www-tag@w3.org List" <www-tag@w3.org>
Message-ID: <C68CB012D9182D408CED7B884F441D4D0605EFB066@nambxv01a.corp.adobe.com>
" This behaviour by HTML5 was the subject of a long-running issue [4], which I believe Larry was involved in, which was eventually resolved to give the specification we see today."

The IRI/HTML5 issue is still being worked on actively.

" The RDFWAWG has opened rdfa-ISSUE-114 on this [10] which is on their agenda for discussion tomorrow. Do we have any advice for them?"

My advice is that they make their requirements for IRI standards clear and get them accepted by the IETF IRI working group. That group is chartered to develop a solution that meets the broader requirements, not only those of browsers (HTML5) but also those of other applications that need IRIs and have nothing to do with HTML beyond the possibility of copy/paste.

Larry

-----Original Message-----
From: Jeni Tennison [mailto:jeni@jenitennison.com] 
Sent: Wednesday, November 02, 2011 8:02 AM
To: www-tag@w3.org List
Subject: URL/URI/IRI resolution in HTML5/RDF/RDFa

Hi,

Something has come up in the discussions around microdata and RDFa that points to some wider issues around IRIs. I know that the TAG has discussed issues around IRI equivalence, IRIs and HTML and so on before, and I can't claim to have explored every angle here, so I wondered if anyone else had any opinions on it and what we should do.

The tl;dr version is that HTML5's rules around URL processing don't handle IRIs in the same way as RDFa's, which could make for weird results for users in corner cases right now and rather larger issues with non-URI IRIs in the future.

Fuller explanation follows.

BACKGROUND

HTML5 uses the term URL throughout. The definition that it uses for a valid URL [1] is:

  A URL is a valid URL if at least one of the following conditions holds:

    * The URL is a valid URI reference [RFC3986].

    * The URL is a valid IRI reference and it has no query component. [RFC3987]

    * The URL is a valid IRI reference and its query component contains no 
      unescaped non-ASCII characters. [RFC3987]

    * The URL is a valid IRI reference and the character encoding of the 
      URL's Document is UTF-8 or a UTF-16 encoding. [RFC3987]

This allows IRIs to appear within documents. The only restriction is on unescaped non-ASCII characters in the query component, which are valid only when the character encoding of the document in which the URL is found is UTF-8 or UTF-16.

DOM attributes whose values reflect HTML attributes whose values are URLs (such as @href, @src, @itemid and so on) are then resolved through the HTML5 resolution algorithm [2]. This turns all IRIs into URIs by percent-encoding characters that aren't allowed in URIs and performs resolution based on URI rules from RFC3986 [3]. It also performs a couple of other "non-standard" normalisations, such as changing "\" to "/". The results are always valid URIs.
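
To make the behaviour described above concrete, here is a minimal Python sketch. It is only an approximation under stated assumptions, not the HTML5 algorithm itself: the function name html5_style_resolve is hypothetical, the base is assumed to be a valid URI already, and non-ASCII characters are always encoded as UTF-8 (the real algorithm takes the document encoding into account for the query component).

```python
from urllib.parse import quote, urljoin

# Characters left untouched: URI reserved characters, "%" (existing escapes)
# and the unreserved punctuation. Everything else is percent-encoded.
URI_SAFE = ":/?#[]@!$&'()*+,;=%~-._"


def html5_style_resolve(base: str, ref: str) -> str:
    """Hypothetical helper approximating the steps described in the text."""
    # "Non-standard" normalisation mentioned above: "\" becomes "/".
    ref = ref.replace("\\", "/")
    # Turn the IRI reference into a URI reference by percent-encoding
    # (as UTF-8, an assumption) the characters that URIs do not allow.
    ref = quote(ref, safe=URI_SAFE)
    # Resolve against the base; urljoin stands in for RFC 3986 resolution.
    return urljoin(base, ref)


print(html5_style_resolve("http://example.org/dir/page.html", "caf\u00e9\\menu"))
# prints http://example.org/dir/caf%C3%A9/menu
```

The result is always made up of ASCII characters only, which is the property that matters for the rest of this discussion.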

This behaviour by HTML5 was the subject of a long-running issue [4], which I believe Larry was involved in, which was eventually resolved to give the specification we see today.

In RDFa-Core [5], resolution of IRIs in all cases is done through the standard *IRI* resolution algorithm from RFC3987 [6].

The RDF restrictions on URIs used to identify resources are documented in its abstract semantics [7]. This currently normalises IRIs to URIs by percent-encoding non-ASCII characters, so currently the effective RDF generated from RDFa will contain URIs.

The new draft of the RDF 1.1 abstract semantics [8] allows IRIs to be used as identifiers for resources. For future RDF 1.1 implementations, the effective RDF generated from RDFa will contain IRIs. Importantly, since these IRIs are being used as identifiers, their equivalence will be assessed through string-equivalence rather than by first normalising to URIs and then comparing. [9]
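
The consequence of string-equivalence can be shown with a small Python sketch. The helper names to_uri and same_identifier are hypothetical, and the percent-encoding here is a simplification that assumes the non-ASCII characters appear only in the path (a non-ASCII host name would need different handling).

```python
from urllib.parse import quote

# Reserved and unreserved URI punctuation plus "%" are left untouched.
URI_SAFE = ":/?#[]@!$&'()*+,;=%~-._"


def to_uri(iri: str) -> str:
    """Percent-encode (as UTF-8) the characters not allowed in URIs."""
    return quote(iri, safe=URI_SAFE)


def same_identifier(a: str, b: str) -> bool:
    """Plain string comparison, as used for identifiers."""
    return a == b


iri = "http://example.org/caf\u00e9"
uri = to_uri(iri)

print(uri)                                         # http://example.org/caf%C3%A9
print(same_identifier(iri, uri))                   # False
print(same_identifier(to_uri(iri), to_uri(uri)))   # True
```

Under string comparison the IRI and its percent-encoded form are two different identifiers, even though they become identical once both are normalised to URIs.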

CURRENT ISSUES

This is a problem for people using RDFa generally because the resolution of those URL attributes defined in HTML5 (@href, @src etc) differs from the resolution of URL attributes defined by RDFa (@resource, @typeof etc). Specifically:

 * normalising IRIs to URIs and then resolving according to RFC3986 (URI resolution) might not (?) produce the same results as resolving IRIs according to RFC3987 (IRI resolution) and then normalising to a URI
 * URLs that contain "\" characters will definitely be treated differently in the two cases
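
The two orders of operation can be compared side by side in a rough Python sketch. The names html5_route and iri_route are hypothetical, urljoin is used as a stand-in for both the RFC3986 and RFC3987 resolution algorithms, and "\" is treated as an opaque character on the IRI route even though it is not strictly valid there. The sketch only illustrates the "\" case; it does not settle whether the two routes can differ in other cases.

```python
from urllib.parse import quote, urljoin

# Reserved and unreserved URI punctuation plus "%" are left untouched.
URI_SAFE = ":/?#[]@!$&'()*+,;=%~-._"


def to_uri(text: str) -> str:
    """Percent-encode (as UTF-8) the characters not allowed in URIs."""
    return quote(text, safe=URI_SAFE)


def html5_route(base: str, ref: str) -> str:
    # Rewrite "\" to "/", normalise to a URI reference, then resolve.
    return urljoin(base, to_uri(ref.replace("\\", "/")))


def iri_route(base: str, ref: str) -> str:
    # Resolve the reference as given, then normalise the result to a URI.
    return to_uri(urljoin(base, ref))


base = "http://example.org/dir/page.html"
for ref in ("caf\u00e9", "a\\b"):
    print(repr(ref), html5_route(base, ref), iri_route(base, ref))
```

With these inputs the non-ASCII reference comes out the same on both routes, while the reference containing "\" yields http://example.org/dir/a/b on the first route and http://example.org/dir/a%5Cb on the second.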

It is also a problem when people are using RDFa and microdata side-by-side (or switching between them) because the URIs they use within @itemid, @itemtype and @itemprop will not be handled in the same way as those within @about, @typeof and @property, resulting in slightly different data in the two cases.

FUTURE ISSUES

These discrepancies will be worse when RDF 1.1 is standardised and used with HTML+RDFa, as at that point some of the identifiers generated from HTML+RDFa processing will be normalised to URIs (those in @href, @src etc attributes) while others will be IRIs (those in @resource, @typeof, @property etc attributes).

WHAT TO DO

The RDFWAWG has opened rdfa-ISSUE-114 on this [10] which is on their agenda for discussion tomorrow. Do we have any advice for them? Options as I see them are:

 1. Advising that RDFa processing over HTML5 (and XHTML5) ignores the HTML5 URL 
    resolution algorithm and uses standard IRI resolution. This ensures that
    all RDF (including 1.1) can be expressed within web pages using RDFa but
    will lead to inconsistent identifiers in DOM and RDF, and implementation
    problems for client-side RDFa parsers that use the DOM. It also makes
    mixing and switching between microdata and RDFa difficult where IRIs are
    used.

 2. Advising that RDFa processing over HTML5 (and XHTML5) adopts the HTML URL
    resolution algorithm. This limits the RDF (1.1) that can be expressed 
    within (X)HTML pages using RDFa to that which identifies resources with 
    URIs (not IRIs) but ensures consistent identifiers are used within the DOM
    and RDF generated from the DOM, eases RDFa implementation on the client
    and makes it easier to mix/move between microdata and RDFa. However, it
    means that the RDF generated from RDFa in a source XML document and the
    RDF generated from equivalent RDFa in an equivalent source HTML document 
    might not be the same where URIs containing "\" or non-URI IRIs are used.

(There's a third, 'do nothing' option, of course, but that's significantly worse than either of the above.)

More generally, I don't see any point in reopening the issue on URLs in HTML5. I could, however, open a bug for URLs appearing in @itemid, @itemtype and @itemprop. These are always absolute URLs, so they do not technically have to go through the resolution algorithm, and they are identifiers, so according to the IRI specification they should not undergo any normalisation.

My feeling is that the TAG ought to have some kind of advice for people writing specs about technologies that use IRIs and that interact with HTML5 to warn them about the side-effects.

In particular, I'm not sure whether this has been picked up in the HTML/XML Task Force and it might have an impact when you start looking at the behaviour of the XML stack.

Cheers,

Jeni

[1]  http://dev.w3.org/html5/spec/urls.html#valid-url
[2]  http://dev.w3.org/html5/spec/urls.html#resolving-urls
[3]  http://tools.ietf.org/html/rfc3986
[4]  http://www.w3.org/html/wg/tracker/issues/56
[5]  http://www.w3.org/2010/02/rdfa/drafts/2011/ED-rdfa-core-20111020/#s_curieprocessing
[6]  http://www.ietf.org/rfc/rfc3987.txt
[7]  http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref
[8]  http://www.w3.org/TR/rdf11-concepts/#section-IRI-Vocabulary
[9]  http://tools.ietf.org/html/rfc3987#section-5
[10] http://www.w3.org/2010/02/rdfa/track/issues/114
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Wednesday, 2 November 2011 17:31:22 UTC