Re: URI Comparisons: RFC 2616 vs. RDF from Nathan on 2011-01-20 (public-lod@w3.org from January 2011)

From: Nathan <nathan@webr3.org>
Date: Thu, 20 Jan 2011 10:15:37 +0000
To: Alan Ruttenberg <alanruttenberg@gmail.com>
CC: David Wood <david@3roundstones.com>, Dave Reynolds <dave.e.reynolds@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>, Sandro Hawke <sandro@w3.org>
Message-ID: <4D380B49.7080409@webr3.org>
Alan Ruttenberg wrote:
> On Wed, Jan 19, 2011 at 4:45 PM, Nathan <nathan@webr3.org> wrote:
>> David Wood wrote:
>>> On Jan 19, 2011, at 10:59, Nathan wrote:
>>>
>>>> ps: as an illustration of how engrained URI normalization is, I've
>>>> capitalized the domain names in the to: and cc: fields, I do hope the mail
>>>> still comes through, and hope that you'll accept this email as being sent to
>>>> you. Hopefully we'll also find this mail in the archives shortly at
>>>> htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd
>>>> hope that any statements made using these URIs (asserted by man or machine)
>>>> would remain valid regardless of the (incorrect?-)casing.
>>>>
>>> Heh.  OK, I'll bite.  Domain names in email addressing are defined in IETF
>>> RFC 2822 (and its predecessor RFC 822), which defers the interpretation to
>>> RFC 1035 ("Domain names - implementation and specification").  RFC 1035
>>> section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are
>>> to be compared in a case-insensitive manner.
>>>
>>> As far as I know, the W3C specs do not so refer to RFC 1035.
>>>
>> And I'll bite in the other direction, why not treat URIs as URIs? why go
>> against both the RDF Specification [1] and the URI specification when they
>> say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force
>> case-sensitive matching on the scheme and domain on URIs matching the
>> generic syntax when the specs say must be compared case insensitively? and
>> so on and so forth.
> 
> [AR]
> Which specs?

The various URI/IRI specs and their previous revisions.
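
To make concrete what I mean by normalization, here is a minimal sketch in 
Python of the syntax-based steps as I read them in RFC 3986 section 6.2.2: 
lowercase the scheme and host, decode percent-encoded unreserved 
characters, and uppercase the hex digits of any escapes that remain. The 
function name normalize() and the helper names are mine, purely for 
illustration, and this is not a complete or authoritative implementation 
(it ignores dot-segment removal and scheme-specific rules, for instance).

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _fix_escape(match):
    """Decode an escape of an unreserved character; otherwise uppercase its hex."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return "%" + match.group(1).upper()


def normalize(uri):
    """Illustrative (hypothetical) helper: case and percent-encoding normalization only."""
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    # Only the host (and port) is case-insensitive; leave any userinfo alone.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((
        parts.scheme.lower(),
        netloc,
        _PCT.sub(_fix_escape, parts.path),
        _PCT.sub(_fix_escape, parts.query),
        _PCT.sub(_fix_escape, parts.fragment),
    ))


print(normalize("htTp://EXAMPLE.org/%7efoo"))  # http://example.org/~foo
```

Note that a reserved character such as %2f stays encoded (only its hex 
digits are uppercased), since decoding it would change the meaning of the 
URI.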

> http://www.w3.org/TR/REC-xml-names/#NSNameComparison
> 
> "URI references identifying namespaces
..
> In a namespace declaration, the URI reference is
..
> The URI references below are all different for the purposes of identifying
> namespaces
..
> The URI references below are also all different for the purposes of
> identifying namespaces
..
> So here is another spec that *explicitly* disagrees with the idea that URI
> normalization should be a built-in processing.

As far as I can see, that's only for a URI reference used within a 
namespace, and does not govern usage or normalization when you join the 
URI reference up with the local name to make the full URI.

Out of interest, where is that process defined? I was looking for it the 
other day - for instance in the quoted specification we have the example:

<edi:price xmlns:edi='http://ecommerce.example.org/schema' 
units='Euro'>32.18</edi:price>

Where's the bit of the XML specification which says you join them up by 
concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) 
and 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?
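
For what it's worth, here is a small sketch of what a namespace-aware 
parser actually exposes for that element, using Python's standard 
xml.etree.ElementTree. As far as I can tell, Namespaces in XML only gives 
you an expanded name as a (namespace name, local name) pair, and the 
parser below keeps the two parts separate rather than producing a single 
joined URI. Note also that 'Euro' is an attribute value rather than a 
name, and that the unprefixed attribute 'units' is in no namespace at all.

```python
import xml.etree.ElementTree as ET

# The example element from the Namespaces in XML specification, as quoted above.
doc = (
    "<edi:price xmlns:edi='http://ecommerce.example.org/schema' "
    "units='Euro'>32.18</edi:price>"
)

elem = ET.fromstring(doc)

# ElementTree reports the expanded name in "{namespace}local" form.
print(elem.tag)     # {http://ecommerce.example.org/schema}price

# The unprefixed attribute has no namespace; its value is just a string.
print(elem.attrib)  # {'units': 'Euro'}
```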

And finally, this is why I specifically asked if the non-normalization 
of RDF URI References had XML Namespace heritage, which had then 
filtered down through OWL, SPARQL and RIF.

> [AR] More to document, please: Which data is being junked and scrapped?

Will document, but essentially every statement made using a non-normalized 
URI when other statements are also being made about the "same" resource 
using normalized URIs. The two most common cases will be (a) people using 
"CMS" systems who enter their domain name in uppercase in some admin 
screen, only to have that filter through to URIs in serialized RDF/RDFa, 
and (b) bugs in software that have led to inconsistent URIs over time 
(for instance where % encoding has been fixed, or a :80 has been removed 
from a URI).
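
As a rough sketch of the effect, the Python below groups some made-up 
statements by subject, first by plain string comparison and then after a 
light normalization. The triples, the prefixed predicate names and the 
helper light_normalize() are all hypothetical, invented only for this 
illustration, and the normalization covers just the two cases mentioned 
above (host casing and an explicit :80).

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

# Hypothetical statements, as (subject, predicate, object) tuples.
triples = [
    ("http://example.org/~foo", "foaf:name", "Foo"),
    ("http://EXAMPLE.org/~foo", "foaf:mbox", "mailto:foo@example.org"),
    ("http://example.org:80/~foo", "foaf:homepage", "http://example.org/"),
]


def light_normalize(uri):
    """Lowercase the host and drop an explicit default :80 for http."""
    parts = urlsplit(uri)
    host = parts.netloc.lower()
    if parts.scheme == "http" and host.endswith(":80"):
        host = host[:-3]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def group_by_subject(statements, key=lambda uri: uri):
    """Collect (predicate, object) pairs under each subject key."""
    grouped = defaultdict(list)
    for subject, predicate, obj in statements:
        grouped[key(subject)].append((predicate, obj))
    return dict(grouped)


raw = group_by_subject(triples)                          # string comparison
merged = group_by_subject(triples, key=light_normalize)  # normalized first

print(len(raw), len(merged))  # 3 1
```

With string comparison the three statements describe three unrelated 
subjects; after normalization they all land on the one resource.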

> [AR] Hmm. Are you suggesting that the behavior of libraries and clients
> should have precedence over specification? My view is that one first looks
> to specifications, and then only if specifications are poor or do not speak
> to the issue do we look at existing behavior.

Yes I am, in that the specification should standardize the behaviour of 
libraries and clients. The level of normalization in URIs published, 
consumed or used by these tools is often determined by non sem web stack 
components, and the sem web components are blocked from normalizing 
these should-not-be-differing URIs by the sem web specifications.

> [AR] I think there are many ways to lose in this scenario. For instance, if
> the server redirects then the base is the last in the chain of redirects.
> http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the
> Retrieval URI. My conclusion - don't engineer this way.

That would be my conclusion too, but as RDF(a) moves into the realm of 
CMS systems and out of the hands of the sem web community, it will be 
increasingly engineered this way. It's a very common pattern when 
working with (X)HTML (it allows people to test locally or on dev servers 
without changing the content).
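
A minimal sketch of why this pattern matters, using urljoin from Python's 
standard urllib.parse: a relative reference inherits whatever casing the 
base happens to have, so the same content yields different absolute URIs 
under string comparison. The base URIs and the "#me" reference are made up 
for illustration.

```python
from urllib.parse import urljoin

# The same relative reference, as it might appear in a page's RDFa.
relative_ref = "#me"

# Two hypothetical bases that differ only in the casing of the host.
base_as_typed = "http://EXAMPLE.org/profile"
base_lowercase = "http://example.org/profile"

subject_a = urljoin(base_as_typed, relative_ref)
subject_b = urljoin(base_lowercase, relative_ref)

print(subject_a)               # http://EXAMPLE.org/profile#me
print(subject_b)               # http://example.org/profile#me
print(subject_a == subject_b)  # False
```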

> Further, essentially all RDFa ever encountered by a browser has the casing
> on all URIs in href and src, and all these which are resolved, automatically
> normalized - so even if you set the base to <htTp://EXAMPLE.org/> or use it
> in a URI, browser tools, extensions, and js based libraries will only ever
> see the normalized URIs (and thus be incompatible with the rest of the RDF
> world).
> 
> [AR] Again, I think things are worse than possible to repair, if you take
> the position that you need to make it work for deployed systems. As an
> example I tried the following. On my mac I created the file
> /Users/alanr/Desktop/foo.html. The contents
> were: <script>alert(document.location);</script>. From the command line I
> tried:
> 
> open file:///Users/alanr/Desktop/foo.html  ->
> alert file:///Users/alanr/Desktop/foo.html
> open file:///Users/alanr/Desktop/Foo.html ->
> alert file:///Users/alanr/Desktop/Foo.html
> open FILE:///Users/alanr/Desktop/foo.html ->
> alert file:///Users/alanr/Desktop/foo.html
> open file:///Users/alanr/Desktop/%66oo.html
> -> alert file:///Users/alanr/Desktop/foo.html
> open file:///Users/alanr/Desktop/%46oo.html
> -> alert file:///Users/alanr/Desktop/Foo.html

Indeed, if you tried that in Chrome/IE you'd get full normalization, in 
Opera you'd get something similar to the above, and in Firefox something 
different again (unsure if it also differs per OS). See:

   http://webr3.org/urinorm/html

I did some testing this way yesterday and flagged it up with Adam Barth, 
who's handling the URI/URL canonicalization and normalization for the 
HTML / webapps specifications [1].

The results /really/ affect RDFa Processing, see:

   http://webr3.org/urinorm/2

And as a member of the RDFa WG, focussed mainly on the API 
specifications, this is a real problem that needs solving. @href and 
@src are governed by HTML, access to the URIs within is via the DOM 
specifications, the common implementations all provide normalization 
as standard, and as far as I can tell Adam Barth will be aligning 
expected normalization and canonicalization in the specifications. RDFa, 
sitting at the intersection of all this, is very much affected.

Note, those are not my only reasons for flagging up this normalization 
issue - (issue imo, time will tell if it is considered, or made, an 
issue by the RDF WG).

[1] 
http://lists.w3.org/Archives/Public/public-html-comments/2011Jan/0004.html

> [AR] In this case your conjecture is shown to be partially true. The scheme
> URI is made case insensitive. However the pathname is not normalized,
> and results in mistakes in the intended base, in the case of file: URLs
> seems to depend on the case sensitivity of the file system on which the URL
> is resolved, something a generic processor could not possibly know.

As above, you'll find increasing steps of normalization by different 
vendors; Chrome and IE, for example, do "full" normalization.


> Finally, I'll ask again, if anybody has any use case which benefits from
> <htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed as
> different RDF URIs, I'd love to hear it.
> 
> [AR] Backwards compatibility of OWL, RIF, and SPARQL.

Surely that's only an issue if somebody somewhere has data which would 
be negatively impacted by URIs being normalized - is there such a case? 
It may also be wise to consider whether people would benefit from URI 
normalization with regard to RDF, OWL, RIF and SPARQL - and if so, surely 
there's a case for raising bugs and fixing this throughout the sem web 
specifications.

Cheers,

Nathan
Received on Thursday, 20 January 2011 10:16:45 UTC