Re: URI Comparisons: RFC 2616 vs. RDF from Alan Ruttenberg on 2011-01-20 (public-lod@w3.org from January 2011)

From: Alan Ruttenberg <alanruttenberg@gmail.com>
Date: Thu, 20 Jan 2011 00:31:46 -0500
To: nathan@webr3.org
Cc: David Wood <david@3roundstones.com>, Dave Reynolds <dave.e.reynolds@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <AANLkTik4Jy5BKyCxMWPo12QJkuLZB==hEwy=NkPyg8-h@mail.gmail.com>
[for some reason my client isn't quoting previous mail properly, so my
comments are prefixed with [AR]]

On Wed, Jan 19, 2011 at 4:45 PM, Nathan <nathan@webr3.org> wrote:

> David Wood wrote:
>
>> On Jan 19, 2011, at 10:59, Nathan wrote:
>>
>>> ps: as an illustration of how engrained URI normalization is, I've
>>> capitalized the domain names in the to: and cc: fields, I do hope the mail
>>> still comes through, and hope that you'll accept this email as being sent to
>>> you. Hopefully we'll also find this mail in the archives shortly at
>>> htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd
>>> hope that any statements made using these URIs (asserted by man or machine)
>>> would remain valid regardless of the (incorrect?-)casing.
>>>
>>
>> Heh.  OK, I'll bite.  Domain names in email addressing are defined in IETF
>> RFC 2822 (and its predecessor RFC 822), which defers the interpretation to
>> RFC 1035 ("Domain names - implementation and specification").  RFC 1035
>> section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are
>> to be compared in a case-insensitive manner.
>>
>> As far as I know, the W3C specs do not so refer to RFC 1035.
>>
>
> And I'll bite in the other direction: why not treat URIs as URIs? Why go
> against both the RDF specification [1] and the URI specification when they
> say /not/ to encode permitted US-ASCII characters (like ~ as %7E)? Why force
> case-sensitive matching on the scheme and domain of URIs matching the
> generic syntax when the specs say they must be compared case-insensitively?
> And so on and so forth.
>

[AR]
Which specs? (or is it singular "spec") I just had a look at the XML
namespace spec, for instance, which partially governs the RDF/XML
serialization specification.
http://www.w3.org/TR/REC-xml-names/#NSNameComparison

"URI references identifying namespaces are compared when determining whether
a name belongs to a given namespace, and whether two names belong to the
same namespace. [Definition: The two URIs are treated as strings, and they
are *identical* if and only if the strings are identical, that is, if they
are the same sequence of characters. ] The comparison is case-sensitive, and
no %-escaping is done or undone.

A consequence of this is that URI references which are not identical in this
sense may resolve to the same resource. Examples include URI references
which differ only in case or %-escaping, or which are in external entities
which have different base URIs (but note that relative URIs are deprecated
as namespace names).

In a namespace declaration, the URI reference is the normalized
value<http://www.w3.org/TR/REC-xml/#AVNormalize> of
the attribute, so replacement of XML character and entity references has
already been done before any comparison.

Examples:

The URI references below are all different for the purposes of identifying
namespaces, since they differ in case:

http://www.example.org/wine

http://www.Example.org/wine

http://www.example.org/Wine

The URI references below are also all different for the purposes of
identifying namespaces:

http://www.example.org/~wilbur

http://www.example.org/%7ewilbur

http://www.example.org/%7Ewilbur"

So here is another spec that *explicitly* disagrees with the idea that URI
normalization should be built into processing.
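
To make the contrast concrete, here is a minimal Python sketch that compares
the example URIs from the quoted passage twice: once as plain strings (the
XML Namespaces rule) and once after a simplified syntax-based normalization
in the spirit of RFC 3986 section 6.2.2. The helper names `normalize` and
`_fix_escape` are hypothetical, and the normalizer assumes simple http-style
URIs with no userinfo; it is an illustration, not a complete implementation.

```python
import re
from urllib.parse import urlsplit

# Characters that RFC 3986 calls "unreserved"; escapes of these can be decoded.
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

def _fix_escape(match):
    """Decode an escape of an unreserved character, else upper-case its hex digits."""
    ch = chr(int(match.group(1), 16))
    return ch if ch in UNRESERVED else match.group(0).upper()

def normalize(uri):
    """Simplified normalization: lower-case scheme and host, tidy %-escapes in the path."""
    parts = urlsplit(uri)
    path = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, parts.path)
    result = f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"
    if parts.query:
        result += "?" + parts.query
    if parts.fragment:
        result += "#" + parts.fragment
    return result

case_examples = [
    "http://www.example.org/wine",
    "http://www.Example.org/wine",
    "http://www.example.org/Wine",
]
escape_examples = [
    "http://www.example.org/~wilbur",
    "http://www.example.org/%7ewilbur",
    "http://www.example.org/%7Ewilbur",
]

for group in (case_examples, escape_examples):
    as_strings = set(group)                       # XML Namespaces: string identity
    as_normalized = {normalize(u) for u in group}  # after the sketch normalizer
    print(len(as_strings), "distinct as strings;",
          len(as_normalized), "distinct after normalization")
```

Under string identity every URI in each group is a different name. After
normalization only the path-case difference (/wine versus /Wine) survives,
which is exactly the gap between the two readings being argued about here.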


> I have to be honest, I can't see what good this is doing anybody, in fact
> it's the complete opposite scenario, where data is being junked and scrapped
> because we are ignoring the specifications which are designed to enable
> interoperability and limit unexpected behaviour.
>

[AR] More to document, please: Which data is being junked and scrapped?

> I'm currently preparing a list of errors I'm finding in RDF, RDFa and
> linked data tooling to do with this, and I have to admit even I'm surprised
> at the sheer number of tools which are affected.

> Additionally there's a very nasty, and common, use case which I can't test
> fully, so would appreciate people taking the time to check their own
> libraries/clients, as follows:

[AR] Hmm. Are you suggesting that the behavior of libraries and clients
should have precedence over specification? My view is that one first looks
to specifications, and then only if specifications are poor or do not speak
to the issue do we look at existing behavior.

> If you find some data with the following setup (example):

 @base <htTp://EXAMPLE.org/foo/bar> .
 <#t> x:rel <../baz> .

and then you "follow your nose" to <htTp://EXAMPLE.org/baz>, will you find
any triples about it? (problem 1) and if there's no base on the second
resource, and it uses relative URIs, then the base you'll be using is
<htTp://EXAMPLE.org/baz>, and thus, you'll effectively create a new set of
statements which the author never wrote, or intended (problem 2).

In other words, in this scenario, no matter what you do you're either going
to get no data (even though it's there) or get a set of statements which
were never said by the author (because the casing is different).
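
One way to check a library against this scenario is to resolve the two
relative references from the example and look at what comes back. The sketch
below uses Python's standard urllib.parse.urljoin purely as a stand-in for
"a client"; the variable names are illustrative, and other RDF toolkits may
well behave differently, which is the point of the exercise.

```python
from urllib.parse import urljoin

# The base from the example above, with its unusual casing left intact.
BASE = "htTp://EXAMPLE.org/foo/bar"

subject = urljoin(BASE, "#t")      # resolves <#t> against the base
obj = urljoin(BASE, "../baz")      # resolves <../baz> against the base

print("subject:", subject)
print("object: ", obj)

# RDF compares URI references as strings, so the two checks below can disagree.
print("identical to lower-cased form:", obj == "http://example.org/baz")
print("equal ignoring case:          ", obj.lower() == "http://example.org/baz")
```

If the two final lines print different answers, the library has changed some
of the casing but not all of it, and the triples it produces will match
neither the author's spelling nor a fully normalized one.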

[AR] I think there are many ways to lose in this scenario. For instance, if
the server redirects then the base is the last URI in the chain of redirects.
See http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the
Retrieval URI. My conclusion - don't engineer this way.

Further, essentially all RDFa ever encountered by a browser has the casing
of all URIs in href and src, and of all those which are resolved,
automatically normalized - so even if you set the base to
<htTp://EXAMPLE.org/> or use it in a URI, browser tools, extensions, and js
based libraries will only ever see the normalized URIs (and thus be
incompatible with the rest of the RDF world).

[AR] Again, I think things are beyond repair if you take the position that
you need to make it work for deployed systems. As an example I tried the
following. On my Mac I created the file /Users/alanr/Desktop/foo.html. The
contents were: <script>alert(document.location);</script>. From the command
line I tried:

open file:///Users/alanr/Desktop/foo.html
-> alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/Foo.html
-> alert file:///Users/alanr/Desktop/Foo.html
open FILE:///Users/alanr/Desktop/foo.html
-> alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%66oo.html
-> alert file:///Users/alanr/Desktop/foo.html
open file:///Users/alanr/Desktop/%46oo.html
-> alert file:///Users/alanr/Desktop/Foo.html

[AR] In this case your conjecture is shown to be partially true. The URI
scheme is normalized to lower case and the %-escapes are decoded. However,
the case of the pathname is not normalized, and this results in mistakes in
the intended base. For file: URLs the outcome seems to depend on the case
sensitivity of the file system on which the URL is resolved, something a
generic processor could not possibly know.
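
The pattern in the transcript above can be reproduced with a few lines of
Python. This is only a sketch of the observed behaviour, not of what any
browser actually implements: the function name `observed_form` is made up,
and decoding every %-escape is an assumption that holds for simple paths like
these but not for escapes of reserved characters.

```python
from urllib.parse import urlsplit, unquote

def observed_form(url):
    """Mimic the alerts above: lower-case scheme, decoded escapes, path case kept."""
    parts = urlsplit(url)  # urlsplit already lower-cases the scheme
    return f"{parts.scheme}://{parts.netloc}{unquote(parts.path)}"

tried = [
    "file:///Users/alanr/Desktop/foo.html",
    "file:///Users/alanr/Desktop/Foo.html",
    "FILE:///Users/alanr/Desktop/foo.html",
    "file:///Users/alanr/Desktop/%66oo.html",
    "file:///Users/alanr/Desktop/%46oo.html",
]

for url in tried:
    print(url, "->", observed_form(url))
```

Note that nothing in the function can tell whether foo.html and Foo.html are
the same file; that depends on the file system, not on the URL.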

I'll continue on getting the specific examples for current RDF tooling and
resources and get it on the wiki, but I'll say now that almost every tool
I've encountered so far "does it wrong" in inconsistent non-compatible ways.

Finally, I'll ask again, if anybody has any use case which benefits from
<htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed as
different RDF URIs, I'd love to hear it.

[AR] Backwards compatibility of OWL, RIF, and SPARQL.

[1] """The encoding consists of: ... 2. %-escaping octets that do not
correspond to permitted US-ASCII characters."""
 - http://www.w3.org/TR/rdf-concepts/#section-Graph-URIref

Best,

Nathan
Received on Thursday, 20 January 2011 05:32:35 UTC