Re: URI Comparisons: RFC 2616 vs. RDF from Harry Halpin on 2011-01-20 (public-lod@w3.org from January 2011)

From: Harry Halpin <hhalpin@ibiblio.org>
Date: Fri, 21 Jan 2011 00:40:58 +0100
To: nathan@webr3.org
Cc: Alan Ruttenberg <alanruttenberg@gmail.com>, David Wood <david@3roundstones.com>, Dave Reynolds <dave.e.reynolds@gmail.com>, "public-lod@w3.org" <public-lod@w3.org>, Sandro Hawke <sandro@w3.org>
Message-ID: <AANLkTin_fz6-FuSCxp3aTJ9VxBZN58b80YMRe_DG8rEH@mail.gmail.com>
On Thu, Jan 20, 2011 at 11:15 AM, Nathan <nathan@webr3.org> wrote:
> Alan Ruttenberg wrote:
>>
>> On Wed, Jan 19, 2011 at 4:45 PM, Nathan <nathan@webr3.org> wrote:
>>>
>>> David Wood wrote:
>>>>
>>>> On Jan 19, 2011, at 10:59, Nathan wrote:
>>>>
>>>>> ps: as an illustration of how engrained URI normalization is, I've
>>>>> capitalized the domain names in the to: and cc: fields, I do hope the
>>>>> mail
>>>>> still comes through, and hope that you'll accept this email as being
>>>>> sent to
>>>>> you. Hopefully we'll also find this mail in the archives shortly at
>>>>> htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally
>>>>> I'd
>>>>> hope that any statements made using these URIs (asserted by man or
>>>>> machine)
>>>>> would remain valid regardless of the (incorrect?-)casing.
>>>>>
>>>> Heh.  OK, I'll bite.  Domain names in email addressing are defined in
>>>> IETF
>>>> RFC 2822 (and its predecessor RFC 822), which defers the interpretation
>>>> to
>>>> RFC 1035 ("Domain names - implementation and specification").  RFC 1035
>>>> section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP,
>>>> are
>>>> to be compared in a case-insensitive manner.
>>>>
>>>> As far as I know, the W3C specs do not so refer to RFC 1035.
>>>>
>>> And I'll bite in the other direction, why not treat URIs as URIs? why go
>>> against both the RDF Specification [1] and the URI specification when
>>> they
>>> say /not/ to encode permitted US-ASCII characters (like ~ %7E)? why force
>>> case-sensitive matching on the scheme and domain on URIs matching the
>>> generic syntax when the specs say must be compared case insensitively?
>>> and
>>> so on and so forth.
>>
>> [AR]
>> Which specs?
>
> The various URI/IRI specs and previous revisions of.
>
>> http://www.w3.org/TR/REC-xml-names/#NSNameComparison
>>
>> "URI references identifying namespaces
>
> ..
>>
>> In a namespace declaration, the URI reference is
>
> ..
>>
>> The URI references below are all different for the purposes of identifying
>> namespaces
>
> ..
>>
>> The URI references below are also all different for the purposes of
>> identifying namespaces
>
> ..
>>
>> So here is another spec that *explicitly* disagrees with the idea that URI
>> normalization should be a built-in processing.
>
> As far as I can see, that's only for a URI reference used within a
> namespace, and does not govern usage or normalization when you join the URI
> reference up with the local name to make the full URI.
>
> Out of interest, where is that process defined? I was looking for it the
> other day - for instance in the quoted specification we have the example:
>
> <edi:price xmlns:edi='http://ecommerce.example.org/schema'
> units='Euro'>32.18</edi:price>
>
> Where's the bit of the XML specification which says you join them up by
> concatenating 'http://ecommerce.example.org/schema' with #(?assumed?) and
> 'Euro' to get 'http://ecommerce.example.org/schema#Euro'?
>

Actually you don't. A namespaced name in XML is just that - a tuple
(namespace, localname), never a concatenated URI. That's why namespaces
in XML are for all intents and purposes broken and why, to a large
extent, Web browser developers stopped using them in HTML, hate
implementing them in the DOM, and so refuse to have them in HTML5. And
that's one reason RDF(A) will probably continue getting a sort of bad
rap in the HTML world, as prefixes are not associated with just making
URIs, but with this terrible namespace tuple.
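
To illustrate the tuple point, here is a minimal sketch in Python using
the standard library's xml.etree.ElementTree on the edi:price example
quoted above. The helper name split_expanded_name is made up for this
illustration; the "{namespace}local" spelling is ElementTree's own
convention, not something the XML specs define.

```python
import xml.etree.ElementTree as ET

# The example element from the Namespaces in XML spec, as quoted above.
doc = ("<edi:price xmlns:edi='http://ecommerce.example.org/schema' "
       "units='Euro'>32.18</edi:price>")
root = ET.fromstring(doc)


def split_expanded_name(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local).

    Hypothetical helper for illustration only.
    """
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag  # the name is in no namespace


# The element name is a pair, not one joined-up URI.
element_name = split_expanded_name(root.tag)

# Unprefixed attributes are in no namespace at all, and 'Euro' is only an
# attribute value, so nothing here yields '...schema#Euro'.
attribute_names = [split_expanded_name(name) for name in root.attrib]

print(element_name)
print(attribute_names)
```

Any joining of the two halves into a single URI (with or without a "#")
is a convention layered on top by RDF/XML, RDFa and friends, not
something XML itself does.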

For an archaeology of the relevant standards, check out the section
"What Namespaces Do" of this paper. While the paper is focussed on why
namespace documents are a mess, the relevant information is in that
section and extensively referenced, with examples:

http://xml.coverpages.org/HHalpinXMLVS-Extreme.html

> And finally, this is why I specifically asked if the non-normalization of
> RDF URI References had XML Namespace heritage, which had then filtered down
> through OWL, SPARQL and RIF.

Indeed, they should be normalized in a sane manner across all Semantic
Web specs, and dependencies on XML Namespaces should obviously be
dropped IMHO.
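
As a rough sketch of what "normalized in a sane manner" could mean, the
Python below applies a few of the normalizations discussed in this
thread (lowercase scheme and host, decode percent-encoded unreserved
characters such as %7E, uppercase the remaining percent-encodings, drop
a default :80). The function name normalize_uri and the choice of steps
are assumptions for illustration, loosely following RFC 3986 section
6.2; it is not what any Semantic Web spec currently mandates.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: these never need percent-encoding.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")
DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed subset of schemes
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(text):
    """Decode %XX for unreserved characters, uppercase the hex otherwise."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def normalize_uri(uri):
    """Hypothetical normalizer; a sketch, not a complete implementation."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""        # .hostname is already lowercased
    if ":" in host:                    # IPv6 literal: restore the brackets
        host = "[" + host + "]"
    netloc = host
    if parts.username is not None:     # keep any userinfo untouched
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        netloc = userinfo + "@" + netloc
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc += ":" + str(parts.port)
    path = _normalize_pct(parts.path)
    if scheme in DEFAULT_PORTS and not path:
        path = "/"                     # empty http(s) path is equivalent to "/"
    return urlunsplit((scheme, netloc, path,
                       _normalize_pct(parts.query),
                       _normalize_pct(parts.fragment)))


a = "htTp://EXAMPLE.org/%7efoo"
b = "http://example.org:80/~foo"
print(a == b)                                # compared as plain strings
print(normalize_uri(a) == normalize_uri(b))  # compared after normalizing
```

Compared as plain strings, which is how RDF URI references are compared
today, the two differ; after this kind of normalization they are the
same URI.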

>
>> [AR] More to document, please: Which data is being junked and scrapped?
>
> will document, but essentially every statement made using a non normalized
> URI when other statements are also being made about the "same" resource
> using normalized URIs - the two most common cases for this will be when
> people are using "CMS" systems and enter their domain name as uppercase in
> some admin, only to have that filter through to URIs in serialized RDF/RDFa,
> and where bugs in software have led to inconsistent URIs over time (for
> instance where % encoding has been fixed, or a :80 has been removed from a
> URI).
>
>> [AR] Hmm. Are you suggesting that the behavior of libraries and clients
>> should have precedence over specification? My view is that one first looks
>> to specifications, and then only if specifications are poor or do not
>> speak
>> to the issue do we look at existing behavior.

Which is the case with namespaces and URI normalization :)

>
> Yes I am, that specification should standardize the behaviour of libraries
> and clients - the level of normalization in URIs published, consumed or used
> by these tools is often determined by non sem web stack components, and the
> sem web components are blocked from normalizing these
> should-not-be-differing-URIs by the sem web specifications.
>
>> [AR] I think there are many ways to lose in this scenario. For instance,
>> if
>> the server redirects then the base is the last in the chain of redirects.
>> http://tools.ietf.org/html/rfc3986#page-29, 5.1.3. Base URI from the
>> Retrieval URI. My conclusion - don't engineer this way.
>
> That would be my conclusion too, but as RDF(a) moves in to the realms of the
> CMS systems and out of the hands of the sem web community, it will be
> increasingly engineered this way, it's a very common pattern when working
> with (X)HTML (allows people to test locally or on dev servers without
> changing the content).
>
>> Further, essentially all RDFa ever encountered by a browser has the casing
>> on all URIs in href and src, and all these which are resolved,
>> automatically
>> normalized - so even if you set the base to <htTp://EXAMPLE.org/> or use
>> it
>> in a URI, browser tools, extensions, and js based libraries will only ever
>> see the normalized URIs (and thus be incompatible with the rest of the RDF
>> world).
>>
>> [AR] Again, I think things are worse than possible to repair, if you take
>> the position that you need to make it work for deployed systems. As an
>> example I tried the following. On my mac I created the file
>> /Users/alanr/Desktop/foo.html. The contents
>> were: <script>alert(document.location);</script>. From the command line I
>> tried:
>>
>> open file:///Users/alanr/Desktop/foo.html  ->
>> alert file:///Users/alanr/Desktop/foo.html
>> open file:///Users/alanr/Desktop/Foo.html ->
>> alert file:///Users/alanr/Desktop/Foo.html
>> open FILE:///Users/alanr/Desktop/foo.html ->
>> alert file:///Users/alanr/Desktop/foo.html
>> open file:///Users/alanr/Desktop/%66oo.html
>> -> alert file:///Users/alanr/Desktop/foo.html
>> open file:///Users/alanr/Desktop/%46oo.html
>> -> alert file:///Users/alanr/Desktop/Foo.html
>
> Indeed, if you tried that in chrome/IE you'd get full normalization, in
> Opera you'd get something similar to above, and in firefox different again
> (unsure if it also differs per OS). See:
>
>  http://webr3.org/urinorm/html
>
> I did some testing this way yesterday and flagged it up with Adam Barth
> who's handling the URI/URL canonicalization and normalization for the HTML /
> webapps specifications yesterday [1].
>
> The results /really/ affect RDFa Processing, see:
>
>  http://webr3.org/urinorm/2
>
> And as a member of the RDFa WG, focussed mainly on the API specifications,
> this is a real problem that needs solved - @href and @src are governed by
> HTML, access to the URIs within is via the DOM specifications, and the
> common implementations all provide normalization as standard, and as far as
> I can tell Adam Barth will be aligning expected normalization and
> canonicalization in the specifications. RDFa sitting at the intersection of
> this is very affected.
>
> Note, those are not my only reasons for flagging up this normalization issue
> - (issue imo, time will tell if it is considered, or made, an issue by the
> RDF WG).
>
> [1]
> http://lists.w3.org/Archives/Public/public-html-comments/2011Jan/0004.html
>
>> [AR] In this case your conjecture is shown to be partially true. The
>> scheme
>> URI is made case insensitive. However the pathname is not normalized,
>> and results in mistakes in the intended base, in the case of file: URLs
>> seems to depend on the case sensitivity of the file system on which the
>> URL
>> is resolved, something a generic processor could not possibly know.
>
> As above, you'll find increasing steps of normalization by different
> vendors, chrome and IE for example do "full" normalization.
>
>
>> Finally, I'll ask again, if anybody has any use case which benefits from <
>> htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed as
>> different RDF URIs, I'd love to hear it.
>>
>> [AR] Backwards compatibility of OWL, RIF, and SPARQL.
>
> Surely that's only an issue if somebody somewhere has data which would be
> negatively impacted by URIs being normalized - is there such a case? It may
> also be wise to consider whether people would benefit from URI normalization
> in regards to RDF, OWL, RIF and SPARQL - and if so surely there's a case for
> raising BUGs and fixing this throughout the sem web specifications.
>
> Cheers,
>
> Nathan
>
>
Received on Thursday, 20 January 2011 23:41:28 UTC