- From: Dave Reynolds <dave.e.reynolds@gmail.com>
- Date: Thu, 20 Jan 2011 13:08:12 +0000
- To: nathan@webr3.org
- Cc: "public-lod@w3.org" <public-lod@W3.ORG>
On Wed, 2011-01-19 at 21:45 +0000, Nathan wrote:
> David Wood wrote:
> > On Jan 19, 2011, at 10:59, Nathan wrote:
> >> ps: as an illustration of how engrained URI normalization is, I've capitalized the domain names in the to: and cc: fields, I do hope the mail still comes through, and hope that you'll accept this email as being sent to you. Hopefully we'll also find this mail in the archives shortly at htTp://lists.W3.org/Archives/Public/public-lod/2011Jan/ - Personally I'd hope that any statements made using these URIs (asserted by man or machine) would remain valid regardless of the (incorrect?-)casing.
> >
> > Heh. OK, I'll bite. Domain names in email addressing are defined in IETF RFC 2822 (and its predecessor RFC 822), which defers the interpretation to RFC 1035 ("Domain names - implementation and specification"). RFC 1035 section 2.3.3 states that domain names in DNS, and therefore in (E)SMTP, are to be compared in a case-insensitive manner.
> >
> > As far as I know, the W3C specs do not so refer to RFC 1035.
>
> And I'll bite in the other direction, why not treat URIs as URIs?

It seems to me the underlying question here is whether aliasing of URIs (whether they dereference to the same resource) should imply semantic equality (i.e. use as an identifier in a web logic language like RDF or OWL).

The position so far in RDF, OWL and RIF has been "no". As far as the specifications for those languages are concerned a URI is "just" a convenient spelling for an identifier and they require comparison of identifiers to be stable and context-independent. Those specs don't constrain what you get back from dereferencing some URI U to include statements about U.

The URI spec (rfc3986[1]) does allow this usage. In particular Section 6 Normalization and Comparison says:

"""URI comparison is performed for some particular purpose.
Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them."""

and

"""We use the terms "different" and "equivalent" to describe the possible outcomes of such comparisons, but there are many application-dependent versions of equivalence."""

While RDF predates this spec it seems to me that the RDF usage remains consistent with it. The purpose of comparison in RDF is different from that of cache retrieval of web pages or message delivery of email.

This quote also makes clear that there is no single definitive normalization. There are different levels of normalization possible depending on your needs.

Earlier you pointed out that the place where the URI specs and RDF do collide is in resolving relative URIs into absolute URIs. Again rfc3986 does not preclude the RDF usage. Section 5.2.1 says:

"""Normalization of the base URI, as described in Sections 6.2.2 and 6.2.3, is optional."""

So I claim that in terms of formal published specifications:

(1) RDF, OWL and RIF do not require any normalization of URIs (beyond the character encoding level) and compare URIs by simple string comparison.

(2) This usage is *not* precluded by the URI specs, at least by 3986 which sets the current framework for the application of scheme-specific specs.

**

Now we turn to linked data ...

As we've already mentioned :) there are no specs for linked data so we move onto more subjective grounds.

The linked data convention is that dereferencing some URI U in your RDF document should return information about U, including further onward links.
So if data set A spells a URI hTTp://example.com/foo but the data you get from dereferencing that URI talks only about http://example.com/foo then someone has a problem somewhere. The question is who, where and how to fix it.

It seems to me that this is primarily an issue with publishing, and a little about being sensible about how you pass on links.

If I'm going to put up some linked data I should mint normalized URIs; I should use the same spelling of the URIs throughout my data; I'll make sure those URIs dereference and that the data that comes back is stable and useful.

If someone else refers to my resources using an aliased URI (such as a different case for the protocol) and makes statements about those aliases then they have simply made a mistake.

To make sure that dereference returns what I expect, independent of aliasing, I should publish data with explicit base URIs (or just absolute URIs). Publishing with relative URIs and no base is a recipe for having your data look different from different places. Just don't do it. No surprise there.

None of this requires us to force URI normalization into the heart of identifier comparison in RDF itself. It is not a necessary solution and it is not a sufficient one because there is no universal normalization algorithm that would make all possible locator aliasing disappear.

> why go
> against both the RDF Specification [1] and the URI specification when
> they say /not/ to encode permitted US-ASCII characters (like ~ %7E)?

Where did that example come from? At what point have we suggested doing that?

> why
> force case-sensitive matching on the scheme and domain on URIs matching
> the generic syntax when the specs say must be compared case
> insensitively?

No, the specs do not say that, see above.
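
To make the two notions of comparison concrete, here is a minimal Python sketch contrasting simple string comparison (what RDF, OWL and RIF use for identifiers) with one possible syntax-based normalization along the lines of rfc3986 section 6.2.2. The helper name normalize_syntax is made up for illustration, and the normalization is deliberately partial (case of scheme and host, percent-encoding of unreserved characters); it is one choice among many, not a definitive algorithm.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters as listed in RFC 3986 section 2.3.
_UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(text):
    """Decode escapes of unreserved characters; uppercase the hex of the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def normalize_syntax(uri):
    """Hypothetical helper: partial syntax-based normalization.

    Assumption: only case and percent-encoding are handled. Dot-segment
    removal and scheme-specific rules (default ports and so on) are omitted.
    """
    parts = urlsplit(uri)
    # Only the host is case-insensitive; any userinfo before "@" is kept as-is.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return urlunsplit((
        parts.scheme.lower(),
        userinfo + at + hostport.lower(),
        _normalize_pct(parts.path),
        _normalize_pct(parts.query),
        _normalize_pct(parts.fragment),
    ))


a = "htTp://EXAMPLE.org/%7efoo"
b = "http://example.org/~foo"
print(a == b)                                      # simple string comparison
print(normalize_syntax(a) == normalize_syntax(b))  # this particular normalization
```

The first comparison treats the two spellings as different identifiers; the second treats them as equivalent, but only for the aliasing this sketch happens to cover, which is the sense in which no single normalization is definitive.
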
> Additionally there's a very nasty, and common, use case which I can't
> test fully, so would appreciate people taking the time to check their
> own libraries/clients, as follows:
>
> If you find some data with the following setup (example):
>
> @base <htTp://EXAMPLE.org/foo/bar> .
> <#t> x:rel <../baz> .
>
> and then you "follow your nose" to <htTp://EXAMPLE.org/baz>, will you
> find any triples about it? (problem 1)

Yes if that is the URI the publisher intended to use and has published data there.

If he/she actually uses http://example.org/baz then whoever gave you the original sample has corrupt data somehow. It so happens that in that case the dereference will "work" in the sense of giving you data but it will be about the correct URI not the broken one that you've been given.

> and if there's no base on the
> second resource, and it uses relative URIs, then the base you'll be
> using is <htTp://EXAMPLE.org/baz>, and thus, you'll effectively create a
> new set of statements which the author never wrote, or intended (problem 2).

Correct. Publishing data that way would be a bad idea. That problem need not, and can't, be solved by changing the comparison of identifiers in RDF.

> In other words, in this scenario, no matter what you do you're either
> going to get no data (even though it's there) or get a set of statements
> which were never said by the author (because the casing is different).

Someone has given you an erroneous URI, you either get no data or get data which might help you find the right URI.

> Further, essentially all RDFa ever encountered by a browser has the
> casing on all URIs in href and src, and all these which are resolved,
> automatically normalized - so even if you set the base to
> <htTp://EXAMPLE.org/> or use it in a URI, browser tools, extensions, and
> js based libraries will only ever see the normalized URIs (and thus be
> incompatible with the rest of the RDF world).

So use normalized URIs in the first place.
I've heard no use case for wanting to publish data about the URI htTp://EXAMPLE.org/ in the first place and if you did then using RDFa as the means of doing so would be perverse.

> I'll continue on getting the specific examples for current RDF tooling
> and resources and get it on the wiki, but I'll say now that almost every
> tool I've encountered so far "does it wrong" in inconsistent
> non-compatible ways.

The notion of "wrong" hasn't yet been made clear here.

> Finally, I'll ask again, if anybody has any use case which benefits from
> <htTp://EXAMPLE.org/%7efoo> and <http://example.org/~foo> being classed
> as different RDF URIs, I'd love to hear it.

Never heard of one.

RDF/OWL/RIF aren't designed the way they are because someone thought it would be a good idea to allow such things to be used side by side or because they *want* people to use denormalized URIs.

The point is that there is no single, simple, universal (i.e. across all schemes) normalization algorithm that could be used. The current approach gives stable, well-defined behaviour which doesn't change as people invent new URI schemes. The RDF serializations give you enough control to enable you to be certain about what URI you are talking about. Job done.

Choosing good URIs so they work nicely with your deployment and publishing solution is an important but different topic. I certainly think linked data publishers and publishing tools are better off minting normalized URIs.

Dave

[1] http://www.ietf.org/rfc/rfc3986.txt
Received on Thursday, 20 January 2011 13:08:57 UTC