- From: Mark Birbeck <mark.birbeck@webbackplane.com>
- Date: Tue, 29 Jul 2008 22:13:07 +0100
- To: "Manu Sporny" <msporny@digitalbazaar.com>
- Cc: "RDFa mailing list" <public-rdf-in-xhtml-tf@w3.org>
Hi Manu,

> Right, I didn't mean to imply that 'appending' will work in all cases
> (even though I'm not convinced that the statement is not true).

You began this thread by saying that the bug in librdfa was that in some circumstances the relative part was incorrectly being appended to the document, rather than the host name. So haven't you proved it yourself, that appending doesn't work in all cases? :)

What's happened is that you've now found two possible ways to append (to the document part or to the hostname), but I'm afraid that the algorithm for converting a relative URI to an absolute one involves yet further possibilities. For example, the relative path:

  sneaking_sally.mp3

should be appended to the end of the *path* part, replacing the document. And so on.

So the point is to use the 'proper' algorithm for turning a relative path into an absolute one, and you will always be ok, no matter what the URI is that you are dealing with (relative or not).

The big question then, is whether the spec actually says to do this.

> What you have said has got me wondering about what is correct,
> acceptable and incorrect, however.

You also had me wondering, too. I recalled investigating this quite a long time ago, and was starting to panic that I hadn't actually incorporated what I learned from my analysis into the spec. But thankfully I did:

  5.4. CURIE and URI Processing

  Since RDFa is ultimately a means for transporting RDF, then a key concept is the resource and its manifestation as a URI. Since RDF deals with complete URIs (not relative paths), then when converting RDFa to triples, any relative URIs will need to be resolved relative to the base URI, using the algorithm defined in section 5 of RFC 3986 [URI], Reference Resolution.

It certainly sounds like this point could do with being made more prominent, but hopefully you'll agree that such changes would merely be editorial, and that the spec itself is correct.
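
To make the difference between 'appending' and proper reference resolution concrete, here is a minimal sketch in Python. It is only an illustration, not librdfa's code: the base URI, the sample references and the helper names `resolve` and `naive_append` are all assumptions made up for this example. It relies on the standard library's `urllib.parse.urljoin`, which handles relative references like these in the way section 5 of RFC 3986 describes.

```python
from urllib.parse import urljoin

# Hypothetical base URI of the document being parsed (an assumption for this sketch).
BASE = "http://example.org/fuzzbot/demo/page.html"


def naive_append(reference: str, base: str = BASE) -> str:
    """Plain string concatenation: the approach that breaks down."""
    return base + reference


def resolve(reference: str, base: str = BASE) -> str:
    """Resolve a relative reference against the base URI using urljoin."""
    return urljoin(base, reference)


# Illustrative inputs: a bare file name, an absolute path, and a path with dot segments.
SAMPLES = [
    "sneaking_sally.mp3",
    "/live/sneaking_sally.mp3",
    "../../live/sneaking_sally.mp3",
]

if __name__ == "__main__":
    for ref in SAMPLES:
        print(f"{ref!r:35} naive:    {naive_append(ref)}")
        print(f"{'':35} resolved: {resolve(ref)}")
```

With this base, the bare file name replaces `page.html` at the end of the path, while the other two references both resolve to `http://example.org/live/sneaking_sally.mp3`. Concatenation gets none of the three right.
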
(See below for a further mention in the spec of this issue, but in the context of CURIEs.)

>> <http://rdfa.digitalbazaar.com/fuzzbot/demo/../../live/sneaking_sally.mp3>
>
> I realize that the URL above is not optimal, but is it "wrong"? RFC-1738
> says that the URL is valid (if I'm reading the RFC correctly):
>
> ftp://ftp.isi.edu/in-notes/rfc1738.txt

First, note that [1] updates RFC 1738.

Second, you're right that the URI is not 'wrong'. But the only way to obtain such a URI would be to enter it exactly as you have shown it. I.e., you can't create such a URI by beginning with a relative path and making it absolute, since the only way to do that is according to section 5 of [1], and that algorithm clearly shows how the dot segments would be removed.

But also, if you query your triple store for everything the store knows about this:

  <http://rdfa.digitalbazaar.com/live/sneaking_sally.mp3>

will you also get back information about:

  <http://rdfa.digitalbazaar.com/fuzzbot/demo/../../live/sneaking_sally.mp3>

If you do, then that's great...but I'd also be really surprised; I would imagine that once the URI is in the store, it's treated pretty much like a string.

> Is it the RDFa parser's job to normalize URLs? I can certainly see the
> argument for why it should tidy up URLs, but I don't think this is a MUST.

I think it should, for two reasons, one concerning RDFa in general, and the other relating to its particular manifestation as XHTML+RDFa.

The first reason is that RDF deals with absolute URIs. So any relative paths have to be made absolute somehow, when creating triples. RFC 3986 [1] has a simple algorithm for doing this, which also has the effect of removing dot segments. So if we were not to use that algorithm to make relative paths absolute, which algorithm would we use? As you've discovered, simple concatenation doesn't work, since you keep finding another relative path that messes you up.

The second reason is that XHTML+RDFa is a layer on top of XHTML.
So what we're doing is giving a semantic *interpretation* of the underlying XHTML. To make this useful, we should really be generating the same triples for the same semantics. And if I say that the resource:

  <http://rdfa.digitalbazaar.com/live/sneaking_sally.mp3>

is 5 minutes long, then the manner I use to express that at the XHTML level shouldn't affect the semantics that are generated.

(As an aside, when parsing in HTML browsers, if you request the value of @href using getAttribute(), some browsers will give you the full, absolutised path, relative to the 'base' of the document and others will give you the original value put in there by the author, which could contain dot segments. So in those parsers you have to normalise, otherwise you won't achieve browser consistency.)

> If it's not a MUST, then we find ourselves in a position where the
> application/inference engine MUST normalize the URLs coming in from the
> RDFa parser.

It's not really 'normalising', it's using the proper algorithm to turn a relative path into an absolute one. That algorithm takes care of '.', '..', and all sorts of other things.

Anyway, we have it in the spec, but you are right that we should perhaps consider making the wording both clearer and stronger.

> Take this CURIE as an example:
>
> <span xmlns:ex="http://example.org/2008-10-24/docs/api/"
> about="[ex:../ref/a.html]">...</span>
>
> a bit contrived, but would you say that the parser should output this URI:
>
> http://example.org/2008-10-24/docs/api/../ref/a.html
>
> or this one:
>
> http://example.org/2008-10-24/docs/ref/a.html

The latter.

Section 5.4.2, "Converting a CURIE to a URI" describes the following algorithm:

  Since a CURIE is merely a means for abbreviating a URI, its value is a URI, rather than the abbreviated form. Obtaining a URI from a CURIE involves the following steps:

  1. Split the CURIE at the colon to obtain the prefix and the resource.
  2. Using the prefix and the current in-scope mappings, obtain the URI that the prefix maps to.
  3. Concatenate the mapped URI with the resource value, to obtain an absolute URI.

After that description you'll see that there is a blue box that refers back to the earlier point about what it means to create absolute URIs from relative ones:

  Note that it is generally considered a good idea not to use relative paths in namespace declarations, but since it is possible that an author may ignore this guidance, it is further possible that the URI obtained from a CURIE is relative. However, since all URIs must be resolved relative to [base] before being used to create triples, the use of relative paths should not have any effect on processing.

Now this doesn't quite deal with the example you gave; I was more dealing with this:

  <span xmlns:ex="/2008-10-24/docs/api/" about="[ex:../ref/a.html]">...</span>

which when concatenated still only gives a relative path:

  /2008-10-24/docs/api/../ref/a.html

The point that I was trying to stress when I wrote this was that this would still be ok, provided that you always use the algorithm in [1], and that algorithm would also take care of your example.

However, I agree again that it wouldn't hurt to make this point more forcefully, but again, I think this is just about stress in the prose, rather than a fundamental issue.

> If our argument is that CURIEs are simple concatenations, at what point
> in the process is the "strange URL" converted into the "normalized URL"?

I do my normalisation in the parser, before passing the results to the store.

> If we do think it should be the parser that normalizes URLs, we don't
> have such a statement in the RDFa Syntax document, do we?

I think we do, as described above, re the note in 5.4.2

Regards,

Mark

[1] <http://gbiv.com/protocols/uri/rfc/rfc3986.html>

--
Mark Birbeck, webBackplane

mark.birbeck@webBackplane.com

http://webBackplane.com/mark-birbeck

webBackplane is a trading name of Backplane Ltd.
(company number 05972288, registered office: 2nd Floor, 69/85 Tabernacle Street, London, EC2A 4RR)
Received on Tuesday, 29 July 2008 21:13:45 UTC
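
As a rough illustration of the CURIE handling discussed in the message above, the following Python sketch follows the three steps quoted from section 5.4.2 and then resolves the result against the base. It is a sketch under stated assumptions, not the spec's reference code or any parser's actual implementation: the function names `curie_to_uri` and `remove_dot_segments`, the prefix table and the base URI are all hypothetical. Dot segments are removed explicitly, following section 5.2.4 of RFC 3986, because `urllib.parse.urljoin` returns a reference that already carries its own host without touching its path.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Remove '.' and '..' segments from a path (after RFC 3986, section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x", dropping the previous segment
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            cut = path.find("/", 1)
            if cut == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:cut])
                path = path[cut:]
    return "".join(output)


def curie_to_uri(curie: str, prefixes: dict, base: str) -> str:
    """Expand a CURIE (without the surrounding brackets) to an absolute URI."""
    prefix, _, resource = curie.partition(":")   # step 1: split at the colon
    mapped = prefixes[prefix]                    # step 2: look up the in-scope mapping
    concatenated = mapped + resource             # step 3: concatenate
    absolute = urljoin(base, concatenated)       # resolve against the base URI
    parts = urlsplit(absolute)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


if __name__ == "__main__":
    base = "http://example.org/index.html"       # hypothetical document base
    for ns in ("http://example.org/2008-10-24/docs/api/", "/2008-10-24/docs/api/"):
        print(curie_to_uri("ex:../ref/a.html", {"ex": ns}, base))
```

Under these assumptions, both the absolute and the relative namespace declarations yield `http://example.org/2008-10-24/docs/ref/a.html`.
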