Re: RDFa test suite addition from Mark Birbeck on 2008-07-29 (public-rdf-in-xhtml-tf@w3.org from July 2008)

From: Mark Birbeck <mark.birbeck@webbackplane.com>
Date: Tue, 29 Jul 2008 22:13:07 +0100
To: "Manu Sporny" <msporny@digitalbazaar.com>
Cc: "RDFa mailing list" <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <ed77aa9f0807291413i609e5d37ra09c977a072af6c5@mail.gmail.com>

Hi Manu,

> Right, I didn't mean to imply that 'appending' will work in all cases
> (even though I'm not convinced that the statement is not true).

You began this thread by saying that the bug in librdfa was that in
some circumstances the relative part was incorrectly being appended to
the document, rather than the host name. So haven't you proved it
yourself, that appending doesn't work in all cases? :)

What's happened is that you've now found two possible ways to append
(to the document part or to the hostname), but I'm afraid that the
algorithm for converting a relative URI to an absolute one involves
yet further possibilities.

For example, the relative path:

  sneaking_sally.mp3

should be appended to the end of the *path* part, replacing the
document. And so on.

So the point is to use the 'proper' algorithm for turning a relative
path into an absolute one, and you will always be ok, no matter what
the URI is that you are dealing with (relative or not).
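
To make the difference concrete, here is a minimal Python sketch that compares the two 'appending' strategies with the reference resolution in the standard library's urllib.parse.urljoin. The base document URL and the two helper functions are assumptions made up for illustration; the thread does not give the exact page address.

```python
from urllib.parse import urljoin

# Hypothetical base document; the thread does not state the real page URL.
BASE = "http://rdfa.digitalbazaar.com/fuzzbot/demo/index.html"


def append_to_document(base: str, ref: str) -> str:
    """Naive strategy 1: glue the reference onto the whole document URI."""
    return base + ref


def append_to_host(base: str, ref: str) -> str:
    """Naive strategy 2: glue the reference onto the host name."""
    scheme, rest = base.split("://", 1)
    host = rest.split("/", 1)[0]
    return f"{scheme}://{host}/{ref.lstrip('/')}"


references = [
    "sneaking_sally.mp3",             # replaces the document in the path
    "/live/sneaking_sally.mp3",       # replaces the whole path
    "../../live/sneaking_sally.mp3",  # climbs two levels first
]

# Reference resolution handles every form with one call.
resolved = {ref: urljoin(BASE, ref) for ref in references}

for ref in references:
    print(ref)
    print("  append to document:", append_to_document(BASE, ref))
    print("  append to host:    ", append_to_host(BASE, ref))
    print("  resolved:          ", resolved[ref])
```

Each naive strategy happens to be right for some references and wrong for others, which is why neither can be relied on.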

The big question then, is whether the spec actually says to do this.


> What you have said has got me wondering about what is correct,
> acceptable and incorrect, however.

You had me wondering, too. I recalled investigating this quite a
long time ago, and was starting to panic that I hadn't actually
incorporated what I learned from my analysis into the spec.

But thankfully I did:

  5.4. CURIE and URI Processing

  Since RDFa is ultimately a means for transporting RDF, then a key
  concept is the resource and its manifestation as a URI. Since RDF
  deals with complete URIs (not relative paths), then when converting
  RDFa to triples, any relative URIs will need to be resolved relative
  to the base URI, using the algorithm defined in section 5 of RFC
  3986 [URI], Reference Resolution.

It certainly sounds like this point could do with being made more
prominent, but hopefully you'll agree that such changes would merely
be editorial, and that the spec itself is correct.

(See below for a further mention in the spec of this issue, but in the
context of CURIEs.)


>>   <http://rdfa.digitalbazaar.com/fuzzbot/demo/../../live/sneaking_sally.mp3>
>
> I realize that the URL above is not optimal, but is it "wrong"? RFC-1738
> says that the URL is valid (if I'm reading the RFC correctly):
>
> ftp://ftp.isi.edu/in-notes/rfc1738.txt

First, note that [1] updates RFC 1738.

Second, you're right that the URI is not 'wrong'. But the only way to
obtain such a URI would be to enter it exactly as you have shown it.
I.e., you can't create such a URI by beginning with a relative path
and making it absolute, since the only way to do that is according
to section 5 of [1], and that algorithm clearly shows how the dot
segments would be removed.
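
As an illustration, the dot-segment step of that algorithm (section 5.2.4 of [1]) can be sketched in a few lines of Python. This is a hand-written reading of the RFC's pseudocode rather than code taken from any RDFa parser; the function name simply mirrors the name the RFC uses.

```python
def remove_dot_segments(path: str) -> str:
    """Sketch of RFC 3986 section 5.2.4: drop '.' and '..' segments."""
    output = []  # completed segments, each with its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]                  # rule A
        elif path.startswith("./"):
            path = path[2:]                  # rule A
        elif path.startswith("/./"):
            path = path[2:]                  # rule B: "/./" becomes "/"
        elif path == "/.":
            path = "/"                       # rule B
        elif path.startswith("/../"):
            path = path[3:]                  # rule C: "/../" becomes "/"
            if output:
                output.pop()                 # ...and the last segment goes
        elif path == "/..":
            path = "/"                       # rule C
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""                        # rule D
        else:
            # Rule E: move the first segment (with its leading "/") across.
            next_slash = path.find("/", 1)
            if next_slash == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:next_slash])
                path = path[next_slash:]
    return "".join(output)


cleaned = remove_dot_segments("/fuzzbot/demo/../../live/sneaking_sally.mp3")
print(cleaned)
```

Applied to the path in your example, the two '..' segments cancel 'demo' and 'fuzzbot', leaving only the '/live/...' path.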

But also, if you query your triple store for everything it knows
about this:

  <http://rdfa.digitalbazaar.com/live/sneaking_sally.mp3>

will you also get back information about:

  <http://rdfa.digitalbazaar.com/fuzzbot/demo/../../live/sneaking_sally.mp3>

If you do, then that's great...but I'd also be really surprised; I
would imagine that once the URI is in the store, it's treated pretty
much like a string.
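
A toy example of the concern, assuming a store that compares subjects as plain strings (a real triple store may behave differently). The predicate URI, the base and the about() helper are all hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical predicate and base URI, used only for illustration.
DURATION = "http://example.org/vocab#duration"
BASE = "http://rdfa.digitalbazaar.com/fuzzbot/demo/"

CLEAN = "http://rdfa.digitalbazaar.com/live/sneaking_sally.mp3"
DOTTED = "http://rdfa.digitalbazaar.com/fuzzbot/demo/../../live/sneaking_sally.mp3"


def about(store, subject):
    """Return every triple whose subject is exactly this string."""
    return [triple for triple in store if triple[0] == subject]


# A parser that appended instead of resolving stores the dotted form,
# so a query for the clean URI finds nothing.
appended_store = {(DOTTED, DURATION, "5 minutes")}
missed = about(appended_store, CLEAN)

# Resolving the relative path before storing makes the same query succeed.
subject = urljoin(BASE, "../../live/sneaking_sally.mp3")
resolved_store = {(subject, DURATION, "5 minutes")}
found = about(resolved_store, CLEAN)

print(len(missed), len(found))
```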


> Is it the RDFa parser's job to normalize URLs? I can certainly see the
> argument for why it should tidy up URLs, but I don't think this is a MUST.

I think it should, for two reasons, one concerning RDFa in general,
and the other relating to its particular manifestation as XHTML+RDFa.

The first reason is that RDF deals with absolute URIs. So any relative
paths have to be made absolute somehow, when creating triples. RFC
3986 [1] has a simple algorithm for doing this, which also has the
effect of removing dot segments.

So if we were not to use that algorithm to make relative paths
absolute, which algorithm would we use? As you've discovered, simple
concatenation doesn't work, since you keep finding another relative
path that messes you up.

The second reason is that XHTML+RDFa is a layer on top of XHTML. So
what we're doing is giving a semantic *interpretation* of the
underlying XHTML. To make this useful, we should really be generating
the same triples for the same semantics. And if I say that the
resource:

  <http://rdfa.digitalbazaar.com/live/sneaking_sally.mp3>

is 5 minutes long, then the manner I use to express that at the XHTML
level shouldn't affect the semantics that are generated.

(As an aside, when parsing in HTML browsers, if you request the value
of @href using getAttribute(), some browsers will give you the full,
absolutised path, relative to the 'base' of the document and others
will give you the original value put in there by the author, which
could contain dot segments. So in those parsers you have to normalise,
otherwise you won't achieve browser consistency.)


> If it's not a MUST, then we find ourselves in a position where the
> application/inference engine MUST normalize the URLs coming in from the
> RDFa parser.

It's not really 'normalising', it's using the proper algorithm to turn
a relative path into an absolute one. That algorithm takes care of
'.', '..', and all sorts of other things.
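
For a flavour of those 'other things', this short Python sketch runs a few less obvious reference forms through urllib.parse.urljoin. The base URI is invented for the example.

```python
from urllib.parse import urljoin

BASE = "http://example.org/a/b/doc.html"  # hypothetical base URI

cases = {
    "./c": "http://example.org/a/b/c",                  # '.' segment
    "..": "http://example.org/a/",                      # bare '..'
    "?x=1": "http://example.org/a/b/doc.html?x=1",      # query only
    "#frag": "http://example.org/a/b/doc.html#frag",    # fragment only
    "//other.example/x": "http://other.example/x",      # network-path reference
}

results = {reference: urljoin(BASE, reference) for reference in cases}

for reference, expected in cases.items():
    print(f"{reference!r:22} -> {results[reference]}")
```

None of these forms is handled correctly by appending the reference to either the document or the host name.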

Anyway, we have it in the spec, but you are right that we should
perhaps consider making the wording both clearer and stronger.


> Take this CURIE as an example:
>
> <span xmlns:ex="http://example.org/2008-10-24/docs/api/"
>      about="[ex:../ref/a.html]">...</span>
>
> a bit contrived, but would you say that the parser should output this URI:
>
> http://example.org/2008-10-24/docs/api/../ref/a.html
>
> or this one:
>
> http://example.org/2008-10-24/docs/ref/a.html

The latter.

Section 5.4.2, "Converting a CURIE to a URI" describes the following algorithm:

  Since a CURIE is merely a means for abbreviating a URI, its value is
  a URI, rather than the abbreviated form. Obtaining a URI from a
  CURIE involves the following steps:

  1. Split the CURIE at the colon to obtain the prefix and the resource.
  2. Using the prefix and the current in-scope mappings, obtain the URI
     that the prefix maps to.
  3. Concatenate the mapped URI with the resource value, to obtain an
     absolute URI.

After that description you'll see that there is a blue box that refers
back to the earlier point about what it means to create absolute URIs
from relative ones:

  Note that it is generally considered a good idea not to use relative
  paths in namespace declarations, but since it is possible that an
  author may ignore this guidance, it is further possible that the URI
  obtained from a CURIE is relative. However, since all URIs must be
  resolved relative to [base] before being used to create triples, the
  use of relative paths should not have any effect on processing.

Now this doesn't quite deal with the example you gave; I had this
case more in mind:

  <span xmlns:ex="/2008-10-24/docs/api/"
   about="[ex:../ref/a.html]">...</span>

which when concatenated still only gives a relative path:

  /2008-10-24/docs/api/../ref/a.html

The point that I was trying to stress when I wrote this was that this
would still be ok, provided that you always use the algorithm in [1],
and that algorithm would also take care of your example.
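
Here is a rough Python sketch of the two stages together: the three CURIE steps from 5.4.2 followed by resolution against the base. The function names and the base URI are assumptions for illustration only. One caveat is that Python's urljoin may hand back an already-absolute reference unchanged, so the sketch re-resolves the path of such a reference against its own root in order to get the dot-segment removal that section 5 of [1] calls for.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE = "http://example.org/index.html"  # hypothetical document base


def curie_to_reference(curie: str, mappings: dict) -> str:
    """Steps 1-3 of 5.4.2: split at the colon, look up the prefix, concatenate."""
    prefix, resource = curie.strip("[]").split(":", 1)
    return mappings[prefix] + resource


def resolve(base: str, reference: str) -> str:
    """Turn the concatenated value into an absolute URI without dot segments."""
    parts = urlsplit(reference)
    if parts.scheme and parts.netloc:
        # Already absolute: resolve only its path against its own root so
        # that '.' and '..' segments are still removed.
        root = f"{parts.scheme}://{parts.netloc}/"
        path_only = urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))
        return urljoin(root, path_only)
    return urljoin(base, reference)


# The relative namespace case.
relative = curie_to_reference("[ex:../ref/a.html]", {"ex": "/2008-10-24/docs/api/"})
# Your example, with an absolute namespace.
absolute = curie_to_reference(
    "[ex:../ref/a.html]", {"ex": "http://example.org/2008-10-24/docs/api/"}
)

print(relative, "->", resolve(BASE, relative))
print(absolute, "->", resolve(BASE, absolute))
```

With this particular base, both cases end up at the same URI.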

However, I agree that it wouldn't hurt to make this point more
forcefully, though again I think this is just about emphasis in the
prose, rather than a fundamental issue.


> If our argument is that CURIEs are simple concatenations, at what point
> in the process is the "strange URL" converted into the "normalized URL"?

I do my normalisation in the parser, before passing the results to the store.


> If we do think it should be the parser that normalizes URLs, we don't
> have such a statement in the RDFa Syntax document, do we?

I think we do, as described above; see the note in 5.4.2.

Regards,

Mark

[1] <http://gbiv.com/protocols/uri/rfc/rfc3986.html>

-- 
Mark Birbeck, webBackplane

mark.birbeck@webBackplane.com

http://webBackplane.com/mark-birbeck

webBackplane is a trading name of Backplane Ltd. (company number
05972288, registered office: 2nd Floor, 69/85 Tabernacle Street,
London, EC2A 4RR)
Received on Tuesday, 29 July 2008 21:13:45 UTC