Re: Raptor 1.4.14 GRDDL 2006-10-24 implementation report from Dave Beckett on 2007-02-10 (public-grddl-comments@w3.org from January to March 2007)

From: Dave Beckett <dave@dajobe.org>
Date: Sat, 10 Feb 2007 00:43:12 -0800
To: Dan Connolly <connolly@w3.org>
CC: public-grddl-comments@w3.org
Message-ID: <45CD85A0.5000907@dajobe.org>
Dan Connolly wrote:
> On Tue, 2007-02-06 at 23:42 -0800, Dave Beckett wrote:
>> Hi GRDDLers
...
>> Raptor implements (mostly)
>> http://www.w3.org/TR/2006/WD-grddl-20061024/
>> W3C Working Draft 24 October 2006
>>
> 
>> 3. GRDDL for XML Namespaces Documents
...
>> I hard-coded not traversing the following commonly-seen namespace
>> URIs which have no GRDDL right now, so it's wasted retrievals:
>>   http://www.w3.org/1999/xhtml
>>   http://www.w3.org/1999/02/22-rdf-syntax-ns#
>>   http://www.w3.org/2001/XMLSchema
>>
>> It might be questionable whether I should have included the RDF
>> namespace, but I know right now it has no GRDDL.  issue-mt-ns
>> mentions this.
>>
>> Is it legitimate to exclude some namespaces forever?
> 
> As long as they never change, yes. ;-)
> 
> The more relevant issue is issue-html-nsdoc: what caching policy should
> GRDDL aware agents adopt for the XHTML namespace document?
> 
> We resolved to give advice rather than specify a policy:
> 
> [[
> Some namespace documents, such as the XHTML namespace document
> http://www.w3.org/1999/xhtml have very many references to them. If GRDDL
> aware agents were to retrieve these documents every time they processed
> a document referring to them, the origin servers of those documents
> could become overloaded. GRDDL aware agents therefore should not
> retrieve such documents on every reference and should retain some cache
> or local memory of the transformations those documents indicate should
> be applied. To avoid misrepresentation of published information, GRDDL
> aware agents should ensure that this local memory is up to date and
> should support user options to configure or disable the cache. See also
> section 3.1. Using a URI to Access a Resource of [WEBARCH].
> ]]

I would also cache failures, as well as successful retrievals and
successful transformations.
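
A minimal sketch of what caching failures alongside successes could
look like, written in Python for illustration. It is not Raptor's code:
the RetrievalCache class, the fetch callable and the one-hour expiry
are all assumptions; only the three skipped namespace URIs come from
this thread.

```python
import time

# Namespace URIs this thread says are not traversed (no GRDDL right now).
SKIPPED_NAMESPACES = {
    "http://www.w3.org/1999/xhtml",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2001/XMLSchema",
}


class RetrievalCache:
    """Remembers both successful and failed retrievals for ttl seconds."""

    def __init__(self, fetch, ttl=3600.0, clock=time.monotonic):
        self._fetch = fetch   # hypothetical callable: uri -> bytes, raises OSError
        self._ttl = ttl       # assumed expiry; the draft does not specify one
        self._clock = clock
        self._entries = {}    # uri -> (expires_at, ok, body_or_error)

    def get(self, uri):
        """Return the body, or None for a skipped or failed URI."""
        if uri in SKIPPED_NAMESPACES:
            return None
        entry = self._entries.get(uri)
        if entry is None or entry[0] <= self._clock():
            try:
                entry = (self._clock() + self._ttl, True, self._fetch(uri))
            except OSError as err:
                # Cache the failure too, so a dead URI is not retried each time.
                entry = (self._clock() + self._ttl, False, err)
            self._entries[uri] = entry
        return entry[2] if entry[1] else None
```

Passing ttl=0 makes every entry expire immediately, which is one way to
honour the "disable the cache" user option the draft text asks for.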



> 
>>    4. The GRDDL profile for XHTML
>>
>> This was implemented by earlier Raptor versions.
>>
>> It might be worth repeating that the <head profile="..."> is a
>> space-separated list of URIs and to look for the GRDDL profile you
>> need to find it in that list.
> 
> Yes; the current draft is:
> 
> [[
> Given an XPath root node R of an XHTML document, for each
> space-separated token REF in the value of the profile attribute of the
> head element E, the absolute form of REF with respect to the base URI of
> E is a profile of R
> ]]

Good, that works.
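
For illustration, the rule quoted above can be sketched in a few lines
of Python. The function names are hypothetical; the only fixed value is
the GRDDL profile URI itself.

```python
from urllib.parse import urljoin

GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def profile_uris(profile_attr, base_uri):
    """Split the profile attribute on whitespace and absolutize each token."""
    # str.split() with no argument splits on any run of whitespace,
    # including the newline used when the attribute is wrapped.
    return [urljoin(base_uri, token) for token in profile_attr.split()]


def has_grddl_profile(profile_attr, base_uri):
    """True if the GRDDL profile is one of the listed profiles."""
    return GRDDL_PROFILE in profile_uris(profile_attr, base_uri)
```

Comparing the whole attribute value against the GRDDL profile URI would
miss documents that list several profiles, which is the case the
space-separated wording is meant to cover.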

>>    5. GRDDL for HTML Profiles
>>
>> One issue I found causing problems was whether to traverse the GRDDL
>> profile URI itself, http://www.w3.org/2003/g/data-view
> 
> Oops; good question. The current editor's draft has a
> "@@this section needs work" note in the part of the spec
> that is a copy of that document.
> 
>> In the end I had to exclude it because the GRDDL profile document
>> http://www.w3.org/2003/g/data-view contains an erdf profile, which
>> refers to the GRDDL profile, so when you follow the natural
>> evaluation order, you end up in a loop, or if like me, you were
>> checking for urls already visited, the process terminated without
>> having generated any triples at all.
>>
>> i.e. http://www.w3.org/2003/g/data-view contains:
>>   <html xmlns="http://www.w3.org/1999/xhtml">
>>   <head profile="http://www.w3.org/2003/g/data-view
>>                  http://purl.org/NET/erdf/profile">
>> ...
>>
>> and http://purl.org/NET/erdf/profile =>
>> http://research.talis.com/2005/erdf/profile contains:
>>    <head profile="http://www.w3.org/2003/g/data-view">
>>
>> So for me, GRDDLing through the GRDDL profile URI does not work.
>> Calling it direct - as the first URI - does work
>> $ rapper -i grddl -c http://www.w3.org/2003/g/data-view
>> rapper: Parsing URI http://www.w3.org/2003/g/data-view
>> rapper: Parsing returned 40 triples

I hope this is clear; let me expand on the rough process
I use:

GRDDL(URI D)
1. Retrieve URI D
2. Mark URI D as visited
3. GRDDL root-element namespace URI
4. Discover the content contains a <head profile> with GRDDL profile URI
5. For each URI in the profile, do a recursive GRDDL():
   5.1 Mark the GRDDL profile URI as visited
   5.2 Retrieve the GRDDL profile URI
   5.3 GRDDL root-element namespace URI
   5.4 Discover the content contains a <head profile> with
       URI http://purl.org/NET/erdf/profile
   5.5 For each URI in the profile, do a recursive GRDDL():
       5.5.1 Mark the URI http://purl.org/NET/erdf/profile as visited
       5.5.2 Retrieve the URI http://purl.org/NET/erdf/profile
       5.5.3 GRDDL the root-element namespace URI
       5.5.4 Discover the content contains a <head profile> with
             the GRDDL profile URI
       5.5.5 Already visited that so do not recurse**
       5.5.6 GRDDL in-document transformations
   5.6 GRDDL in-document transformations
6. GRDDL in-document transformations

**If you didn't do this, or marked visited flags after
recursing, you would end up in a loop.

Since the GRDDL profile document currently references the eRDF profile,
and the sequence above found no transforms, no eRDF is found inside the
GRDDL profile document.  So you get fewer triples than you might expect.

I think more than just declaring the GRDDL rules is needed:
the spec needs to say something about processing order, as
well as how the GRDDL profile URI and document are handled.

...
>>
>> Order of Operation
>>
>> Apart from getting the recursion mechanism implemented, it was tricky
>> to figure out what *order* to do some operations.
> 
> I just added a whole new appendix with a protocol trace to
> give a typical lookup order.
> http://www.w3.org/2004/01/rdxh/spec#extrace

This trace gives an example of an XSL result with:
<xsl:output method="xml" indent="yes" />

that turns out to be actually application/rdf+xml

This is an example of guessing: Raptor *can* handle
other XML formats, so even if there were a MIME type
argument on xsl:output, it assumes that method="xml" and
application/xml both mean application/rdf+xml.

...
>> Namespace/Profile transformation Triples
>>
>> Do the triples mentioned in the formal descriptions get included into
>> the "GRDDL result of ?x" being calculated?    In Raptor they do.
> 
> I'm not sure I understand the question. Do you have an example
> handy?

The triples like:
    * subject is NSDOC
    * predicate is the property
      <http://www.w3.org/2003/g/data-view#namespaceTransformation>
    * object is TX

and ditto for triples with grddl:profileTransformation predicates.

At least if I read http://www.w3.org/2004/01/rdxh/spec right,
these triples are in the GRDDL result.
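
As a concrete example of the triples in question, here is a small
Python sketch that writes one out as N-Triples. The ntriple helper and
the example.org URIs are made up for illustration; the two predicate
URIs are the GRDDL ones. It assumes the URIs need no escaping.

```python
NAMESPACE_TRANSFORMATION = (
    "http://www.w3.org/2003/g/data-view#namespaceTransformation"
)
PROFILE_TRANSFORMATION = (
    "http://www.w3.org/2003/g/data-view#profileTransformation"
)


def ntriple(subject, predicate, obj):
    """Format a triple of three URIs as one N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical namespace document NSDOC and transformation TX.
NSDOC = "http://example.org/ns"
TX = "http://example.org/ns2rdf.xsl"

print(ntriple(NSDOC, NAMESPACE_TRANSFORMATION, TX))
```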


>> Base URIs
>>
>> Several of the XSLT sheets used in the tests assume that there is an
>> XSLT parameter called 'base' or 'Base' set to the base URI of the
>> document.  Otherwise tests fail:
>> These are the ones with the assumption:
>>    http://www.w3.org/2000/07/uri43/uri.xsl
>>    http://www.w3.org/2000/08/w3c-synd/home2rss.xsl
> 
> ew.
> @@TODO: take a closer look.

I found this by GRDDLing http://www.w3.org/
It points at
<head profile="http://www.w3.org/2000/08/w3c-synd/#">...

Although this has no mention of GRDDL and so strictly GRDDL
operations should not be done, the profile URI document has:
 <link rel="transformation"
href="http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grddlProfileTransformation.xsl"/>
and
<a rel="profileTransformation" href="home2rss.xsl">profiletransformation</a>

Both of these import
  <xsl:import href="http://www.w3.org/2000/07/uri43/uri.xsl"/>
which needs the Base parameter when used.

A direct example:
$ xsltproc http://www.w3.org/2000/08/w3c-synd/home2rss.xsl http://www.w3.org/
emits a big pile of messages to stdout from xsl:message,
whereas you get silence with:
$ xsltproc --param Base '"http://www.w3.org/"' \
    http://www.w3.org/2000/08/w3c-synd/home2rss.xsl http://www.w3.org/
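
For anyone driving xsltproc from a script, a rough Python sketch of
building the same command line follows. The xsltproc_argv function is
hypothetical; the point is only that --param takes an XPath expression,
so the URI has to carry its own quotes. It assumes the base URI
contains no double-quote character.

```python
def xsltproc_argv(stylesheet, document, base=None):
    """Build an xsltproc argument list, optionally passing a Base param."""
    argv = ["xsltproc"]
    if base is not None:
        # --param values are XPath expressions, so a string literal
        # needs quotes inside the argument itself.
        argv += ["--param", "Base", f'"{base}"']
    argv += [stylesheet, document]
    return argv


argv = xsltproc_argv(
    "http://www.w3.org/2000/08/w3c-synd/home2rss.xsl",
    "http://www.w3.org/",
    base="http://www.w3.org/",
)
print(argv)
```

The resulting list can be handed to subprocess.run() as is; no shell
quoting is needed because no shell is involved.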

> 
> 
>> I saw #base-param is still under discussion in
>>   http://lists.w3.org/Archives/Public/public-grddl-wg/2007Jan/0059.html
>> but I really don't understand that proposal.
> 
> We closed that issue without changing the spec much; I do touch
> on it briefly in the new protocol trace appendix. And we have
> a test case in progress (hmm... did we approve that one? I'm not sure).
> http://www.w3.org/2001/sw/grddl-wg/td/testlist1#base-param

If a 'Base' or 'base' param is not required when invoking XSLT, then,
if the above XSLTs are legitimate examples, they need to be fixed,
although having such a param seems to be helpful.

...
>> GRDDL Tests
>>
>> http://www.w3.org/2001/sw/grddl-wg/
> ...

As discussed on IRC, I was out of date with the tests,
so I have re-run them against Raptor Subversion and all of them pass.

$ (PYTHONPATH=$HOME/lib/python2.4/site-packages python testft.py --run \
    '/home/dajobe/dev/redland/raptor/utils/rapper -i grddl -q -o rdfxml' \
    testlist1.rdf > raptor_earl.rdf ) 2>&1 | grep -v raptor_
All tests were passed!
$

The atomttl test now passes because the heuristics I've added guess
that the content might be parseable as Turtle, despite its MIME type of
'text/rdf+n3', which is N3's.


I attach the updated EARL output.  It still generates file:
URIs for the tests, since that's the way the '--run' is invoked above.

Dave
Attachments

application/rdf+xml attachment: raptor_earl.rdf
Received on Saturday, 10 February 2007 08:43:29 UTC