Raptor 1.4.14 GRDDL 2006-10-24 implementation report

Hi GRDDLers

Raptor ( http://librdf.org/raptor/ ) has had simple GRDDL support for
some time, but it never recursed through profile or namespace URIs to
look for profiles and triples indirectly.  In version 1.4.14 I finally
added those features, along with some features for managing web URI
retrieval.  Thus Raptor is closer - I won't call it complete - to
implementing GRDDL as specified.

Raptor implements (mostly)
http://www.w3.org/TR/2006/WD-grddl-20061024/
W3C Working Draft 24 October 2006

(I ignored any WG documents in progress)


I am using the GRDDL WD sections as a guide to remind me of my comments:

1. Introduction

2. Adding GRDDL to well-formed XML

Raptor has implemented xmlns:data-view and data-view:transformation
on the root element for some time.

The current WD mentions allowing non-XML results here, such as Turtle,
under issue-output-formats.

This was tricky to deal with in XSLT and in the tests because
the XSLT environment does not always return an output mime type,
so I had to make Raptor do more guessing of the content
in order to determine which parser to use, defaulting to
RDF/XML if the information doesn't indicate otherwise.
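
To make the guessing step concrete, here is a rough Python sketch.  It
is not Raptor's C code: the function name guess_parser, the mime type
table and the content sniffing rule are all assumptions made up for
illustration.  The only behaviour taken from the text above is the
fallback to RDF/XML when nothing indicates otherwise.

```python
# Hypothetical sketch: choose a parser name for a transformation result.
# The table and sniffing rule are assumptions, not Raptor's real logic.

MIME_TO_PARSER = {
    "application/rdf+xml": "rdfxml",
    "text/turtle": "turtle",
    "application/x-turtle": "turtle",
}


def guess_parser(mime_type, content):
    """Return a parser name, falling back to RDF/XML when nothing fits."""
    if mime_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        base_type = mime_type.split(";", 1)[0].strip().lower()
        if base_type in MIME_TO_PARSER:
            return MIME_TO_PARSER[base_type]
    # No usable mime type: look at the start of the content instead.
    head = content.lstrip()[:256]
    if head.startswith("@prefix") or head.startswith("@base"):
        return "turtle"
    # Default when the information does not indicate otherwise.
    return "rdfxml"
```

A real implementation would need more heuristics than this; the point
is only the ordering: declared type first, then content, then default.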


3. GRDDL for XML Namespaces Documents

This is newly implemented by me in Raptor 1.4.14.

The two methods are:
   1) "  if an information resource ?D  has an XML representation
   whose root element has a namespace name ?NS then any GRDDL result
   of the resource identified by ?NS  is a GRDDL result of ?D"

OK.  But is it not really saying that :
   any GRDDL result of the resource identified by ?NS
   is *included* in the GRDDL result of ?D
i.e. in a set-of-triples inclusion sense.

   2) " if an information resource ?D has an XML representation whose
   root element has a namespace name ?NSDOC** and ?D has a GRDDL
   result that includes, for any ?TX, the RDF triple { ?NSDOC
   <http://www.w3.org/2003/g/data-view#namespaceTransformation> ?TX }
   then ?TX is also a transformation of ?D"

So 2) builds on 1) since "?D has a GRDDL result" is what 1) defines.

Does that not imply 2) needs to be done after 1) ?


There is also some new terminology that's introduced:
  - GRDDL result of the resource identified by ?NS
  - a resource identified by ?NS ... is a GRDDL result of ?D
  - [a resource?] ?TX is .. a transformation of [a resource] ?D

"GRDDL result" is special enough to deserve a definition, and
"a transformation of" is something that could be expanded a bit.


I hard-coded a list of commonly seen namespace URIs that Raptor does
not traverse.  They have no GRDDL right now, so retrieving them is
wasted effort:
  http://www.w3.org/1999/xhtml
  http://www.w3.org/1999/02/22-rdf-syntax-ns#
  http://www.w3.org/2001/XMLSchema

It might be questionable whether I should have included the RDF
namespace, but I know right now it has no GRDDL.  issue-mt-ns
mentions this.

Is it legitimate to exclude some namespaces forever?
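
As an illustration of the skip list, a minimal Python sketch follows.
The three URIs are the ones listed above; the names SKIPPED_NAMESPACES
and should_traverse_namespace are hypothetical and do not correspond
to anything in the Raptor source.

```python
# Namespace URIs copied from the list in this report.
# The constant and function names are hypothetical.
SKIPPED_NAMESPACES = frozenset([
    "http://www.w3.org/1999/xhtml",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2001/XMLSchema",
])


def should_traverse_namespace(ns_uri):
    """True if the root element namespace is worth retrieving for GRDDL."""
    # An empty or missing namespace gives nothing to retrieve.
    return bool(ns_uri) and ns_uri not in SKIPPED_NAMESPACES
```

Keeping the list as data rather than scattered tests would at least
make it easy to drop an entry if one of these namespaces gains GRDDL.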


4. The GRDDL profile for XHTML

This was implemented by earlier Raptor versions.

It might be worth repeating that the <head profile="..."> is a
space-separated list of URIs and to look for the GRDDL profile you
need to find it in that list.
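
A small Python sketch of that check, assuming the attribute value has
already been extracted from the document.  The function name is
hypothetical.

```python
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def has_grddl_profile(profile_attr):
    """Check a <head profile="..."> value for the GRDDL profile URI."""
    # str.split() with no argument splits on any run of whitespace,
    # including the newlines found in wrapped attribute values.
    return GRDDL_PROFILE in (profile_attr or "").split()
```

Splitting first, rather than doing a substring search, avoids matching
a longer URI that merely starts with the GRDDL profile URI.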


5. GRDDL for HTML Profiles

One issue I found causing problems was whether to traverse the GRDDL
profile URI itself, http://www.w3.org/2003/g/data-view

In the end I had to exclude it because the GRDDL profile document
http://www.w3.org/2003/g/data-view contains an eRDF profile, which
refers back to the GRDDL profile.  If you follow the natural
evaluation order you end up in a loop, or if, like me, you check for
URLs already visited, the process terminates without having generated
any triples at all.

i.e. http://www.w3.org/2003/g/data-view contains:
  <html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view
                 http://purl.org/NET/erdf/profile">
...

and http://purl.org/NET/erdf/profile =>
http://research.talis.com/2005/erdf/profile contains:
   <head profile="http://www.w3.org/2003/g/data-view">

So for me, GRDDLing through the GRDDL profile URI does not work.
Calling it directly - as the first URI - does work:
$ rapper -i grddl -c http://www.w3.org/2003/g/data-view
rapper: Parsing URI http://www.w3.org/2003/g/data-view
rapper: Parsing returned 40 triples


6. GRDDL Transformations

Raptor will do XSLT 1.0 only for the foreseeable future since it
depends on libxslt.


7. Security Considerations

This section should go beyond just XSLT issues and discuss
- how GRDDL can cause the retrieval of many URIs
- consideration of the rate of retrieval
- what to do when you see the same URI twice
- maybe suggest caching documents
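
To show the kind of guard I have in mind, here is a hedged Python
sketch covering two of the points: a cap on the number of retrievals
and a cache so the same URI is never fetched twice.  The class name,
the fetch callable and the default limit of 20 are all assumptions;
rate limiting is left out.

```python
class RetrievalPolicy:
    """Hypothetical guard around URI retrieval: a cache plus a hard cap."""

    def __init__(self, fetch, max_fetches=20):
        self.fetch = fetch              # callable: uri -> document body
        self.max_fetches = max_fetches  # arbitrary illustrative limit
        self.cache = {}
        self.fetch_count = 0

    def get(self, uri):
        if uri in self.cache:
            # Same URI seen twice: answer from the cache, no retrieval.
            return self.cache[uri]
        if self.fetch_count >= self.max_fetches:
            raise RuntimeError("retrieval limit of %d reached"
                               % self.max_fetches)
        self.fetch_count += 1
        body = self.fetch(uri)
        self.cache[uri] = body
        return body
```

Usage would be to wrap whatever does the real HTTP work, for example
RetrievalPolicy(my_http_get, max_fetches=10), where my_http_get is a
placeholder for the caller's own function.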

8. The GRDDL Vocabulary
9. References



Order of Operation

Apart from getting the recursion mechanism implemented, it was tricky
to figure out what *order* to do some operations in.  The order I use is:
1) root element namespace
2) head profile
3) other in-document URIs (rel=transform, data-view: ...)
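
The three steps can be sketched in Python with the standard library
XML parser.  This is only an illustration of the ordering, not
Raptor's code.  It assumes the rel value is "transformation" and the
data-view namespace is http://www.w3.org/2003/g/data-view# as in the
WD; the function name grddl_sources is made up.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DATA_VIEW_NS = "http://www.w3.org/2003/g/data-view#"


def grddl_sources(xml_text):
    """List (kind, uri) pairs: namespace, then profiles, then in-document."""
    root = ET.fromstring(xml_text)
    sources = []
    # 1) root element namespace; ElementTree tags look like "{ns}local"
    if root.tag.startswith("{"):
        sources.append(("namespace", root.tag[1:].split("}", 1)[0]))
    # 2) head profile URIs, a space-separated list
    head = root.find("{%s}head" % XHTML_NS)
    if head is not None:
        for uri in head.get("profile", "").split():
            sources.append(("profile", uri))
    # 3) other in-document URIs: the data-view:transformation attribute
    #    on the root, then rel="transformation" on link and a elements
    for uri in root.get("{%s}transformation" % DATA_VIEW_NS, "").split():
        sources.append(("transformation", uri))
    link_tags = ("{%s}link" % XHTML_NS, "{%s}a" % XHTML_NS)
    for el in root.iter():
        if el.tag in link_tags and el.get("href"):
            if "transformation" in el.get("rel", "").split():
                sources.append(("transformation", el.get("href")))
    return sources
```

The sketch only collects the URIs; resolving relative ones against the
base URI and actually retrieving them are separate steps.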


Namespace/Profile transformation Triples

Do the triples mentioned in the formal descriptions get included into
the "GRDDL result of ?x" being calculated?    In Raptor they do.


Base URIs

Several of the XSLT sheets used in the tests assume that there is an
XSLT parameter called 'base' or 'Base' set to the base URI of the
document.  Without it, the tests fail.
These are the ones with the assumption:
   http://www.w3.org/2000/07/uri43/uri.xsl
   http://www.w3.org/2000/08/w3c-synd/home2rss.xsl
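
For anyone wanting to satisfy that assumption, a hedged Python sketch
of building the parameters follows.  It assumes an XSLT processor that
evaluates parameter values as XPath expressions (libxslt does unless
told to quote them), so the URI has to be passed as a quoted string
literal.  Both function names are hypothetical.

```python
def xpath_string_literal(value):
    """Quote a string as an XPath 1.0 string literal."""
    if "'" not in value:
        return "'%s'" % value
    if '"' not in value:
        return '"%s"' % value
    # XPath 1.0 has no escape mechanism inside string literals.
    raise ValueError("cannot quote a value containing both quote types")


def base_params(base_uri):
    """Build parameters for both spellings the stylesheets look for."""
    literal = xpath_string_literal(base_uri)
    return {"base": literal, "Base": literal}
```

Passing both spellings is a guess at the least surprising behaviour,
since the two sheets above do not agree on one name.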

I saw #base-param is still under discussion in
  http://lists.w3.org/Archives/Public/public-grddl-wg/2007Jan/0059.html
but I really don't understand that proposal.

Several of the examples assume something to do with base param and/or
base URIs in the sheets above and others.


Well known Transforms

I've also got some hard-coded XPaths in Raptor to find microformats
in XHTML just by recognising the CSS class names and then using a
"well-known" transformation.  I have disabled them now and will
probably remove them from the code entirely.

  DC: doesn't work, namespaces are wrong in the XSLT
    XPath:
/html:html/html:head/html:link[@href="http://purl.org/dc/elements/1.1/"]
    XSLT: http://www.w3.org/2000/06/dc-extract/dc-extract.xsl

  eRDF
    XPath:
/html:html/html:head[contains(@profile,"http://purl.org/NET/erdf/profile")]
    XSLT: http://purl.org/NET/erdf/extract-rdf.xsl

  hCalendar
    XPath: //*[@class="vevent"]
    XSLT: http://www.w3.org/2002/12/cal/glean-hcal.xsl


GRDDL Tests

http://www.w3.org/2001/sw/grddl-wg/

I was using http://www.w3.org/2001/sw/grddl-wg/td/ taken from the web
rather than mercurial (despite the directory names below).

It was a bit of a fuss to get the test setup working, as some parts
use the swap python code, and testft.py uses the rdflib python library
and 4suite (I didn't bother installing that).

$ PYTHONPATH=$HOME/lib/python2.4/site-packages python testft.py  --run
'rapper -i grddl -q -o rdfxml' testlist1.rdf > raptor_earl.rdf
rapper: Error - URI
file:///home/dajobe/dev/rdf/grddl/homer.w3.org:8123/atom-grddl.xml:2 - XML
parser error - Document is empty
rapper: Failed to parse URI
file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/atom-grddl.xml grddl
content
*
file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/testlist1.rdf#atomttl1
failed
*
file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/testlist1.rdf#base-param
failed
$

raptor_earl.rdf is attached


Test failures
1) atomttl1
I haven't figured this one out:

$ rapper -i grddl -q -o rdfxml atom-grddl.xml
rapper: Error - URI
file:///home/dajobe/dev/rdf/grddl/homer.w3.org:8123/atom-grddl.xml:2 - XML
parser error - Document is empty
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
</rdf:RDF>

2) base-param
$ rapper -i grddl -q -o rdfxml baseURI.html
<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="">
    <dc:title>Input for Base Param Test Case</dc:title>
  </rdf:Description>
</rdf:RDF>

The test suite expects 2 triples; I return 1.

The test expected result adds:
   <ex:StyleSheet rdf:about="baseURI.xsl"/>
but I don't see where that's from.

I tried the testlist2.rdf set but they all fail except for nmg-strawman#


That's all folks

Cheers

Dave

Received on Wednesday, 7 February 2007 07:42:45 UTC