Re: Raptor 1.4.14 GRDDL 2006-10-24 implementation report from Dan Connolly on 2007-02-07 (public-grddl-comments@w3.org from January to March 2007)

From: Dan Connolly <connolly@w3.org>
Date: Wed, 07 Feb 2007 04:20:37 -0600
To: Dave Beckett <dave@dajobe.org>
Cc: public-grddl-comments@w3.org
Message-Id: <1170843637.7497.381.camel@dirk>

On Tue, 2007-02-06 at 23:42 -0800, Dave Beckett wrote:
> Hi GRDDLers
> 
> Raptor ( http://librdf.org/raptor/ ) has had simple GRDDL support for
> some time, but it never did recursion following profile or namespace
> URIs, to look for profiles and triples indirectly.  In version 1.4.14
> I finally added those features, along with some features for managing web
> URI retrieval. Thus Raptor is closer - I won't call it complete - to
> implementing GRDDL as specified.

Excellent...

> Raptor implements (mostly)
> http://www.w3.org/TR/2006/WD-grddl-20061024/
> W3C Working Draft 24 October 2006
> 
> (I ignored any WG documents in progress)
> 
> 
> Using the GRDDL WD sections as a guide to remind me of comments
> 
> 1. Introduction
> 
> 2. Adding GRDDL to well-formed XML
> 
> Raptor has implemented xmlns:data-view and data-view:transformation
> on the root element for some time.
> 
> The current WD mentions having non-XML results here, such as Turtle,
> with issue-output-formats.
> 
> This was tricky to deal with in XSLT and in the tests because
> the XSLT environment does not always return an output mime type,
> so I had to make Raptor do more guessing of the content
> in order to determine which parser to use, defaulting to
> RDF/XML if the information doesn't indicate otherwise.

That seems reasonable. Maybe I should add some advice
about setting MIME types in XSLT transformations.
(@@TODO).
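
A minimal sketch of the fallback described above, for illustration only.
The function name, the media-type table and the Turtle sniffing rule are
assumptions, not Raptor's actual API or heuristics: trust a declared
media type when there is one, otherwise peek at the content, and fall
back to RDF/XML.

```python
# Hypothetical helper; the names and the media-type table are illustrative
# and are not taken from Raptor.
MEDIA_TYPE_TO_PARSER = {
    "application/rdf+xml": "rdfxml",
    "text/turtle": "turtle",
    "application/x-turtle": "turtle",
}


def guess_parser(media_type, content):
    """Pick a parser name for the output of a transformation."""
    if media_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        base = media_type.split(";", 1)[0].strip().lower()
        if base in MEDIA_TYPE_TO_PARSER:
            return MEDIA_TYPE_TO_PARSER[base]
    # No usable media type: look at the start of the content instead.
    head = content.lstrip()[:200]
    if head.startswith("@prefix") or head.startswith("@base"):
        return "turtle"
    # Default, as in the report: treat the result as RDF/XML.
    return "rdfxml"
```

A transformation that declares its output type (for example through the
media-type attribute of xsl:output) would let the first branch do the
work, so the content sniffing is only a last resort.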


> 3. GRDDL for XML Namespaces Documents
> 
> This is newly implemented by me in Raptor 1.4.14.
> 
> The two methods are:
>    1) "  if an information resource ?D  has an XML representation
>    whose root element has a namespace name ?NS then any GRDDL result
>    of the resource identified by ?NS  is a GRDDL result of ?D"
> 
> OK.  But is it not really saying that :
>    any GRDDL result of the resource identified by ?NS
>    is *included* in the GRDDL result of ?D
> i.e. in a set-of-triples inclusion sense.

Indeed, that was a design error; the rule has been re-written:

[[
If 
      * an information resource NSDOC, identified by a URI NS,
        represented by an XML document with root node NODE with a GRDDL
        result that includes a triple whose 
              * subject is NSDOC, whose
              * predicate is the property
                <http://www.w3.org/2003/g/data-view#namespaceTransformation>, and whose
              * object is TX,
      * and an information resource IR has an XML representation whose
        root element's namespace name is NS,
then TX is a GRDDL transformation of NODE
]]
 -- editor's draft
  http://www.w3.org/2004/01/rdxh/spec
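
To make the rewritten rule concrete, here is a small sketch. It assumes
the GRDDL result of the namespace document is already available as a
list of (subject, predicate, object) string tuples; the function name
and the example.org URIs are hypothetical.

```python
NS_TRANSFORMATION = (
    "http://www.w3.org/2003/g/data-view#namespaceTransformation"
)


def namespace_transformations(ns_uri, nsdoc_triples):
    """Return every TX named by a namespaceTransformation triple.

    nsdoc_triples is assumed to be the GRDDL result of the namespace
    document identified by ns_uri, as (subject, predicate, object) tuples.
    """
    return [
        obj
        for (subj, pred, obj) in nsdoc_triples
        if subj == ns_uri and pred == NS_TRANSFORMATION
    ]


# Illustrative data only; these URIs are made up for the example.
example_result = [
    ("http://example.org/ns", NS_TRANSFORMATION, "http://example.org/ns2rdf.xsl"),
    ("http://example.org/ns", "http://purl.org/dc/elements/1.1/title", "Example"),
]
```

Each returned TX is then a transformation to apply to the root node of
any document whose root element is in that namespace.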

>    2) " if an information resource ?D has an XML representation whose
>    root element has a namespace name ?NSDOC** and ?D has a GRDDL
>    result that includes, for any ?TX, the RDF triple { ?NSDOC
>    <http://www.w3.org/2003/g/data-view#namespaceTransformation> ?TX }
>    then ?TX is also a transformation of ?D"
> 
> So 2) builds on 1) since "?D has a GRDDL result" is what 1) defines.
> 
> Does that not imply 2) needs to be done after 1) ?

Perhaps that's the most straightforward implementation technique,
but the rules are declarative and insensitive to order.

> There is also some new terminology that's introduced:
>   - GRDDL result of the resource identified by ?NS
>   - a resource identified by ?NS ... is a GRDDL result of ?D
>   - [a resource?] ?TX is .. a transformation of [a resource] ?D
> 
> GRDDL result is special enough to deserve defining.

It is defined, no? by the rules.

> and "a transformation of" is something that could be expanded a bit.

The current editor's draft is considerably more elaborate
and precise on that sort of thing.


> I hard-coded not traversing the following commonly-seen namespace
> URIs which have no GRDDL right now, so it's wasted retrievals:
>   http://www.w3.org/1999/xhtml
>   http://www.w3.org/1999/02/22-rdf-syntax-ns#
>   http://www.w3.org/2001/XMLSchema
> 
> It might be questionable whether I should have included the RDF
> namespace, but I know right now it has no GRDDL.  issue-mt-ns
> mentions this.
>
> Is it legitimate to exclude some namespaces forever?

As long as they never change, yes. ;-)

The more relevant issue is issue-html-nsdoc: what caching policy should
GRDDL aware agents adopt for the XHTML namespace document?

We resolved to give advice rather than specify a policy:

[[
Some namespace documents, such as the XHTML namespace document
http://www.w3.org/1999/xhtml have very many references to them. If GRDDL
aware agents were to retrieve these documents every time they processed
a document referring to them, the origin servers of those documents
could become overloaded. GRDDL aware agents therefore should not
retrieve such documents on every reference and should retain some cache
or local memory of the transformations those documents indicate should
be applied. To avoid misrepresentation of published information, GRDDL
aware agents should ensure that this local memory is up to date and
should support user options to configure or disable the cache. See also
section 3.1. Using a URI to Access a Resource of [WEBARCH].
]]
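
One way to read that advice is sketched below. The class name, the
one-day default lifetime and the fetch callback are all assumptions made
for illustration; the draft does not prescribe any of them.

```python
import time


class TransformationCache:
    """Remember which transformations a namespace document indicates."""

    def __init__(self, max_age_seconds=86400, enabled=True, clock=time.time):
        self.max_age = max_age_seconds
        self.enabled = enabled      # user option to disable the cache
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # ns_uri -> (time stored, transformations)

    def get(self, ns_uri, fetch):
        """Return transformations for ns_uri, calling fetch only if needed."""
        if not self.enabled:
            return fetch(ns_uri)
        now = self.clock()
        entry = self._entries.get(ns_uri)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]
        # Missing or stale entry: go back to the origin server.
        value = fetch(ns_uri)
        self._entries[ns_uri] = (now, value)
        return value
```

A hard-coded exclusion list behaves like a cache entry that never
expires, which is only safe as long as the namespace document never
changes.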


> 
>    4. The GRDDL profile for XHTML
> 
> This was implemented by earlier Raptor versions.
> 
> It might be worth repeating that the <head profile="..."> is a
> space-separated list of URIs and to look for the GRDDL profile you
> need to find it in that list.

Yes; the current draft is:

[[
Given an XPath root node R of an XHTML document, for each
space-separated token REF in the value of the profile attribute of the
head element E, the absolute form of REF with respect to the base URI of
E is a profile of R
]]
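
A short sketch of that rule, assuming the profile attribute value and
the base URI of the head element have already been extracted; the
function names are hypothetical.

```python
from urllib.parse import urljoin

GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def profiles(profile_attr, base_uri):
    """Split the attribute on whitespace and absolutize each token."""
    return [urljoin(base_uri, ref) for ref in profile_attr.split()]


def has_grddl_profile(profile_attr, base_uri):
    """True if the GRDDL profile appears anywhere in the token list."""
    return GRDDL_PROFILE in profiles(profile_attr, base_uri)
```

Comparing whole tokens rather than testing the attribute with a
substring match avoids false positives from longer URIs that merely
start with the GRDDL profile URI.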

> 
>    5. GRDDL for HTML Profiles
> 
> One issue I found causing problems was whether to traverse the GRDDL
> profile URI itself, http://www.w3.org/2003/g/data-view

Oops; good question. The current editor's draft has a
"@@this section needs work" note in the part of the spec
that is a copy of that document.

> In the end I had to exclude it because the GRDDL profile document
> http://www.w3.org/2003/g/data-view contains an erdf profile, which
> refers to the GRDDL profile, so when you follow the natural
> evaluation order, you end up in a loop, or if like me, you were
> checking for urls already visited, the process terminated without
> having generated any triples at all.
> 
> i.e. http://www.w3.org/2003/g/data-view contains:
>   <html xmlns="http://www.w3.org/1999/xhtml">
>   <head profile="http://www.w3.org/2003/g/data-view
>                  http://purl.org/NET/erdf/profile">
> ...
> 
> and http://purl.org/NET/erdf/profile =>
> http://research.talis.com/2005/erdf/profile contains:
>    <head profile="http://www.w3.org/2003/g/data-view">
> 
> So for me, GRDDLing through the GRDDL profile URI does not work.
> Calling it direct - as the first URI - does work
> $ rapper -i grddl -c http://www.w3.org/2003/g/data-view
> rapper: Parsing URI http://www.w3.org/2003/g/data-view
> rapper: Parsing returned 40 triples
> 
> 
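
The cycle described above can be reproduced on a toy link table. This is
only a sketch of the visited-URI check; it is not Raptor's code, it
ignores the redirect from purl.org to research.talis.com, and it only
shows that the walk terminates, not which triples come out.

```python
# Profile links as given in the report: document -> profiles it names.
PROFILE_LINKS = {
    "http://www.w3.org/2003/g/data-view": [
        "http://www.w3.org/2003/g/data-view",
        "http://purl.org/NET/erdf/profile",
    ],
    "http://purl.org/NET/erdf/profile": [
        "http://www.w3.org/2003/g/data-view",
    ],
}


def visit_order(start, links):
    """Depth-first walk that visits each URI at most once."""
    seen = set()
    order = []

    def walk(uri):
        if uri in seen:
            return  # already visited: this is what breaks the loop
        seen.add(uri)
        order.append(uri)
        for ref in links.get(uri, []):
            walk(ref)

    walk(start)
    return order
```

Whichever document the walk starts from, the other one is reached while
the first is still unfinished, so a processor that needs the finished
result of a profile before using it gets nothing from the second visit.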
>    6. GRDDL Transformations
> 
> Raptor will do XSLT1 only for the foreseeable future since it depends
> on libxslt.
> 
> 
>    7. Security Considerations
> 
> This section should go beyond just XSLT issues and discuss
> - how GRDDL can cause the retrieval of many URIs

Hmm... yes, I guess that's worth noting, even though it
seems kinda obvious.
@@TODO

> - consideration of the rate of retrieval
> - what to do when you see the same URI twice

Now that's sufficiently orthogonal to GRDDL that
I'm inclined to sweep it under the webarch rug;
we do cite webarch:
"See also section 3.1. Using a URI to Access a Resource of
[WEBARCH]"

> - maybe suggest caching documents

yup. done.


>    8. The GRDDL Vocabulary
>    9. References
> 
> 
> 
> Order of Operation
> 
> Apart from getting the recursion mechanism implemented, it was tricky
> to figure out what *order* to do some operations in.

I just added a whole new appendix with a protocol trace to
give a typical lookup order.
http://www.w3.org/2004/01/rdxh/spec#extrace

>   The order I use is:
> 1) root element namespace
> 2) head profile
> 3) other in-document URIs (rel=transform, data-view: ...)
> 
> 
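
The order listed above can be sketched as a function that builds a
lookup plan from a document. This is an illustration under assumptions:
the function name and the plan format are made up, and only the
data-view:transformation attribute on the root element is handled for
step 3 (links in the body are left out).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DATA_VIEW_NS = "http://www.w3.org/2003/g/data-view#"


def lookup_plan(xml_text):
    """List (kind, URI) pairs in the order: namespace, profile, in-document."""
    root = ET.fromstring(xml_text)
    plan = []
    # 1) root element namespace; ElementTree tags look like "{ns}local"
    if root.tag.startswith("{"):
        plan.append(("namespace", root.tag[1:].split("}", 1)[0]))
    # 2) head profile, a space-separated list of URIs
    head = root.find("{%s}head" % XHTML_NS)
    if head is not None:
        for ref in head.get("profile", "").split():
            plan.append(("profile", ref))
    # 3) in-document transformation references on the root element
    attr = root.get("{%s}transformation" % DATA_VIEW_NS, "")
    for ref in attr.split():
        plan.append(("transformation", ref))
    return plan
```

Since the rules themselves are declarative, any order that ends up
considering the same set of URIs should give the same result; the plan
only fixes the order of retrievals.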
> Namespace/Profile transformation Triples
> 
> Do the triples mentioned in the formal descriptions get included into
> the "GRDDL result of ?x" being calculated?    In Raptor they do.

I'm not sure I understand the question. Do you have an example
handy?


> Base URIs
> 
> Several of the XSLT sheets used in the tests assume that there is an
> XSLT parameter called 'base' or 'Base' set to the base URI of the
> document.  Otherwise tests fail:
> These are the ones with the assumption:
>    http://www.w3.org/2000/07/uri43/uri.xsl
>    http://www.w3.org/2000/08/w3c-synd/home2rss.xsl

ew.
@@TODO: take a closer look.


> I saw #base-param is still under discussion in
>   http://lists.w3.org/Archives/Public/public-grddl-wg/2007Jan/0059.html
> but I really don't understand that proposal.

We closed that issue without changing the spec much; I do touch
on it briefly in the new protocol trace appendix. And we have
a test case in progress (hmm... did we approve that one? I'm not sure).
http://www.w3.org/2001/sw/grddl-wg/td/testlist1#base-param


> Several of the examples assume something to do with base param and/or
> base URIs in the sheets above and others.
> 
> 
> Well known Transforms
> 
> I've also got some hard-coded XPaths in Raptor to find microformats
> in XHTML just by recognising the css class names and then using a
> "well-known" transformation.  I have disabled them now and probably
> will remove them from the code entirely
> 
>   DC: doesn't work, namespaces are wrong in the XSLT
>     XPath:
> /html:html/html:head/html:link[@href="http://purl.org/dc/elements/1.1/"]
>     XSLT: http://www.w3.org/2000/06/dc-extract/dc-extract.xsl
> 
>   eRDF
>     XPath:
> /html:html/html:head[contains(@profile,"http://purl.org/NET/erdf/profile")]
>     XSLT: http://purl.org/NET/erdf/extract-rdf.xsl
> 
>   hCalendar
>     XPath: //*[@class="vevent"]
>     XSLT: http://www.w3.org/2002/12/cal/glean-hcal.xsl
> 
> 
> GRDDL Tests
> 
> http://www.w3.org/2001/sw/grddl-wg/
> 
> I was using http://www.w3.org/2001/sw/grddl-wg/td/ taken from the web
> rather than mercurial (despite the directory names below).
> 
> It was a bit of a fuss to get the tests setup working

yeah; it's fairly raw, still.

>  as some parts
> use the swap python, the testft uses rdflib python and 4suite (I
> didn't bother with installing that).
> 
> $ PYTHONPATH=$HOME/lib/python2.4/site-packages python testft.py  --run
> 'rapper -i grddl -q -o rdfxml' testlist1.rdf > raptor_earl.rdf
> rapper: Error - URI
> file:///home/dajobe/dev/rdf/grddl/homer.w3.org:8123/atom-grddl.xml:2 - XML
> parser error - Document is empty
> rapper: Failed to parse URI
> file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/atom-grddl.xml grddl
> content
> *
> file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/testlist1.rdf#atomttl1
> failed
> *
> file:///home/dajobe/dev/rdf/grddl/homer.w3.org%3A8123/testlist1.rdf#base-param
> failed
> $
> 
> raptor_earl.rdf is attached

Cool.

> Test failures
> 1) atomttl1
> I haven't figured this one out:
> 
> $ rapper -i grddl -q -o rdfxml atom-grddl.xml
> rapper: Error - URI
> file:///home/dajobe/dev/rdf/grddl/homer.w3.org:8123/atom-grddl.xml:2 - XML
> parser error - Document is empty
> <?xml version="1.0" encoding="utf-8"?>
> <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
> </rdf:RDF>
> 
> 2) base-param
> $ rapper -i grddl -q -o rdfxml baseURI.html
> <?xml version="1.0" encoding="utf-8"?>
> <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
>   <rdf:Description rdf:about="">
>     <dc:title>Input for Base Param Test Case</dc:title>
>   </rdf:Description>
> </rdf:RDF>
> 
> The test suite expects 2 triples, I return 1.

Ah... right; we haven't approved that one for that very
reason.

> The test expected result adds:
>    <ex:StyleSheet rdf:about="baseURI.xsl"/>
> but I don't see where that's from.

known bug.

> I tried the testlist2.rdf set but they all fail except for nmg-strawman#
> 
> 
> That's all folks

That's a lot!

> Cheers
> 
> Dave
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Wednesday, 7 February 2007 10:20:55 UTC