Parsing RDFa in Feeds from Toby A Inkster on 2009-01-13 (public-rdfa@w3.org from January 2009)

From: Toby A Inkster <tai@g5n.co.uk>
Date: Tue, 13 Jan 2009 10:54:00 +0000
To: RDFa <public-rdf-in-xhtml-tf@w3.org>, public-rdfa@w3.org
Message-Id: <E213BEFF-532C-4F48-B041-458BA3F73464@g5n.co.uk>

I've recently implemented support for this in Swignition
<http://buzzword.org.uk/swignition/> and thought I'd share my technique.

First I use Raptor <http://librdf.org/raptor/> to parse the feed.  
This results in a graph (which we'll call "G") including a number of  
resources with rdf:type <http://purl.org/rss/1.0/item>. I loop  
through these resources, and for each resource (which we'll call "R"):

1. If R does not have a content:encoded predicate, ignore it and go
on to the next resource. Note that the full URI for content:encoded
is <http://purl.org/rss/1.0/modules/content/encoded>, but some
versions of Raptor erroneously use
<http://web.resource.org/rss/1.0/modules/content/encoded>, so you
should check both. (I have different versions of librdf on my laptop
and desktop, so I come across this sort of thing all the time!)

2. Concatenate "<html>", the content:encoded literal (hopefully
there will be only one) and "</html>". Pass the result through a
routine that converts tag-soup HTML to valid XHTML.

3. Parse the XHTML as RDFa with a base URI equal to R's URI. This  
results in a graph "H".

4. Merge the triples from graph H into graph G taking care not to  
confuse similarly-named blank nodes. (i.e. if G contains a node _:Foo  
and H also contains a node _:Foo, then these should not be treated as  
the same node in the merged graph.)
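
The four steps above can be sketched roughly as follows. This is only
an illustration in Python, not the Swignition code: graphs are modelled
as plain sets of (subject, predicate, object) tuples, blank nodes as
strings starting with "_:", and the `tidy` and `parse_rdfa` callables
are hypothetical stand-ins for whatever tag-soup cleaner and RDFa
parser you have to hand.

```python
from itertools import count

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RSS_ITEM = "http://purl.org/rss/1.0/item"
# Check both the correct URI and the one some Raptor versions emit.
CONTENT_ENCODED = (
    "http://purl.org/rss/1.0/modules/content/encoded",
    "http://web.resource.org/rss/1.0/modules/content/encoded",
)


class Literal(str):
    """Marks a term as a literal so it is never mistaken for a blank node."""


def is_bnode(term):
    return not isinstance(term, Literal) and term.startswith("_:")


def wrap_html(fragment):
    # Step 2 (first half): wrap the escaped markup in a root element.
    return "<html>" + fragment + "</html>"


def merge_graphs(g, h):
    """Step 4: add h's triples to g, relabelling h's blank nodes."""
    used = {t for triple in g for t in triple if is_bnode(t)}
    fresh = ("_:h%d" % n for n in count())
    mapping = {}

    def rename(term):
        if not is_bnode(term):
            return term
        if term not in mapping:
            label = next(fresh)
            while label in used:  # skip labels already present in g
                label = next(fresh)
            mapping[term] = label
        return mapping[term]

    for s, p, o in h:
        g.add((rename(s), p, rename(o)))
    return g


def add_rdfa_from_items(g, tidy, parse_rdfa):
    """tidy(html) -> xhtml and parse_rdfa(xhtml, base) -> set of triples
    are assumed helpers supplied by the caller (hypothetical names)."""
    items = sorted({s for (s, p, o) in g if p == RDF_TYPE and o == RSS_ITEM})
    for r in items:
        literals = [o for (s, p, o) in g if s == r and p in CONTENT_ENCODED]
        if not literals:
            continue  # Step 1: no content:encoded, skip this resource
        xhtml = tidy(wrap_html(literals[0]))  # Step 2
        h = parse_rdfa(xhtml, base=r)  # Step 3: base URI is R's URI
        merge_graphs(g, h)  # Step 4
    return g
```

Relabelling every blank node from H, rather than only the clashing
ones, keeps the merge simple and guarantees that _:Foo in G and _:Foo
in H stay distinct.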

In the end, all the data are belong to G.

Open question: should XML namespaces used in the Feed be "inherited"  
as CURIE prefixes within the XHTML parsed in the step labelled "3"? I  
can see arguments either way. Overall, I feel that they should not.
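
To make the two options concrete: RDFa in XHTML takes its CURIE
prefixes from in-scope xmlns:* declarations, so "inheriting" would
amount to copying the feed's declarations onto the wrapper element
before parsing. A rough Python sketch, where `wrap_fragment` and its
`feed_namespaces` parameter are names made up for illustration:

```python
from xml.sax.saxutils import quoteattr


def wrap_fragment(fragment, feed_namespaces=None):
    """Wrap content:encoded markup in <html>.

    With feed_namespaces=None (no inheritance), the fragment must declare
    every prefix it uses. Passing a {prefix: uri} dict taken from the feed
    copies those declarations onto the wrapper (the "inherit" option).
    """
    attrs = "".join(
        " xmlns:%s=%s" % (prefix, quoteattr(uri))
        for prefix, uri in sorted((feed_namespaces or {}).items())
    )
    return "<html%s>%s</html>" % (attrs, fragment)
```

Calling wrap_fragment(fragment) with no second argument corresponds to
the "should not inherit" position: the wrapper carries no prefix
declarations from the feed.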

-- 
Toby A Inkster
<mailto:mail@tobyinkster.co.uk>
<http://tobyinkster.co.uk>

Received on Tuesday, 13 January 2009 10:54:50 UTC