Microdata to RDF conversion

From: Philip Jägenstedt <philipj@opera.com>
Date: Sun, 17 Jan 2010 21:24:38 +0100
To: "HTML WG" <public-html@w3.org>
Message-ID: <op.u6o1ncvxsr6mfa@worf>
http://dev.w3.org/html5/md/#rdf

I've reviewed and implemented this as part of microdatajs [1] and came  
across a few issues.


Several steps talk about "the language of the element", but it isn't  
entirely clear what this is. Should the "to determine the language of a  
node" algorithm be used, which finds the nearest ancestor with a lang  
attribute?
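
For reference, a minimal sketch of that nearest-ancestor lookup, assuming it
starts at the element itself and only considers the lang attribute. The Node
class is a hypothetical stand-in for a DOM element, and the real HTML
algorithm has further fallbacks that are not modelled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (not a real DOM API)."""
    lang: Optional[str] = None
    parent: Optional["Node"] = None


def language_of(node: Optional[Node]) -> str:
    """Return the lang of the node or of its nearest ancestor that has one.

    Returns the empty string when no lang attribute is found on the way up.
    """
    while node is not None:
        if node.lang is not None:
            return node.lang
        node = node.parent
    return ""


# Usage: <html lang="en"><body><span lang="sv">...</span></body></html>
html = Node(lang="en")
body = Node(parent=html)
span = Node(lang="sv", parent=body)
```

With this reading, body inherits "en" from html, while span keeps its own
"sv".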


Is there any particular reason for the uppercase token  
ALTERNATE-STYLESHEET? Wouldn't it be better to normalize the  
capitalization of all case-insensitive tokens to lowercase? (because it  
looks nicer)


This algorithm uses the http://purl.org/dc/terms/ namespace, while the  
mapping at <http://dev.w3.org/html5/md/#conversion-to-rdf> uses the  
http://purl.org/dc/elements/1.1/ namespace. http://purl.org/dc/terms/  
seems to be the canonical namespace at this time, so I suggest just using  
that.


What is the reasoning behind the steps for "If name contains no U+003A
COLON character (:)"? I assume the # is appended so that type URLs written
with and without a trailing # end up the same, since people sometimes just
drop it. But what's the colon for? Some non-normative explanation of the
monster URLs that these steps produce would be helpful.
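
To make the shape of these URLs concrete, here is a rough sketch of the
steps as I read them, reconstructed from the predicate URLs that appear in
this message. The function name predicate_for is made up for illustration,
and urllib's percent-encoding is only an approximation of whatever escaping
the spec requires; it happens to reproduce the URLs quoted here.

```python
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"


def predicate_for(item_type: str, name: str) -> str:
    """Build the predicate URL for a property name of a typed item."""
    if ":" in name:
        # Assumption: a name containing a colon is already a URL; use as-is.
        return name
    s = item_type
    if "#" not in s:
        s += "#"          # normalize types written without a trailing '#'
    s += ":" + name       # the colon separates the type from the name
    # Percent-encode everything so s fits into a single fragment.
    return MICRODATA_NS + quote(s, safe="")


print(predicate_for("http://n.whatwg.org/work", "work"))
```

For the type http://n.whatwg.org/work and the name work, this yields the
.../microdata#http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork URL quoted further
down.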


There's an issue with how vocabularies that use subitems are currently  
handled. In short, triples are only generated if the item either has a  
type which is an absolute URL or if the item property is an absolute URL.  
This prevents site-private data from being exported as RDF, which is a  
good thing. However, for vocabularies which have an item type for the  
top-level item but not for subitems (which seems quite unnecessary) this  
means that no triples are generated for the subitems, even though the
subitems can reasonably be considered to be using the same vocabulary as the
typed top-level item. To illustrate the point, here's the output of the
RDF extraction (as Turtle) from the Jack Bauer example if the current spec  
is honored:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix hcard: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fmicroformats.org%2Fprofile%2Fhcard%23%3A> .

<http://foolip.org/microdatajs/demo/turtle.html> dcterms:title "RDF/Turtle demo" ;
	<http://www.w3.org/1999/xhtml/microdata#item> _:n0 .
_:n0 rdf:type <http://microformats.org/profile/hcard> ;
      hcard:fn "Jack Bauer" ;
      hcard:photo <http://worf.foolip.org/microdatajs/demo/jack-bauer.jpg> ;
      hcard:org _:n1 ;
      hcard:adr _:n2 ;
      hcard:geo "34.052339;-118.410623" ;
      hcard:tel _:n3 ;
      hcard:url <http://en.wikipedia.org/wiki/Jack_Bauer> ;
      hcard:url <http://www.jackbauerfacts.com/> ;
      hcard:email "j.bauer@la.ctu.gov.invalid" ;
      hcard:tel _:n4 ;
      hcard:note "If I'm \"out in the field\", you may be better off\n contacting Chloe O'Brian if it's about\n work, or ask Tony Almeida if\n you're interested in the CTU five-a-side football team we're trying\n to get going." ;
      hcard:agent _:n5 ;
      hcard:agent "Tony Almeida" ;
      hcard:rev _:n6 ;
      hcard:tel _:n7 .
_:n5 rdf:type <http://microformats.org/profile/hcard> ;
      hcard:email <mailto:c.obrian@la.ctu.gov.invalid> ;
      hcard:fn "Chloe O'Brian" .

As you see, the structured subitems org, adr, etc. just point to blank
nodes with no further triples for those nodes. My fix is to pass on the  
type of the parent item when generating triples for subitems as a default,  
which is overridden if the subitem defines its own type (as e.g. agent  
does in the above). I think this is sensible and it certainly produces a  
more complete RDF graph:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix hcard: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fmicroformats.org%2Fprofile%2Fhcard%23%3A> .

<http://foolip.org/microdatajs/demo/turtle.html> dcterms:title "RDF/Turtle demo" ;
	<http://www.w3.org/1999/xhtml/microdata#item> _:n0 .
_:n0 rdf:type <http://microformats.org/profile/hcard> ;
      hcard:fn "Jack Bauer" ;
      hcard:photo <http://worf.foolip.org/microdatajs/demo/jack-bauer.jpg> ;
      hcard:org _:n1 ;
      hcard:adr _:n2 ;
      hcard:geo "34.052339;-118.410623" ;
      hcard:tel _:n3 ;
      hcard:url <http://en.wikipedia.org/wiki/Jack_Bauer> ;
      hcard:url <http://www.jackbauerfacts.com/> ;
      hcard:email "j.bauer@la.ctu.gov.invalid" ;
      hcard:tel _:n4 ;
      hcard:note "If I'm \"out in the field\", you may be better off\n contacting Chloe O'Brian if it's about\n work, or ask Tony Almeida if\n you're interested in the CTU five-a-side football team we're trying\n to get going." ;
      hcard:agent _:n5 ;
      hcard:agent "Tony Almeida" ;
      hcard:rev _:n6 ;
      hcard:tel _:n7 .
_:n1 hcard:organization-name "Counter-Terrorist Unit" ;
      hcard:organization-unit "Los Angeles Division" .
_:n2 hcard:street-address "10201 W. Pico Blvd." ;
      hcard:locality "Los Angeles" ;
      hcard:region "CA" ;
      hcard:postal-code "90064" ;
      hcard:country-name "United States" .
_:n3 hcard:value "+1 (310)\n  597 3781" ;
      hcard:type "work" ;
      hcard:type "pref" .
_:n4 hcard:value "+1 (310) 555\n  3781" ;
      hcard:type "cell" .
_:n5 rdf:type <http://microformats.org/profile/hcard> ;
      hcard:email <mailto:c.obrian@la.ctu.gov.invalid> ;
      hcard:fn "Chloe O'Brian" .
_:n6 hcard:type "date-time" ;
      hcard:value "2008-07-20T21:00:00+01:00" .
_:n7 hcard:type "home" ;
      hcard:value "01632 960 123" .

This looks good to me, but I'm no RDF expert, so feedback on whether these
triples are useful and can easily be mapped to other vocabularies would be
nice. (Note that my Turtle export pretty-prints a bit and adds some common  
prefixes for readability, but that's not part of the microdata spec, which  
has as its output abstract RDF triples with nothing to say about  
serialization.)
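
The type-inheritance fix described above can be sketched roughly as follows.
This is not the microdatajs code: the dict-based item representation, the
function names and the inherit flag are all assumptions made for
illustration, and blank nodes are simply numbered in the order they are
visited.

```python
from itertools import count
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def predicate_for(item_type, name):
    """Predicate URL for a short property name (assumed reading of the spec)."""
    s = item_type if "#" in item_type else item_type + "#"
    return MICRODATA_NS + quote(s + ":" + name, safe="")


def generate_triples(item, triples, ids, inherited_type=None, inherit=True):
    """Append triples for item and its subitems; return its blank node id.

    item is a hypothetical dict: {"type": str or absent,
    "properties": [(name, value)]}, where value is a string or a nested item.
    """
    subject = f"_:n{next(ids)}"
    own_type = item.get("type")
    if own_type:
        triples.append((subject, RDF_TYPE, own_type))
    # The proposed fix: fall back to the parent's type when none is given.
    vocab = own_type or (inherited_type if inherit else None)
    for name, value in item.get("properties", []):
        if isinstance(value, dict):
            value = generate_triples(value, triples, ids, vocab, inherit)
        if ":" in name:
            triples.append((subject, name, value))   # URL property
        elif vocab:
            triples.append((subject, predicate_for(vocab, name), value))
        # Otherwise the property is site-private and produces no triple.
    return subject


card = {
    "type": "http://microformats.org/profile/hcard",
    "properties": [
        ("fn", "Jack Bauer"),
        ("org", {"properties": [
            ("organization-name", "Counter-Terrorist Unit"),
        ]}),
    ],
}

current, fixed = [], []
generate_triples(card, current, count(), inherit=False)  # spec as written
generate_triples(card, fixed, count())                   # with the fix
```

With inherit=False the org subitem is only a dangling blank node, as in the
first Turtle listing; with the default it also gets its organization-name
triple, as in the second.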


Finally, some questions on how to apply the requirements of  
<http://dev.w3.org/html5/md/#conversion-to-rdf>. I simply filtered the  
triples a bit before outputting them, but is this the intended solution?  
The first requirement is 'For the purposes of RDF processors, blank nodes  
that are the subject of a triple with the predicate  
"http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork"  
and the object s must be treated as if the node was identified by s.' Can  
this be expressed using OWL? The last 3 requirements are simple predicate  
equivalences and can be expressed with owl:equivalentProperty, I think. If  
all of these requirements can in fact be expressed using OWL, adding  
non-normative text stating what exact triples accomplish that would be  
helpful.
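
For what it's worth, the kind of filtering meant for the first requirement
can be sketched like this. It is an assumed reading, not the actual
microdatajs code: triples are plain (subject, predicate, object) tuples of
strings, blank nodes are recognised by a "_:" prefix, and
identify_work_nodes is a made-up name.

```python
WORK = ("http://www.w3.org/1999/xhtml/microdata#"
        "http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork")


def identify_work_nodes(triples):
    """Treat blank nodes that have a 'work' triple as identified by its object.

    Every occurrence of such a blank node, as subject or object, is replaced
    by the object s of its work triple. The work triple itself is kept and
    becomes (s, WORK, s).
    """
    names = {s: o for s, p, o in triples
             if p == WORK and s.startswith("_:")}

    def rename(term):
        return names.get(term, term)

    return [(rename(s), p, rename(o)) for s, p, o in triples]


# Usage with made-up example.com URLs:
triples = [
    ("http://example.com/page", "http://www.w3.org/1999/xhtml/microdata#item", "_:n0"),
    ("_:n0", WORK, "http://example.com/photo.jpg"),
    ("_:n0", "http://purl.org/dc/terms/title", "A photo"),
]
filtered = identify_work_nodes(triples)
```

Since plain tuples do not distinguish literals from node identifiers, a real
implementation would need a richer term representation than this.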

[1] http://gitorious.org/microdatajs

-- 
Philip Jägenstedt
Received on Sunday, 17 January 2010 20:24:37 GMT
