
Re: Microdata to RDF conversion

From: Maciej Stachowiak <mjs@apple.com>
Date: Sun, 17 Jan 2010 15:13:14 -0800
Cc: HTML WG <public-html@w3.org>
Message-id: <F4ABAF92-5449-4AEA-8BE3-546340DE04F8@apple.com>
To: Philip Jägenstedt <philipj@opera.com>

I suggest filing these in Bugzilla (except for any where you are unsure whether it's actually a problem).

 - Maciej

On Jan 17, 2010, at 12:24 PM, Philip Jägenstedt wrote:

> http://dev.w3.org/html5/md/#rdf
> 
> I've reviewed and implemented this as part of microdatajs [1] and came across a few issues.
> 
> 
> Several steps talk about "the language of the element", but it isn't entirely clear what this is. Should the "to determine the language of a node" algorithm be used, which finds the nearest ancestor with a lang attribute?
> 
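A minimal sketch of the lookup described in the question above, assuming "the language of the element" means the value of the nearest lang attribute. The Node class is a hypothetical stand-in for a DOM element, starting the walk at the node itself is an assumption, and other sources of language information (such as xml:lang) are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element; not a real DOM API."""
    tag: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def language_of(node: Optional[Node]) -> str:
    """Walk from the node up through its ancestors and return the first lang value found."""
    while node is not None:
        if "lang" in node.attrs:
            return node.attrs["lang"]
        node = node.parent
    # Assumption: an unknown language is represented by the empty string.
    return ""


html = Node("html", {"lang": "en"})
body = Node("body", parent=html)
span = Node("span", {"lang": "sv"}, parent=body)
print(language_of(body), language_of(span))
```

Here body inherits the language of html, while span uses its own lang attribute.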
> 
> Is there any particular reason for the uppercase token ALTERNATE-STYLESHEET? Wouldn't it be better to normalize all case-insensitive tokens to lowercase, if only because it looks nicer?
> 
> 
> This algorithm uses the http://purl.org/dc/terms/ namespace, while the mapping at <http://dev.w3.org/html5/md/#conversion-to-rdf> uses the http://purl.org/dc/elements/1.1/ namespace. http://purl.org/dc/terms/ seems to be the canonical namespace at this time, so I suggest just using that.
> 
> 
> What is the reasoning behind the steps for "If name contains no U+003A COLON character (:)"? I assume that the # is added to normalize type URLs, since people sometimes just drop a trailing #. But what's the colon for? Some non-normative explanation of the monster URLs that these steps produce would be helpful.
> 
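A minimal sketch of how such a predicate URL appears to be assembled, reconstructed from the hcard prefix in the Turtle output quoted later in this message rather than from the spec text itself. The function name predicate_for is hypothetical, and escaping with urllib.parse.quote(..., safe="") is an assumption that happens to reproduce that prefix; the spec's own escaping rule may differ for other characters.

```python
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"


def predicate_for(item_type: str, name: str) -> str:
    """Build a predicate URL for a property name on an item of the given type."""
    if ":" in name:
        # Assumption: a name containing a colon is already an absolute URL.
        return name
    s = item_type
    if "#" not in s:
        s += "#"        # make sure the type carries a fragment separator
    s += ":" + name     # the colon separates the type from the property name
    # Percent-escape the whole string so it fits inside a single fragment.
    return MICRODATA_NS + quote(s, safe="")


print(predicate_for("http://microformats.org/profile/hcard", "fn"))
```

For the hcard type and the name fn, this yields the hcard: prefix from the Turtle output followed by fn.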
> 
> There's an issue with how vocabularies that use subitems are currently handled. In short, triples are only generated if the item has a type which is an absolute URL or if the item property is an absolute URL. This prevents site-private data from being exported as RDF, which is a good thing. However, for vocabularies which have an item type for the top-level item but not for subitems (which seems quite unnecessary), this means that no triples are generated for the subitems, even though a subitem could reasonably be considered to be using the same vocabulary as the typed top-level item. To illustrate the point, here's the output of the RDF extraction (as Turtle) from the Jack Bauer example if the current spec is honored:
> 
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> @prefix dcterms: <http://purl.org/dc/terms/> .
> @prefix hcard: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fmicroformats.org%2Fprofile%2Fhcard%23%3A> .
> 
> <http://foolip.org/microdatajs/demo/turtle.html> dcterms:title "RDF/Turtle demo" ;
> 	<http://www.w3.org/1999/xhtml/microdata#item> _:n0 .
> _:n0 rdf:type <http://microformats.org/profile/hcard> ;
>     hcard:fn "Jack Bauer" ;
>     hcard:photo <http://worf.foolip.org/microdatajs/demo/jack-bauer.jpg> ;
>     hcard:org _:n1 ;
>     hcard:adr _:n2 ;
>     hcard:geo "34.052339;-118.410623" ;
>     hcard:tel _:n3 ;
>     hcard:url <http://en.wikipedia.org/wiki/Jack_Bauer> ;
>     hcard:url <http://www.jackbauerfacts.com/> ;
>     hcard:email "j.bauer@la.ctu.gov.invalid" ;
>     hcard:tel _:n4 ;
>     hcard:note "If I'm \"out in the field\", you may be better off\n contacting Chloe O'Brian if it's about\n work, or ask Tony Almeida if\n you're interested in the CTU five-a-side football team we're trying\n to get going." ;
>     hcard:agent _:n5 ;
>     hcard:agent "Tony Almeida" ;
>     hcard:rev _:n6 ;
>     hcard:tel _:n7 .
> _:n5 rdf:type <http://microformats.org/profile/hcard> ;
>     hcard:email <mailto:c.obrian@la.ctu.gov.invalid> ;
>     hcard:fn "Chloe O'Brian" .
> 
> As you can see, the structured subitems org, adr, etc. just point to blank nodes with no further triples for those nodes. My fix is to pass the type of the parent item down as a default when generating triples for subitems; the default is overridden if the subitem defines its own type (as agent does above). I think this is sensible and it certainly produces a more complete RDF graph:
> 
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> @prefix dcterms: <http://purl.org/dc/terms/> .
> @prefix hcard: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fmicroformats.org%2Fprofile%2Fhcard%23%3A> .
> 
> <http://foolip.org/microdatajs/demo/turtle.html> dcterms:title "RDF/Turtle demo" ;
> 	<http://www.w3.org/1999/xhtml/microdata#item> _:n0 .
> _:n0 rdf:type <http://microformats.org/profile/hcard> ;
>     hcard:fn "Jack Bauer" ;
>     hcard:photo <http://worf.foolip.org/microdatajs/demo/jack-bauer.jpg> ;
>     hcard:org _:n1 ;
>     hcard:adr _:n2 ;
>     hcard:geo "34.052339;-118.410623" ;
>     hcard:tel _:n3 ;
>     hcard:url <http://en.wikipedia.org/wiki/Jack_Bauer> ;
>     hcard:url <http://www.jackbauerfacts.com/> ;
>     hcard:email "j.bauer@la.ctu.gov.invalid" ;
>     hcard:tel _:n4 ;
>     hcard:note "If I'm \"out in the field\", you may be better off\n contacting Chloe O'Brian if it's about\n work, or ask Tony Almeida if\n you're interested in the CTU five-a-side football team we're trying\n to get going." ;
>     hcard:agent _:n5 ;
>     hcard:agent "Tony Almeida" ;
>     hcard:rev _:n6 ;
>     hcard:tel _:n7 .
> _:n1 hcard:organization-name "Counter-Terrorist Unit" ;
>     hcard:organization-unit "Los Angeles Division" .
> _:n2 hcard:street-address "10201 W. Pico Blvd." ;
>     hcard:locality "Los Angeles" ;
>     hcard:region "CA" ;
>     hcard:postal-code "90064" ;
>     hcard:country-name "United States" .
> _:n3 hcard:value "+1 (310)\n  597 3781" ;
>     hcard:type "work" ;
>     hcard:type "pref" .
> _:n4 hcard:value "+1 (310) 555\n  3781" ;
>     hcard:type "cell" .
> _:n5 rdf:type <http://microformats.org/profile/hcard> ;
>     hcard:email <mailto:c.obrian@la.ctu.gov.invalid> ;
>     hcard:fn "Chloe O'Brian" .
> _:n6 hcard:type "date-time" ;
>     hcard:value "2008-07-20T21:00:00+01:00" .
> _:n7 hcard:type "home" ;
>     hcard:value "01632 960 123" .
> 
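The difference between the two outputs above can be sketched roughly as follows. This is an illustration under stated assumptions, not the microdatajs code: items are modelled as plain dicts with an optional "type" and a "properties" mapping, and the names extract, visit and predicate_for are hypothetical. The inherit_type flag switches between the behaviour described for the current spec and the proposed fix.

```python
from itertools import count
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def predicate_for(item_type, name):
    # Same construction as the hcard: prefix in the Turtle above (escaping is an assumption).
    s = item_type if "#" in item_type else item_type + "#"
    return MICRODATA_NS + quote(s + ":" + name, safe="")


def extract(item, inherit_type):
    """Return (subject, predicate, object) triples for an item and its subitems."""
    triples = []
    ids = count()

    def visit(node, parent_type):
        subject = f"_:n{next(ids)}"
        own_type = node.get("type")
        if own_type:
            triples.append((subject, RDF_TYPE, own_type))
        # Proposed fix: fall back to the parent's type when the subitem has none.
        vocab = own_type or (parent_type if inherit_type else None)
        for name, values in node.get("properties", {}).items():
            if ":" in name:
                pred = name                      # absolute URL property
            elif vocab:
                pred = predicate_for(vocab, name)
            else:
                continue                         # untyped short name: no triple
            for value in values:
                obj = visit(value, vocab) if isinstance(value, dict) else value
                triples.append((subject, pred, obj))
        return subject

    visit(item, None)
    return triples


card = {
    "type": "http://microformats.org/profile/hcard",
    "properties": {
        "fn": ["Jack Bauer"],
        "org": [{"properties": {"organization-name": ["Counter-Terrorist Unit"]}}],
    },
}
print(len(extract(card, inherit_type=False)), len(extract(card, inherit_type=True)))
```

With inherit_type=False the org blank node is referenced but has no triples of its own; with inherit_type=True it gains the organization-name triple.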
> This looks good to me, but I'm no RDF expert, so feedback on whether these triples are useful and can easily be mapped to other vocabularies would be nice. (Note that my Turtle export pretty-prints a bit and adds some common prefixes for readability, but that's not part of the microdata spec, whose output is abstract RDF triples, with nothing to say about serialization.)
> 
> 
> Finally, some questions on how to apply the requirements of <http://dev.w3.org/html5/md/#conversion-to-rdf>. I simply filtered the triples a bit before outputting them, but is this the intended solution? The first requirement is 'For the purposes of RDF processors, blank nodes that are the subject of a triple with the predicate "http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork" and the object s must be treated as if the node was identified by s.' Can this be expressed using OWL? The last 3 requirements are simple predicate equivalences and can be expressed with owl:equivalentProperty, I think. If all of these requirements can in fact be expressed using OWL, adding non-normative text stating what exact triples accomplish that would be helpful.
> 
> [1] http://gitorious.org/microdatajs
> 
> -- 
> Philip Jägenstedt
> 
Received on Sunday, 17 January 2010 23:13:52 GMT
