Re: Microdata to RDF conversion

On Sun, 17 Jan 2010, Philip Jägenstedt wrote:
>
> http://dev.w3.org/html5/md/#rdf
> 
> I've reviewed and implemented this as part of microdatajs [1] and came 
> across a few issues.
> 
> Several steps talk about "the language of the element", but it isn't 
> entirely clear what this is. Should the "to determine the language of a 
> node" algorithm be used, which finds the nearest ancestor with a lang 
> attribute?

Yes. (This is clearer in the WHATWG version where the cross-references 
work -- once gsnedders' new tool is done we'll be able to have that here 
too.)
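
For illustration only, here is a minimal sketch of that lookup. It assumes a 
toy Node class (hypothetical, not a real DOM API) and only looks at the lang 
attribute; the full algorithm also consults xml:lang and the pragma-set 
default language, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy element; a hypothetical stand-in for a DOM element."""
    tag: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def language_of(node: Optional[Node]) -> str:
    """Return the lang value of the node itself or its nearest ancestor.

    Returns the empty string (language unknown) if no lang attribute is found.
    """
    while node is not None:
        if "lang" in node.attrs:
            return node.attrs["lang"]
        node = node.parent
    return ""
```

With an html element carrying lang="sv" and a descendant p without its own 
lang attribute, language_of(p) would return "sv" in this sketch.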


> Is there any particular reason for the uppercase token 
> ALTERNATE-STYLESHEET? Wouldn't it be better to normalize the 
> capitalization of all case-insensitive tokens to lowercase? (because it 
> looks nicer)

It has to be uppercase to not clash with rel="alternate-stylesheet".


> This algorithm uses the http://purl.org/dc/terms/ namespace, while the 
> mapping at <http://dev.w3.org/html5/md/#conversion-to-rdf> uses the 
> http://purl.org/dc/elements/1.1/ namespace. http://purl.org/dc/terms/ 
> seems to be the canonical namespace at this time, so I suggest just 
> using that.

Wait, what? I'm confused. What exactly are you saying should change?


> What is the reasoning behind the steps for "If name contains no U+003A 
> COLON character (:)"? I assume that # is added to normalize URLs that 
> end with # where people sometimes just remove that. But what's the colon 
> for? Some non-normative explanation of the monster URLs that these steps 
> produce would be helpful.

The # is intended to ensure that we don't make up new URLs, which would be 
poor form, as far as I can tell, since they might not resolve.

The : is intended to separate the type URL from the name, using a 
character that cannot appear in the name (since that could lead to 
ambiguities).
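
A rough sketch of how I read those steps, for names that contain no colon. 
The function name predicate_for is made up for this example, and the 
escaping is an assumption: it percent-encodes everything except unreserved 
characters, which happens to reproduce the n.whatwg.org/work predicate 
quoted later in this thread. The spec text remains the authority on exactly 
which characters get escaped.

```python
from urllib.parse import quote

MICRODATA_PREFIX = "http://www.w3.org/1999/xhtml/microdata#"


def predicate_for(item_type: str, name: str) -> str:
    """Build a predicate URL for a property name that contains no colon."""
    if ":" in name:
        # Names with a colon are handled by a different branch of the algorithm.
        raise ValueError("name must not contain a colon")
    s = item_type
    if "#" not in s:
        s += "#"   # reuse the type URL as-is rather than inventing a new URL
    s += ":"       # separator that cannot appear in the name
    s += name
    # Assumption: escape everything except unreserved characters.
    return MICRODATA_PREFIX + quote(s, safe="")


print(predicate_for("http://n.whatwg.org/work", "work"))
# http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork
```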


> There's an issue with how vocabularies that use subitems are currently 
> handled. In short, triples are only generated if the item either has a 
> type which is an absolute URL or if the item property is an absolute 
> URL. This prevents site-private data from being exported as RDF, which 
> is a good thing. However, for vocabularies which have an item type for 
> the top-level item but not for subitems (which seems quite unnecessary) 
> this means that no triples are generated for the subitems, even though 
> the subitem can reasonably be considered to be using the same vocabulary as 
> the typed top-level item. To illustrate the point, here's the output of 
> the RDF extraction (as Turtle) from the Jack Bauer example if the 
> current spec is honored: [...]
> 
> As you see, the structured subitems org, adr, etc just point to blank nodes
> with no further triples for those nodes. My fix is to pass on the type of the
> parent item when generating triples for subitems as a default, which is
> overridden if the subitem defines its own type (as e.g. agent does in the
> above). I think this is sensible and it certainly produces a more complete RDF
> graph: [...]

That works if you know the vocabulary and thus know that the nested 
subitem is from that vocabulary, but it seems highly suspect in the case 
where you don't know that. Also, consider:

   <div itemscope itemtype="http://example.com/person">
    <p itemprop="school" itemscope>
     I go to school in the <span itemprop="class">middle</span> classroom.
    </p>
    <p itemprop="demographics" itemscope>
     I am <span itemprop="class">middle</span>-classed.
    </p>
   </div>

(A bit contrived, but you get the idea.) It would be wrong to use the same 
predicate for both itemprop="class" cases.
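
To make the collision concrete, here is a small sketch. It assumes the 
predicate is built from the item type and the property name as discussed 
earlier in this message, with a simplified escaping rule; the helper names 
and the http://example.com/school and http://example.com/demographics types 
are hypothetical, invented for this example.

```python
from typing import Optional
from urllib.parse import quote

MICRODATA_PREFIX = "http://www.w3.org/1999/xhtml/microdata#"


def predicate_for(item_type: str, name: str) -> str:
    """Simplified predicate construction for names without a colon."""
    s = item_type if "#" in item_type else item_type + "#"
    return MICRODATA_PREFIX + quote(s + ":" + name, safe="")


def effective_type(own_type: Optional[str], parent_type: str) -> str:
    """The proposed fallback: use the parent's type when the subitem has none."""
    return own_type if own_type is not None else parent_type


PERSON = "http://example.com/person"

# Proposed inheritance: both untyped subitems fall back to the person type.
inherited_school = predicate_for(effective_type(None, PERSON), "class")
inherited_demographics = predicate_for(effective_type(None, PERSON), "class")

# Explicit itemtype on each subitem (hypothetical type URLs).
typed_school = predicate_for(
    effective_type("http://example.com/school", PERSON), "class")
typed_demographics = predicate_for(
    effective_type("http://example.com/demographics", PERSON), "class")

print(inherited_school == inherited_demographics)  # True: the two meanings collide
print(typed_school == typed_demographics)          # False: distinct predicates
```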

Long story short, I think it's better to just use itemtype="" everywhere 
you want to start a new vocabulary. This does mean the vCard vocabulary 
doesn't really convert to RDF well, but what's the use case for that? I 
would have thought most people would just use vCard, if they wanted to 
convert this to another format.


> Finally, some questions on how to apply the requirements of 
> <http://dev.w3.org/html5/md/#conversion-to-rdf>. I simply filtered the 
> triples a bit before outputting them, but is this the intended solution? 
> The first requirement is 'For the purposes of RDF processors, blank 
> nodes that are the subject of a triple with the predicate 
> "http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fn.whatwg.org%2Fwork%23%3Awork" 
> and the object s must be treated as if the node was identified by s.' 
> Can this be expressed using OWL? The last 3 requirements are simple 
> predicate equivalences and can be expressed with owl:equivalentProperty, 
> I think. If all of these requirements can in fact be expressed using 
> OWL, adding non-normative text stating what exact triples accomplish 
> that would be helpful.

Done.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 19 January 2010 08:22:56 UTC