Re: Microdata to RDF conversion from Philip Jägenstedt on 2010-02-02 (public-html@w3.org from February 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Tue, 02 Feb 2010 23:43:52 +0100
To: "Ian Hickson" <ian@hixie.ch>
Cc: "HTML WG" <public-html@w3.org>
Message-ID: <op.u7iuresasr6mfa@worf>
On Fri, 29 Jan 2010 10:29:54 +0100, Ian Hickson <ian@hixie.ch> wrote:

> On Wed, 20 Jan 2010, Philip Jägenstedt wrote:
>> On Tue, 19 Jan 2010 09:22:25 +0100, Ian Hickson <ian@hixie.ch> wrote:
>> > On Sun, 17 Jan 2010, Philip Jägenstedt wrote:
>> > > There's an issue with how vocabularies that use subitems are
>> > > currently handled. In short, triples are only generated if the item
>> > > either has a type which is an absolute URL or if the item property
>> > > is an absolute URL. This prevents site-private data from being
>> > > exported as RDF, which is a good thing. However, for vocabularies
>> > > which have an item type for the top-level item but not for subitems
>> > > (which seems quite unnecessary) this means that no triples are
>> > > generated for the subitems, even though the subitem can reasonably be
>> > > considered to be using the same vocabulary as the typed top-level
>> > > item. To illustrate the point, here's the output of the RDF
>> > > extraction (as Turtle) from the Jack Bauer example if the current
>> > > spec is honored: [...]
>> > >
>> > > As you see, the structured subitems org, adr, etc just point to
>> > > blank nodes with no further triples for those nodes. My fix is to
>> > > pass on the type of the parent item when generating triples for
>> > > subitems as a default, which is overridden if the subitem defines
>> > > its own type (as e.g. agent does in the above). I think this is
>> > > sensible and it certainly produces a more complete RDF graph: [...]
>> >
>> > That works if you know the vocabulary and thus know that the nested
>> > subitem is from that vocabulary, but it seems highly suspect in the
>> > case where you don't know that. Also, consider:
>> >
>> >   <div itemscope itemtype="http://example.com/person">
>> >    <p itemprop="school" itemscope>
>> >     I go to school in the <span itemprop="class">middle</span> classroom.
>> >    </p>
>> >    <p itemprop="demographics" itemscope>
>> >     I am <span itemprop="class">middle</span>-classed.
>> >    </p>
>> >   </div>
>> >
>> > (A bit contrived, but you get the idea.) It would be wrong to use the
>> > same predicate for both itemprop="class" cases.
>>
>> Since no itemtype is used for itemprop="school",
>> http://example.com/person must define this as part of its vocabulary,
>> unless the above is an example of invalid markup. Since it's all one
>> vocabulary, using the same prefix for the RDF predicates seems quite
>> logical.
>
> I really don't think they're the same predicate, but I agree that we need
> to expose these triples somehow. Consider:
>
>    <div itemscope itemtype="http://example.com/a" itemref="x"></div>
>    <div itemscope itemtype="http://example.com/b" itemref="x"></div>
>    <div id="x"> <p itemprop="q" itemscope> <span itemprop="r">s</span> </p> </div>
>
> Right now this generates four blank nodes, which is a bug, it should
> generate three. But if we generate three, then what predicate do we use
> for the itemprop?
>
> I ended up going with kind of a compromise solution. The above generates
> _four_ triples, but _three_ nodes:
>
>    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>    @prefix eg:  <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2F> .
>
>    _:n0 rdf:type <http://example.com/a> ;
>         <eg:a%23%3Aq> _:n2 .
>
>    _:n1 rdf:type <http://example.com/b> ;
>         <eg:b%23%3Aq> _:n2 .
>
>    _:n2 <eg:a%23%3Aq%20r> "s" ;
>         <eg:b%23%3Aq%20r> "s" .
>
> Basically, instead of using "type:name", I used "type:parent-name name",
> where the space character is another character that, like ":", cannot
> appear in "name" and thus is usable here without making anything
> ambiguous. Hopefully this solves the problem relatively neatly, if not in
> the most performance-optimal way.

The item-subject cache seems reasonable. Handling of subitems, however, is
really messy. I can't tell whether it's a spec bug or not, but there's
double-encoding in step 5 and one colon too many appended somewhere, so the
actual result is: [1]

@prefix ega: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2Fa%23%3A> .
@prefix egb: <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2Fb%23%3A> .

_:n0
   rdf:type <http://example.com/a> ;
   ega:q _:n1 .
_:n1
   ega:%3Aq%2520r "s" ;
   egb:%3Aq%2520r "s" .
_:n2
   rdf:type <http://example.com/b> ;
   egb:q _:n1 .

Notice the extra %3A (: %-escaped) and %2520 (%20 %-escaped). I haven't  
tried pinpointing the exact problem because I hope this mail will make it  
unnecessary. In my opinion strange %-escaping of the URLs is acceptable as  
long as it's hidden in the prefix, but the above (even if bugfixed) is too  
messy -- we might as well not bother and let everyone roll their own  
vocabulary-specific extraction.
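
To make the difference concrete, here is a rough Python sketch that
decodes both predicates once. It is only an illustration: the helper names
`predicate` and `decode_once` are made up, and the "#:" joining rule is my
reading of the quoted example, not a restatement of the spec algorithm.

```python
from urllib.parse import quote, unquote

# Namespace taken from the @prefix lines above.
MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"

def predicate(item_type: str, name: str) -> str:
    """Hypothetical helper: %-escape "<type>#:<name>" exactly once."""
    return MICRODATA_NS + quote(item_type + "#:" + name, safe="")

def decode_once(uri: str) -> str:
    """Undo one level of %-escaping on the part after the namespace."""
    return unquote(uri[len(MICRODATA_NS):])

# Intended predicate for the nested property ("parent-name name" = "q r").
intended = predicate("http://example.com/a", "q r")

# Predicate actually produced: the ega: prefix plus local name %3Aq%2520r.
observed = MICRODATA_NS + "http%3A%2F%2Fexample.com%2Fa%23%3A" + "%3Aq%2520r"

print(decode_once(intended))  # http://example.com/a#:q r
print(decode_once(observed))  # http://example.com/a#::q%20r
```

After one round of decoding, the observed predicate still has a second
colon and a space that is still escaped, which is what the extra %3A and
the %2520 amount to.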

The intention was to address the case where two differently typed items  
share a single typeless subitem. Is this a reasonable case to begin with  
and should it even be valid? It would only make sense if two vocabularies  
happened to share the same names and structure for something, like say adr  
in vcard. Whenever this is intentional I would argue that the subitem  
should have a type that is shared between the two vocabularies. In all  
other cases this is more an attempt to optimize for accidental overlap. I  
don't think there's any reason to worry about it, but it could just as  
well be done like this:

<div itemscope itemtype="http://example.com/a">
   <div itemprop="q" itemscope itemref="x"></div>
</div>
<div itemscope itemtype="http://example.com/b">
   <div itemprop="q" itemscope itemref="x"></div>
</div>
<div id="x"> <span itemprop="r">s</span> </div>

Without actual examples it's difficult to know what to make of this, but it
doesn't seem obviously better or worse.

Putting shared untyped subitems aside for a moment, what should be done  
with the rest?

I'll simply reiterate my original suggestion to use the parent type as the  
fallback type. The structure of the item is already reflected in the RDF  
graph, there's really no need to also reflect it in the properties. In the  
odd case that two typed items shared a single untyped item the result  
won't be *that* strange anyway:

_:n0
   rdf:type <http://example.com/a> ;
   ega:q _:n1 .
_:n1
   ega:r "s" ;
   egb:r "s" .
_:n2
   rdf:type <http://example.com/b> ;
   egb:q _:n1 .

(using the markup from [1])
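
As a rough sketch of the fallback rule (not the spec algorithm and not
the microdatajs code), the extraction could look something like this in
Python. The `Item` class, the `extract` function and the representation of
a predicate as a (vocabulary, name) pair are all assumptions made for the
illustration. URL escaping and properties that are themselves absolute
URLs are left out.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(eq=False)  # eq=False: items hash by identity, so they can be dict keys
class Item:
    type: Optional[str] = None
    props: list = field(default_factory=list)  # (name, value); value is str or Item

def extract(top_level_items):
    subjects = {}  # item -> blank node label (the item-subject cache)
    triples = []

    def add(triple):
        if triple not in triples:  # a shared item is visited more than once
            triples.append(triple)

    def walk(item, fallback_type):
        if item not in subjects:
            subjects[item] = f"_:n{len(subjects)}"
        node = subjects[item]
        # The item's own type wins; otherwise inherit the parent's type.
        vocab = item.type or fallback_type
        if item.type:
            add((node, "rdf:type", item.type))
        if vocab is None:
            return node  # site-private data: no property triples
        for name, value in item.props:
            obj = walk(value, vocab) if isinstance(value, Item) else value
            add((node, (vocab, name), obj))
        return node

    for item in top_level_items:
        walk(item, None)
    return triples

shared = Item(props=[("r", "s")])
a = Item("http://example.com/a", [("q", shared)])
b = Item("http://example.com/b", [("q", shared)])
for triple in extract([a, b]):
    print(triple)
```

For the two typed items sharing one untyped subitem, this yields the same
six triples as the Turtle above, with the shared subitem as _:n1.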

Any strangeness is because the original markup was strange. It's still  
possible to see which properties originated where simply from the URL  
prefixes, if that is somehow important. However, I'm not sure that caching  
untyped blank nodes is really a great idea. We could limit caching to  
items with itemtype to allow sensible subitem sharing but not output  
anything dodgy in cases like the above.
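
A minimal Python sketch of what limiting the cache to typed items might
mean, assuming a hypothetical `SubjectCache` keyed on some stable item
identifier (neither the class nor the key is from the spec):

```python
from typing import Hashable, Optional

class SubjectCache:
    """Hypothetical cache that hands out blank node labels for items."""

    def __init__(self) -> None:
        self._cache: dict = {}
        self._count = 0

    def _fresh(self) -> str:
        label = f"_:n{self._count}"
        self._count += 1
        return label

    def subject_for(self, item_key: Hashable, item_type: Optional[str]) -> str:
        # Untyped items are never cached: each visit gets its own node.
        if item_type is None:
            return self._fresh()
        # Typed items are cached, so sharing them yields a single node.
        if item_key not in self._cache:
            self._cache[item_key] = self._fresh()
        return self._cache[item_key]

cache = SubjectCache()
typed_1 = cache.subject_for("adr", "http://example.com/adr")
typed_2 = cache.subject_for("adr", "http://example.com/adr")
untyped_1 = cache.subject_for("x", None)
untyped_2 = cache.subject_for("x", None)
print(typed_1 == typed_2, untyped_1 == untyped_2)
```

With this, a shared typed subitem stays one node, while a shared untyped
subitem becomes one node per parent and each copy only carries the
properties derived from that parent's type.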

[1]  
<http://foolip.org/microdatajs/live/?html=%3Cdiv%20itemscope%20itemtype%3D%22http%3A%2F%2Fexample.com%2Fa%22%20itemref%3D%22x%22%3E%3C%2Fdiv%3E%0A%3Cdiv%20itemscope%20itemtype%3D%22http%3A%2F%2Fexample.com%2Fb%22%20itemref%3D%22x%22%3E%3C%2Fdiv%3E%0A%3Cdiv%20id%3D%22x%22%3E%20%3Cp%20itemprop%3D%22q%22%20itemscope%3E%20%3Cspan%20itemprop%3D%22r%22%3Es%3C%2Fspan%3E%20%3C%2Fp%3E%20%3C%2Fdiv%3E#turtle>

-- 
Philip Jägenstedt
Received on Tuesday, 2 February 2010 22:45:08 UTC