Re: Microdata to RDF conversion

On Tue, 2 Feb 2010, Philip Jägenstedt wrote:
> On Fri, 29 Jan 2010 10:29:54 +0100, Ian Hickson <ian@hixie.ch> wrote:
> > On Wed, 20 Jan 2010, Philip Jägenstedt wrote:
> > > On Tue, 19 Jan 2010 09:22:25 +0100, Ian Hickson <ian@hixie.ch> wrote:
> > > > On Sun, 17 Jan 2010, Philip Jägenstedt wrote:
> > > > > There's an issue with how vocabularies that use subitems are
> > > > > currently handled. In short, triples are only generated if the item
> > > > > either has a type which is an absolute URL or if the item property
> > > > > is an absolute URL. This prevents site-private data from being
> > > > > exported as RDF, which is a good thing. However, for vocabularies
> > > > > which have an item type for the top-level item but not for subitems
> > > > > (which seems quite unnecessary) this means that no triples are
> > > > > > generated for the subitems, even though the subitem could reasonably be
> > > > > considered to be using the same vocabulary as the typed top-level
> > > > > item. To illustrate the point, here's the output of the RDF
> > > > > extraction (as Turtle) from the Jack Bauer example if the current
> > > > > spec is honored: [...]
> > > > >
> > > > > As you see, the structured subitems org, adr, etc just point to
> > > > > blank nodes with no further triples for those nodes. My fix is to
> > > > > pass on the type of the parent item when generating triples for
> > > > > subitems as a default, which is overridden if the subitem defines
> > > > > its own type (as e.g. agent does in the above). I think this is
> > > > > sensible and it certainly produces a more complete RDF graph: [...]
> > > > 
> > > > That works if you know the vocabulary and thus know that the nested
> > > > subitem is from that vocabulary, but it seems highly suspect in the
> > > > case where you don't know that. Also, consider:
> > > > 
> > > >   <div itemscope itemtype="http://example.com/person">
> > > >    <p itemprop="school" itemscope>
> > > >     I go to school in the <span itemprop="class">middle</span> classroom.
> > > >    </p>
> > > >    <p itemprop="demographics" itemscope>
> > > >     I am <span itemprop="class">middle</span>-classed.
> > > >    </p>
> > > >   </div>
> > > > 
> > > > (A bit contrived, but you get the idea.) It would be wrong to use the
> > > > same predicate for both itemprop="class" cases.
> > > 
> > > Since no itemtype is used for itemprop="school",
> > > http://example.com/person must define this as part of its vocabulary,
> > > unless the above is an example of invalid markup. Since it's all one
> > > vocabulary, using the same prefix for the RDF predicates seems quite
> > > logical.
> > 
> > I really don't think they're the same predicate, but I agree that we need
> > to expose these triples somehow. Consider:
> > 
> >   <div itemscope itemtype="http://example.com/a" itemref="x"></div>
> >   <div itemscope itemtype="http://example.com/b" itemref="x"></div>
> > >   <div id="x"> <p itemprop="q" itemscope> <span itemprop="r">s</span> </p>
> > >   </div>
> > 
> > Right now this generates four blank nodes, which is a bug, it should
> > generate three. But if we generate three, then what predicate do we use
> > for the itemprop?
> > 
> > I ended up going with kind of a compromise solution. The above generates
> > _four_ triples, but _three_ nodes:
> > 
> >   @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> >   @prefix eg:
> > <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2F> .
> > 
> >   _:n0 rdf:type <http://example.com/a> ;
> >        <eg:a%23%3Aq> _:n2 .
> > 
> >   _:n1 rdf:type <http://example.com/b> ;
> >        <eg:b%23%3Aq> _:n2 .
> > 
> >   _:n2 <eg:a%23%3Aq%20r> "s" ;
> >        <eg:b%23%3Aq%20r> "s" .
> > 
> > Basically, instead of using "type:name", I used "type:parent-name name",
> > where the space character is another character that, like ":", cannot
> > appear in "name" and thus is usable here without making anything
> > ambiguous. Hopefully this solves the problem relatively neatly, if not in
> > the most performance-optimal way.
> 
> The item-subject cache seems reasonable. Handling of subitems, however, is
> really messy. I can't tell if it's a bug or not but there's double-encoding in
> step 5 and a colon too many appended somewhere, so the actual result is: [1]
> 
> @prefix ega:
> <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2Fa%23%3A> .
> @prefix egb:
> <http://www.w3.org/1999/xhtml/microdata#http%3A%2F%2Fexample.com%2Fb%23%3A> .
> 
> _:n0
>  rdf:type <http://example.com/a> ;
>  ega:q _:n1 .
> _:n1
>  ega:%3Aq%2520r "s" ;
>  egb:%3Aq%2520r "s" .
> _:n2
>  rdf:type <http://example.com/b> ;
>  egb:q _:n1 .
> 
> Notice the extra %3A (: %-escaped) and %2520 (%20 %-escaped). 

That's a bug. The %s are probably being escaped again in the step that 
adds the "http://www.w3.org/1999/xhtml/microdata#", but shouldn't be, 
because they are valid in the <ifragment> production. That wasn't at all 
clear in the spec, so I've made it clearer.

There's also no need to escape the "/"s. They are allowed in <ifragment> 
components.
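
As a rough illustration of the symptoms reported above, here is a minimal
Python sketch. It assumes that urllib.parse.quote with a chosen set of safe
characters is a fair stand-in for the escaping step; it is not the spec's
algorithm, and the variable names are made up for the example.

```python
from urllib.parse import quote

# Hypothetical inner fragment that has already been escaped once: the space
# between the parent name "q" and the property name "r" is written as %20.
once_escaped = "q%20r"

# Escaping everything again rewrites "%" as "%25", giving the %2520 seen above.
double_escaped = quote(once_escaped, safe="")

# Treating "%", "/" and ":" as safe leaves the fragment untouched.
kept = quote(once_escaped, safe="/:%")

# With no safe characters, "/" and ":" are escaped too (%2F and %3A).
over_escaped_type = quote("http://example.com/a", safe="")

print(double_escaped)
print(kept)
print(over_escaped_type)
```

The only difference between the two calls is whether "%" counts as safe,
which is the whole of the double-encoding problem.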

To elaborate:

> > <div id=a itemscope itemtype="http://example.com/a" itemref="x"></div>
> > <div id=b itemscope itemtype="http://example.com/b" itemref="x"></div>
> > <div id=x>
> >   <p id=q itemprop="q" itemscope>
> >     <span id=r itemprop="r">s</span>
> >   </p>
> > </div>

The item with id=a has type "http://example.com/a".
  
It has a property "q"; in RDF terms, its URL is constructed from
"http://example.com/a" and "q", as follows:

   "http://www.w3.org/1999/xhtml/microdata#" + escape("http://example.com/a#:" + escape("q"))
   => "http://www.w3.org/1999/xhtml/microdata#http://example.com/a%23:q"

"q" has a property "r"; in RDF terms, its URL is constructed from the type
of "q" plus the name of the property "r". The type of "q" is blank, so we 
pretend it had a type constructed from the same bits as the earlier URL:  
  
   "http://example.com/a#:" + escape("q")
   => "http://example.com/a#:q"

That makes the URL for "r":

   "http://www.w3.org/1999/xhtml/microdata#" + escape("http://example.com/a#:" + escape("q") + "%20" + escape("r"))
   => "http://www.w3.org/1999/xhtml/microdata#http://example.com/a%23:q%20r"

...but there's no need to escape %-sequences again on the outer scope.
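
The construction above can be sketched in Python. This is only a sketch
under assumptions: escape_name(), escape_fragment() and predicate_url() are
hypothetical helpers, and urllib.parse.quote is used as an approximation of
the escaping, with "/", ":" and "%" left alone in the outer step.

```python
from urllib.parse import quote

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"


def escape_name(name: str) -> str:
    """Inner escape for a single property name (assumed: escape everything reserved)."""
    return quote(name, safe="")


def escape_fragment(text: str) -> str:
    """Outer escape: keep "/", ":" and existing %-sequences as they are."""
    return quote(text, safe="/:%")


def predicate_url(item_type: str, path: list[str]) -> str:
    """Build the predicate URL for the last name in `path`.

    `path` lists the property names from the typed item down to the
    property itself, e.g. ["q", "r"] for "r" inside the untyped item "q".
    """
    names = "%20".join(escape_name(name) for name in path)
    return MICRODATA_NS + escape_fragment(item_type + "#:" + names)


print(predicate_url("http://example.com/a", ["q"]))
print(predicate_url("http://example.com/a", ["q", "r"]))
print(predicate_url("http://example.com/b", ["q", "r"]))
```

The "#" in the type is escaped to %23 once, while the %20 separator between
"q" and "r" passes through the outer step unchanged.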


> In my opinion strange %-escaping of the URLs is acceptable as long as 
> it's hidden in the prefix, but the above (even if bugfixed) is too messy 
> -- we might as well not bother and let everyone roll their own 
> vocabulary-specific extraction.

There's no prefixing in RDF, only in (some) RDF serializations. I don't 
see why anyone would want to look at the output of the microdata-to-RDF 
conversion; the whole point would be to just put it in a triple store and 
compute on it, not play with it by hand. If you want to play with it by 
hand, just use microdata.


> The intention was to address the case where two differently typed items 
> share a single typeless subitem. Is this a reasonable case to begin with 
> and should it even be valid?

I don't see why it wouldn't be allowed to be valid. That would be quite an 
odd constraint.


> It would only make sense if two vocabularies happened to share the same 
> names and structure for something, like say adr in vcard. Whenever this 
> is intentional I would argue that the subitem should have a type that is 
> shared between the two vocabularies.

I don't disagree, but we're not going to be able to convince people to 
stick itemtype=""s on their subtypes, that's far too verbose. It can be 
inferred from context, so the computer should do it.


> In all other cases this is more an attempt to optimize for accidental 
> overlap. I don't think there's any reason to worry about it, but it 
> could just as well be done like this:
> 
> <div itemscope itemtype="http://example.com/a">
>  <div itemprop="q" itemscope itemref="x"></div>
> </div>
> <div itemscope itemtype="http://example.com/b">
>  <div itemprop="q" itemscope itemref="x"></div>
> </div>
> <div id="x"> <span itemprop="r">s</span> </div>
> 
> Without actual examples it's difficult to know what to make of this, but it 
> doesn't seem obviously better or worse.

I don't see how the above changes matters. We'd still need to come up with 
unique URLs for the two uses of property "r". In fact the above should 
have identical output to the earlier example, no?


> I'll simply reiterate my original suggestion to use the parent type as 
> the fallback type. The structure of the item is already reflected in the 
> RDF graph, there's really no need to also reflect it in the properties.

This can trivially lead to ambiguous cases. I don't think that's wise.


> In the odd case that two typed items shared a single untyped item the 
> result won't be *that* strange anyway:
> 
> _:n0
>  rdf:type <http://example.com/a> ;
>  ega:q _:n1 .
> _:n1
>  ega:r "s" ;
>  egb:r "s" .
> _:n2
>  rdf:type <http://example.com/b> ;
>  egb:q _:n1 .
> 
> (using the markup from [1])

   <div itemscope itemtype="http://example.com/student">
    <p itemprop="school" itemscope>
     <span itemprop="class">Mr Fitz</span>
    </p>
    <p itemprop="demographics" itemscope>
     <span itemprop="class">Poor</span>
    </p>
   </div>

What is the meaning of the "http://example.com/student#:class" predicate?
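
To make the collision concrete, here is a small Python sketch comparing the
two naming schemes on the markup above. Both functions are hypothetical and
the escaping (urllib.parse.quote with "/", ":" and "%" kept) is an
assumption, not the spec's text.

```python
from urllib.parse import quote

NS = "http://www.w3.org/1999/xhtml/microdata#"


def escape(text: str) -> str:
    # Assumed escaping: keep "/", ":" and existing %-sequences.
    return quote(text, safe="/:%")


def parent_type_predicate(item_type: str, path: list[str]) -> str:
    # Parent-type fallback: only the final property name is used.
    return NS + escape(item_type + "#:" + path[-1])


def path_predicate(item_type: str, path: list[str]) -> str:
    # Path-qualified scheme: every untyped ancestor name is included.
    return NS + escape(item_type + "#:" + "%20".join(path))


student = "http://example.com/student"
school_class = ["school", "class"]
demographics_class = ["demographics", "class"]

# Same predicate for two unrelated meanings of "class".
print(parent_type_predicate(student, school_class))
print(parent_type_predicate(student, demographics_class))

# Distinct predicates, one per path.
print(path_predicate(student, school_class))
print(path_predicate(student, demographics_class))
```

With the fallback scheme the two "class" properties become
indistinguishable; with the path-qualified scheme they stay separate.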


> Any strangeness is because the original markup was strange. It's still 
> possible to see which properties originated where simply from the URL 
> prefixes, if that is somehow important. However, I'm not sure that 
> caching untyped blank nodes is really a great idea. We could limit 
> caching to items with itemtype to allow sensible subitem sharing but not 
> output anything dodgy in cases like the above.

That's what we had originally -- items without a type were dropped. But 
people are going to use items like that. I don't think we should make the 
RDF conversion so brittle that apparently equivalent expressions fail or 
act differently.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 6 April 2010 21:59:49 UTC