Re: Suggestion for Microdata to RDF conversion from Benjamin Nowack on 2010-01-22 (public-html@w3.org from January 2010)

From: Benjamin Nowack <bnowack@semsol.com>
Date: Fri, 22 Jan 2010 11:22:50 +0100
To: Ian Hickson <ian@hixie.ch>
Cc: "Tab Atkins Jr." <jackalmage@gmail.com>, public-html@w3.org, "Philip Jägenstedt" <philipj@opera.com>
Message-ID: <PM-GA.20100122112250.978C2.3.1D@semsol.com>

On 22.01.2010 09:30:58, Ian Hickson wrote:
>On Fri, 22 Jan 2010, Benjamin Nowack wrote:
>> 
>> P.S. as I just saw Ian's comment on IRC[1]:
>> 
>> This algorithm ignores non-RDF structures such as
>> 
>>    <div itemscope itemtype="http://example.com/">
>>       <span itemprop="a/b"/>
>>    </div>
>> or
>>    <div itemscope itemtype="http://example.com/a/">
>>       <span itemprop="b"/>
>>    </div>
>>
>> because common RDF vocabularies simply don't use URI patterns like 
>> "http://example.com/" or "http://example.com/a/" to declare resource 
>> types.
>
>Given this regexp (from your earlier e-mail):
>
>   /^(.*[\/\#])([^\/\#]+)$/
>
>...I understand that "a/b" wouldn't be usable as a keyword, because the 
>regexp's second pattern doesn't match strings with / and # characters. But 
>why would the other three types not work?
The regexp only applies to itemtype. itemprops wouldn't have any # or
/ at all. "http://example.com/" and "http://example.com/a/" are not
accepted by the regex because there are no (RDF) vocabularies where
these URLs (trailing slash or hash) are used to specify a type.
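
To make the matching behaviour concrete, here is a rough Python port of
the regexp quoted above. It is only a sketch: the function name
split_itemtype is made up for illustration, and the original was written
with PHP-style delimiters, not Python.

```python
import re

# Port of /^(.*[\/\#])([^\/\#]+)$/ : group 1 is the namespace part (ends in
# "/" or "#"), group 2 is the local name (must not contain "/" or "#").
ITEMTYPE_RE = re.compile(r"^(.*[/#])([^/#]+)$")


def split_itemtype(itemtype):
    """Return (namespace, local_name), or None if the itemtype is rejected.

    Hypothetical helper, not part of any spec or library.
    """
    match = ITEMTYPE_RE.match(itemtype)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for itemtype in (
        "http://example.com/",
        "http://example.com/a/",
        "http://example.com/vocab#",
        "http://example.com/vocab#Example",
    ):
        print(itemtype, "->", split_itemtype(itemtype))
```

Only the last itemtype is accepted; the first three end in "/" or "#",
so there is no local name left for the second group to match.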

>Into what RDF statements would your proposal turn the above two examples?
My algorithm doesn't fire on those. The itemtype is not an RDF class. No 
typical RDF is generated. It may be converted to the prefixed/escaped
triples, but I don't think RDFers would really define OWL axioms for
each and every type and prop to extract sane RDF from those triples. 
I'm not aware of many OWL apps that apply ontological operations to RDF 
from HTML in the wild.

>Another example would be:
>
>    <div itemscope itemtype="http://example.com/vocab#">
>       <span itemprop="x"/>
>       <span itemprop="http://example.com/vocab#x"/>
>    </div>
>
>For sanity, in the microdata model, this has to be two distinct 
>properties. What RDF would your proposal convert the above into?
Same as above, the itemtype is not an RDF class.

Here is one that'd be RDF:

   <div itemscope itemtype="http://example.com/vocab#Example">
      <span itemprop="x"/>
      <span itemprop="http://example.com/vocab#x"/>
   </div>

Assuming that empty values make sense, the two properties would 
result in the same predicate URI: http://example.com/vocab#x
because "x" (per spec wording) is from the same vocabulary as 
http://example.com/vocab#Example, and "http://example.com/vocab#x"
is a full URI, which happens to be from the same vocab, too. 
It's fine to have 2 distinct properties in the Microdata model
including the DOM API, but effectively just one in RDF. The RDF 
model differs in other situations, too (graph vs. tree etc). If
the 2 models were identical, there wouldn't have been a need for 
Microdata in the first place. It would of course be possible to
mandate that URI-based itemprops MUST NOT be from the same 
vocabulary specified by the itemtype. This would be intuitive
as URI-based itemprops are meant to enable vocab mixing. It doesn't
make too much sense to specify a context vocabulary and still use 
fully qualified itemprop URLs.
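
The resolution described above could be sketched in Python roughly as
follows. The function name predicate_uri is hypothetical, and two details
are my assumptions rather than anything the spec says: a "full URI" is
detected by the presence of a URI scheme, and a plain itemprop containing
"/" or "#" is skipped.

```python
import re
from urllib.parse import urlparse

# Same itemtype pattern as the regexp quoted earlier in this message.
ITEMTYPE_RE = re.compile(r"^(.*[/#])([^/#]+)$")


def predicate_uri(itemtype, itemprop):
    """Return the RDF predicate URI for an itemprop, or None if nothing fires.

    Hypothetical helper illustrating the proposal, not a spec algorithm.
    """
    match = ITEMTYPE_RE.match(itemtype)
    if match is None:
        # The itemtype is not an RDF class, so no typical RDF is generated.
        return None
    if urlparse(itemprop).scheme:
        # Assumption: anything with a scheme is a full URI and is used as-is.
        return itemprop
    if "/" in itemprop or "#" in itemprop:
        # Assumption: plain itemprops never contain "/" or "#".
        return None
    # A plain name comes from the same vocabulary as the itemtype.
    return match.group(1) + itemprop


if __name__ == "__main__":
    rdf_type = "http://example.com/vocab#Example"
    print(predicate_uri(rdf_type, "x"))
    print(predicate_uri(rdf_type, "http://example.com/vocab#x"))
```

Both calls yield http://example.com/vocab#x, i.e. two distinct Microdata
properties but one RDF predicate.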


>> Requiring OWL magic to convert Microdata to its target RDF vocabulary 
>> makes Microdata even more complex to understand than RDFa. OWL is well 
>> beyond what a Microdata-to-RDF parser writer should need to know.
>
>The parser wouldn't need to know it at all, that's the point. The parser 
>can just convert it all into RDF, and then a simple blob of OWL can be 
>added to the triple store so that any RDF use of the data will treat the 
>microdata-originating properties as equivalent to the more commonly used 
>RDF vocabularies'. (After all, if the user didn't intend to use tools that 
>leverage the power of RDF, there's really not much point going to the 
>trouble to convert everything into RDF in the first place. The user could 
>just as easily simply use a JSON-like data structure, which is easier to 
>understand and query for most purposes.)
Well, ... ;) RDF is RDF, and OWL is OWL. Even if certain OWL axioms can be
written in simple RDF blobs, this doesn't mean that evaluating these
definitions is equally simple. You need an inference engine or at least
a SPARQL processor with UPDATE functionality. The overlap between people
who use RDF as a data integration mechanism and those who run OWL engines
is pretty small. Have a look at [1], you can find communities for each
colour, some overlap, some don't. I've created dozens of RDF apps, I can't
remember when I last required OWL. 

The problem with OWL-based Microdata to RDF mappings is that someone would
have to define a mapping for each term. Unfortunately, there is no RDF 
mechanism where you could auto-convert all terms prefixed with 
"http://www.w3.org/1999/xhtml/microdata#" to something else. And even
if you had the OWL axioms and an OWL processor, you'd end up with twice
as many triples as the parser generated. And inferred triples
are not necessarily associated with the same originating graph, i.e.
you'd lose provenance information, unless you build the OWL processing
into the parser.
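
A toy illustration of the "twice the triples" point, in plain Python
without any RDF or OWL library. Everything here is assumed for the sake
of the example: the urn:example: predicates merely stand in for the
prefixed/escaped URIs the parser would emit, the mapping dict stands in
for per-term equivalence axioms, and infer() is a hypothetical stand-in
for what an inference engine would do.

```python
# Triples as the parser might emit them (placeholder predicate URIs).
asserted = {
    ("_:item1", "urn:example:prefixed:name", "Example name"),
    ("_:item1", "urn:example:prefixed:homepage", "http://example.org/"),
}

# One hand-written mapping per term, standing in for the OWL axioms.
equivalent = {
    "urn:example:prefixed:name": "http://example.com/vocab#name",
    "urn:example:prefixed:homepage": "http://example.com/vocab#homepage",
}


def infer(triples, mapping):
    """Return the extra triples implied by the predicate mapping."""
    return {(s, mapping[p], o) for (s, p, o) in triples if p in mapping}


inferred = infer(asserted, equivalent)
store = asserted | inferred

if __name__ == "__main__":
    print(len(asserted), "asserted,", len(store), "after inference")
```

Every asserted triple gains an inferred twin, and each term needs its own
entry in the mapping. Nothing in the inferred set records which document
it came from.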

Having said that, there is still an easy way to end up with proper RDF
triples even if the conversion algorithm is kept as is. The parser just
reverts the prefixing/escaping in case of RDF itemtypes. But it would
be nice if RDF converters didn't need that extra step.

Cheers,
Benji

[1] http://bnode.org/blog/2009/07/08/the-semantic-web-not-a-piece-of-cake


>
>-- 
>Ian Hickson               U+1047E                )\._.,--....,'``.    fL
>http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
>Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
>
Received on Friday, 22 January 2010 10:23:19 UTC