Re: Updated Microdata to RDF spec from Gregg Kellogg on 2011-11-22 (public-html-data-tf@w3.org from November 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Mon, 21 Nov 2011 20:15:24 -0500
To: Ivan Herman <ivan@w3.org>
CC: HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-ID: <75C12D80-F33A-438F-8A53-5A88C69AA01B@greggkellogg.net>
Hi Ivan,

On Nov 21, 2011, at 1:58 AM, Ivan Herman wrote:

> Gregg,
> 
> thanks for folding in all the issues!
> 
> However... I must admit I am quite unhappy with the current design due to the necessity to use a registry. Introduction of such a registry would, I believe, make the md->RDF conversion process way too complicated. We'd also get to the problem of a dependency on the network for each and every conversion process, with all the consequences for efficiency and the necessity to design around network failure. These are exactly the reasons why the RDFa WG, for example, dropped the @profile idea a while ago, in spite of the elegance of that approach.

I understand, and I don't think the registry will survive the process either, but at this point in the process I wanted to preserve all the options and the registry is a useful rhetorical vehicle for describing different behavior. My marching orders were to include many of the different options so that a future Working Group could have a menu of behaviors to decide from.

If there were a registry, my assumption would be that it was not dynamic, but would be set up with just a few vocabularies, with the default behavior useful for the majority of other vocabularies. In fact, leaving out Microformat vocabularies, the only one that may need special treatment is schema.org, due to its extension mechanism.

I'd be happy to remove an explicit registry, and use an equivalent internal mechanism for vocabulary detection when parsing, if that's the decision of the group.
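
A minimal sketch of what such an internal mechanism could look like, assuming a fixed table compiled into the parser and a longest-prefix match against the @itemtype. The names REGISTRY, DEFAULT_RULES and lookup, and the shape of the rule dictionaries, are hypothetical and are not taken from the draft:

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule set applied to any vocabulary that is not listed below.
DEFAULT_RULES: Dict[str, str] = {"property_uri": "vocabulary"}

# Hypothetical static table: vocabulary prefix -> processing rules.
# It is compiled into the parser, so no network access is needed.
REGISTRY: Dict[str, Dict[str, str]] = {
    "http://schema.org/": {"property_uri": "vocabulary"},
}


def lookup(itemtype: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (vocabulary prefix, rules) for an item type.

    The longest registered prefix wins, so an extension type such as
    http://schema.org/Person/Teacher still resolves to http://schema.org/.
    Unknown types get (None, DEFAULT_RULES).
    """
    matches = [prefix for prefix in REGISTRY if itemtype.startswith(prefix)]
    if not matches:
        return None, DEFAULT_RULES
    prefix = max(matches, key=len)
    return prefix, REGISTRY[prefix]
```

With this table, lookup("http://schema.org/Person/Teacher") resolves to the http://schema.org/ entry, while a type from an unlisted vocabulary falls back to the defaults.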

> Also: from an RDF usage point of view (and this is clearly my concern), what you call the 'contextual' approach will rarely be used by the RDF community in my view. If I take the schema.org example, a URI of the form
> 
> http://www.w3.org/ns/md?type=http://schema.org/Person&prop=name.
> 
> would be unnecessarily complex. Even if URI-s are opaque in RDF, practical usage can be hindered by such URI-s.
> 
> So... we may have to accept that, in some cases, the md->RDF conversion is lossy. Lossy in the sense that it may not reflect, by default, all the intentions of the microdata design (you yourself make this note in the document on the section of "vocabulary"). 

My view is that the "contextual" scheme will end up not being necessary. It was there principally to address Hixie's assertion that property tokens take on different semantics depending on the context they're used in, and that is the reason for constructing different URIs based on the context. To the degree that this thinking was influenced by Microformats, I think the view may not be correct. Tantek asserted at the TPAC breakout that properties should be in a flat namespace, which is consistent with our "vocabulary" scheme.
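
To make the difference between the two schemes concrete, here is a rough sketch of both property URI constructions. The function names are hypothetical. The contextual form follows the URI shape quoted above and, like that example, does not percent-encode the embedded type URI:

```python
# Namespace from the draft, as quoted in this thread.
MD_NS = "http://www.w3.org/ns/md"


def contextual_property_uri(itemtype: str, prop: str) -> str:
    """Build a URI that depends on the item type the property appears in."""
    return f"{MD_NS}?type={itemtype}&prop={prop}"


def vocabulary_property_uri(vocabulary: str, prop: str) -> str:
    """Build a URI in a flat namespace: the vocabulary URI plus the token."""
    return vocabulary + prop


if __name__ == "__main__":
    print(contextual_property_uri("http://schema.org/Person", "name"))
    print(vocabulary_property_uri("http://schema.org/", "name"))
```

In the contextual form the same token yields a different predicate for every item type, whereas the vocabulary form gives one predicate per token.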

That said, reducing it down to a single scheme still isn't enough to get rid of the need for a registry, because we can't reliably intuit the base URI of the vocabulary from an @itemtype (given schema.org's extension model, and, for example, the hcard vocabulary URI).

One possible (though not perfect) way to deal with this would be to detect the vocabulary URI from the first use of an @itemtype, not with each definition. There is still the potential problem that the _only_ use of @itemtypes is for extensions (e.g., http://schema.org/Person/Teacher), and we would guess incorrectly. This could possibly be addressed through the use of best practices, where the extension URI would always be used along with the prime URI (http://schema.org/Person, in this case). For example:

<div itemscope itemtype="http://schema.org/Person http://schema.org/Person/Teacher">
  <p itemprop="name">Ivan Herman</p>
</div>

would allow us to determine that the base vocabulary URI is http://schema.org/, and correctly construct http://schema.org/name as the predicate URI.
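
A rough sketch of that detection follows. It assumes the vocabulary URI is obtained by cutting the first listed type after its last "/" or "#"; that rule is an assumption for illustration, not something the draft specifies, and the function names are hypothetical:

```python
def guess_vocabulary(itemtype_attr: str) -> str:
    """Guess the vocabulary URI from the first type listed in @itemtype."""
    types = itemtype_attr.split()
    if not types:
        raise ValueError("empty @itemtype")
    first_type = types[0]
    # Keep everything up to and including the last "/" or "#".
    cut = max(first_type.rfind("/"), first_type.rfind("#"))
    return first_type[: cut + 1]


def predicate_uri(itemtype_attr: str, prop: str) -> str:
    """Construct the predicate URI for a property token."""
    return guess_vocabulary(itemtype_attr) + prop


if __name__ == "__main__":
    both = "http://schema.org/Person http://schema.org/Person/Teacher"
    print(predicate_uri(both, "name"))
    # With only the extension type listed, the guess comes out wrong:
    print(predicate_uri("http://schema.org/Person/Teacher", "name"))
```

The second call shows the failure case: with only the extension type present, the guessed base is http://schema.org/Person/ rather than http://schema.org/.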

> To make things more specific, here are some thoughts that are, in my view, worth discussing.
> 
> 1. Think about a "Conversion Lite" and "Conversion Full". "Conversion Lite" should be usable without any registry whatsoever. We _may_ think about a "Conversion Full" for the few cases that do not work with Lite and where we must reflect the original design, e.g., the "contextual" option. My personal expectation, which is of course not proven at this point, is that the need for 'Full' may be minor. (See the items below for the various defaults.)
> 
> 2. The "Lite" version for the property URI generation should be "vocabulary". At this moment I do not see any real use case for "contextual" used out there on the Web, whereas "vocabulary" should work with, say, schema.org, which is, clearly, _the_ major use case for microdata or with the hcard example which would make it compatible with the current RDF mappings of cards.

It works with hCalendar, but not hCard, sadly. And the schema.org extension mechanism is a potential issue. We could consider baking in the Microformat URIs, as they are pretty stable.

> 3. As far as datatype generation is concerned, we should, mostly, keep away from that (certainly for "Lite"). Microdata does not care about datatypes, we should simply run with that and not try to outsmart microdata. The exception may be when a specific HTML element does define datatypes, like the <time> element. If the author cares about RDF Datatypes, he/she should use RDFa 1.1 Lite, whose complexity is comparable to that of microdata, ie, it would not make it more difficult to use. (Note that this is the same as the 'lang' issue. At the moment the md->RDF conversion does not care about language setting either.)

Agreed.
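
As a sketch of that rule, the following assigns a datatype only to values taken from a <time> element and leaves everything else as a plain literal. The patterns cover just three common lexical forms and are an assumption for illustration; they do not reflect the full content model in the draft or in WHATWG HTML:

```python
import re
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

_DATE = r"\d{4}-\d{2}-\d{2}"
_TIME = r"\d{2}:\d{2}:\d{2}(?:\.\d+)?"
_ZONE = r"(?:Z|[+-]\d{2}:\d{2})?"

# Checked in order; each pattern must match the whole value.
PATTERNS = [
    (re.compile(_DATE + "T" + _TIME + _ZONE), XSD + "dateTime"),
    (re.compile(_DATE + _ZONE), XSD + "date"),
    (re.compile(_TIME + _ZONE), XSD + "time"),
]


def literal_datatype(element_name: str, value: str) -> Optional[str]:
    """Return a datatype URI for <time> values, or None for a plain literal."""
    if element_name.lower() != "time":
        return None  # microdata itself carries no datatype information
    for pattern, datatype in PATTERNS:
        if pattern.fullmatch(value):
            return datatype
    return None  # unrecognized form: fall back to a plain literal
```

Any value outside a <time> element, or in a form the patterns do not recognize, stays untyped.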

> 4. For the value ordering issue: why don't we do both? What I mean is: we can simply generate both the unordered
> 
> <> property "a", "b" .
> 
> _as well as_
> 
> <> property ("a" "b") .
> 
> triples. Yes, this would add some more triples, but so what? We are not talking about thousands of triples in the case of microdata, i.e., I do not think this is really a huge practical issue. The user of the generated RDF can safely ignore the triples that are unwanted by the application.

Yes, this is a workable idea. It does make round-tripping a bit more difficult, but that's not an insurmountable problem.
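
A minimal sketch of generating both forms for one property, with triples represented as plain string tuples. The function name, the blank node labelling and the unescaped literal quoting are simplifications for illustration and are not how any particular implementation does it:

```python
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
Triple = Tuple[str, str, str]


def both_forms(subject: str, predicate: str, values: List[str],
               bnode_prefix: str = "l") -> List[Triple]:
    """Emit one triple per value, plus the same values as an rdf:List."""
    # Unordered form: <> property "a", "b" .
    triples: List[Triple] = [(subject, predicate, f'"{v}"') for v in values]
    if not values:
        return triples

    # Ordered form: <> property ("a" "b") .
    # Labels are only unique within one call unless bnode_prefix is varied.
    nodes = [f"_:{bnode_prefix}{i}" for i in range(len(values))]
    triples.append((subject, predicate, nodes[0]))
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", f'"{value}"'))
        triples.append((nodes[i], RDF + "rest", rest))
    return triples
```

For two values this yields seven triples: two for the unordered form, one linking to the list head, and four for the list cells. A consumer that only wants one form can filter on whether the object of the property is a literal or a list node.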

> Note that item #4 is where the central registry would fail the most clearly. To take the schema.org example, they have a vocabulary set once and for all (http://schema.org/) but they will add new properties continuously. How would anyone make sure that those property descriptions would end up in a central registry in time?

Note that my sample repository was an example only. It would be up to schema.org to specify which properties need to use an rdf:List, but your solution solves that.

> Another thought: we may think about folding into the md->RDF conversion the @vocab expansion mechanism of RDFa (maybe needless to say, but as an optional mechanism!). Some vocabularies, eg, schema.org, may set up such @vocab files anyway (we are already in discussion with DanBri on that), why not make use of those for this conversion, too?

If we can "extend" Microdata in this way, I think @vocab would be a fine solution. Of course at that point, the only real difference between Microdata and RDFa 1.1 Lite is the name of the attributes!

Thanks for the feedback!

Gregg

> Cheers
> 
> Ivan
> 
> 
> On Nov 19, 2011, at 03:16 , Gregg Kellogg wrote:
> 
>> I've completed a number of updates to the Microdata to RDF spec [1]. The live editor's draft is at [2]. I believe this addresses all of the issues that we've discussed. It's a pretty substantial update.
>> 
>> This version introduces the registry, in an ad-hoc JSON form, which allows vocabularies and particular properties to take on special processing attributes. This includes property URI generation and if values are placed in an rdf:List or not.
>> 
>> Note that the registry is defined to live at http://www.w3.org/ns/md, and uses http://www.w3.org/ns/md# as a prefix. The document is not actually loaded here at this point. I'm also exploring an RDF representation of the registry, which you can see here [3][4]. Note that in this case I'm using rdfs:range semantics to determine serialization, and I've suggested some schema.org properties that may want to use an rdf:List range.
>> 
>> This version retains the <time> element, although the content model has not been updated to include the latest WHATWG version (duration, gYear, etc. equivalents).
>> 
>> My Ruby (public domain) implementation is updated, and uses an internal version of the registry. It's available for download on GitHub [5] and a live running version is on my distiller [6].
>> 
>> Comments appreciated.
>> 
>> Gregg Kellogg
>> 
>> [1] https://dvcs.w3.org/hg/htmldata/raw-file/default/ED/microdata-rdf/20111118/index.html
>> [2] https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-rdf/index.html
>> [3] https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-namespace/ns.ttl
>> [4] https://dvcs.w3.org/hg/htmldata/raw-file/default/microdata-namespace/ns.jsonld
>> [5] http://github.com/gkellogg/rdf-microdata
>> [6] http://rdf.greggkellogg.net/distiller?in_fmt=microdata
> 
> 
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> FOAF: http://www.ivan-herman.net/foaf.rdf
> 
> 
> 
> 
>
Received on Tuesday, 22 November 2011 01:16:24 UTC