Re: htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata from Ivan Herman on 2011-10-20 (public-html-data-tf@w3.org from October 2011)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 20 Oct 2011 09:23:50 +0200
To: Gregg Kellogg <gregg@kellogg-assoc.com>, Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-Id: <BE7145E2-DBF2-4BA3-AC69-FCC82038F180@w3.org>
(Martin, an explicit question to you below)

On Oct 19, 2011, at 21:31 , Gregg Kellogg wrote:

> On Oct 19, 2011, at 3:24 AM, "Ivan Herman" <ivan@w3.org> wrote:
> 
>> Gregg,
>> 
>> before reflecting on the issues, can somebody tell me where those mystical processing rules are defined? I looked at the microdata spec, eg, 
>> 
>> http://dev.w3.org/html5/md/#selecting-names-when-defining-vocabularies
>> 
>> which does not tell me what these are. Although I give my comments below, I am a little bit bothered by the fact that we are talking about something that is a bit unspecified.
> 
> The current WD of HTML Microdata describes a means of constructing URIs for non-URI property names [3]. It's rather complex, and I think that the generated URIs might not even be legal. The editor's draft has since withdrawn this definition, so unless we resurrect or revise it, it will eventually just go away.

Well, another way of putting it is that the specification of [3] is moot, ie, non-binding...

[snip]

>> 
>>> 
>>> There are different directions the specification could take:
>>> 
>>> * The specification requires that each vocabulary used within a document have a documented
>>> vocabulary, and the processor has vocabulary-specific rules built into the processor. Documents
>>> using unrecognized vocabularies fall back to a base-level processing, similar to that currently defined.
>>> This includes vocabulary specific rules for multi-valued properties, property URI generation and
>>> property datatype coercion.
>>> 
>>> * Processors extract the vocabulary from @itemtype and attempt to load a RDFS/OWL definition.
>>> Property URIs are created by looking for appropriate predicates defined (or referenced) from within
>>> this document. Values are coerced to rdf:List only if the predicate has a range of rdf:List. Value
>>> datatypes are coerced to the appropriate datatype based on lexical value matching if there is more
>>> than one, or by using the specific datatype if only one is listed.
>>> 
>>> * We use a generic mapping of Microdata to RDF that does not depend on vocabulary-specific rules.
>>> This may include using RDF Collections for multiple values and vocabulary-relative property URI naming,
>>> or not as we decide. The processor then may be at odds with defined property generation and
>>> value ordering rules from HTML5 Microdata.
>>> 
>> 
>> Let me concentrate first on probably the most important issue, namely the choice of the URI for the predicate terms.
>> 
>> I believe that, for microdata, we have actually three major sources of data out there.
>> 
>> 1. schema.org
>> 2. existing RDF vocabularies that use the microdata syntax to encode data
>> 3. microformats vocabularies that use the microdata syntax to encode data
>> 
>> (Obviously, #1-#3 can be mixed within the same file.)
>> 
>> The RDF/OWL mapping of schema.org does exist, though I am not sure it is considered as final. But even if it is, it is one set of rules (I presume) for the full schema.org hierarchy. It affects the base URI for the predicate URI generation, and a processor may just know that. I.e., this falls under your first option, unless the OWL mapping is adapted (after all, that is still in flux) in which case the third option may also work.
> 
> This could fall under any of the options. If the ontology defined rdfs:domain for all properties, the URI can be determined by matching against all predicates in the scope of the associated type, otherwise, the rule would be to use the type URI as the basis, after removing the bit after '/' or '#'.
> 
> A processor would only need to provide explicit vocabulary support when the URI pattern falls outside of the default URI generation algorithm, and schema.org does fall under the default pattern (as do most other standard OWL/RDFS vocabularies).

Ah! I thought we had a problem with schema.org but, indeed, looking at:

http://schema.org/docs/schemaorg.owl

it seems that we do not. So schema.org falls under the generic scheme, which is good.
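
For concreteness, here is a minimal sketch of the default rule Gregg describes above (take the type URI, drop the bit after '/' or '#', append the property name). The function names are hypothetical, and passing absolute-URI property names through unchanged is an assumption on my side, not something decided in this thread.

```python
from urllib.parse import urlparse


def vocabulary_base(item_type: str) -> str:
    """Return the type URI up to and including '#' if present, else the last '/'."""
    if "#" in item_type:
        cut = item_type.index("#")
    else:
        cut = item_type.rfind("/")
    return item_type[: cut + 1]


def property_uri(item_type: str, name: str) -> str:
    """Build a predicate URI for a microdata property name (hypothetical helper)."""
    # Assumption: a name that already has a URI scheme is used as-is.
    if urlparse(name).scheme:
        return name
    return vocabulary_base(item_type) + name


print(property_uri("http://schema.org/Person", "name"))
print(property_uri("http://example.org/vocab#Thing", "label"))
```

With @itemtype set to http://schema.org/Person, the property "name" comes out as http://schema.org/name, so no schema.org-specific knowledge is needed in the processor.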

> 
>> For most of the RDF vocabularies, I actually think that our current approach (ie, third option, or the 'base' of the first option) on mapping would actually work fairly well. I have the gut feeling that it would cover a vast majority of vocabularies.
> 
> Except for those defined in the HTML spec, yes.

Sure, but those are documented and fixed, so that is fine.

> 
>> For microformats, I am, first of all, not sure that there is already a 'standard' on how a specific microformat vocabulary is mapped on microdata. But, as far as I know, there isn't any standard on a microformat->RDF generation either. What this means is that a generic decision may as well work right away, without further ado; we do not have some sort of a backward compatibility issue.
> 
> The html5 spec defines vocabularies for vCard, vEvent and license using Hixie's criteria.

Does it? I am looking at 

http://dev.w3.org/html5/md/

and I do not find anything. Is there any other HTML5 document that does it?


> Otherwise, I don't think there's a standard in place, although schema.org or data-vocabulary.org could be an effective replacement.
> 

Yes, well, this is a different discussion that is not the topic of this task force. The bottom line is that the microformat->RDF is fully open, no real practice...


>> What I am getting at is that defining a generic mapping as of now, and allowing the processors to have knowledge about the specificities of a particular vocabulary may not be so dramatic as it sounds. Ie, I believe that most of the vocabularies will work just fine with the generic mapping, and there will be only a few cases (say, schema.org) that would require knowledge. We may simply say that an update of that extra knowledge is published by the W3C once every, say, 6 months, a bit like the default prefixes in RDFa are handled. That may even be machine readable, with a very simple vocabulary, without going into the complexities of OWL. In other words, a mixture of your first and last option may just work in practice.
> 
> My preference would be option 3, where there is a single way to parse, without reference to vocabulary specifics. This would be in direct contrast with the HTML spec, but more in line with the needs of RDF tool chains.
> 

Yes, I see. I would like to see a use case where this does not work (and, off the top of my head, I do not see any).

Except that (hence my explicit cc to Martin): let us suppose that, eventually, the GR terms will be incorporated into schema.org. What this means is that all GR terms will be in the schema.org/ namespace, but that also means that a microdata->RDF mapping would produce the GR terms as 

http://schema.org/Blah

which is different than the current terms. Martin, what are your plans with the old URI-s in that situation?


>> For the datatype: well, the fact is, that microdata does not have datatypes. I think we should just accept that. The resulting RDF has a vocabulary URI; for specific vocabularies these may refer to an RDFS or OWL file, and RDF processors may want to pick those up and massage the RDF generated by the microdata conversion. That is outside the conversion itself, and is in the realm of the 'usual' management of RDF data. And if a specific vocabulary is very dependent on datatypes well, then, sorry, do not use microdata in the first place! It seems Google will, eventually, understand schema.org in RDFa 1.1, too, so for those cases RDFa 1.1 is also at the user's disposal without further issues.
> 
> I've suggested elsewhere that this type of massaging may be done through a post-processing stage, much like RDFa vocab entailment, however, it could also take advantage of datatype range information, which would cause the replacement of literals, thus it may be a step which generates a new graph.

The problem is that this would make it different than RDFa @vocab. RDFa @vocab considers subproperties and subclasses; if applied, and if the @vocab contains, say, 

<a> rdfs:subPropertyOf <b> .

then, if the RDFa contains something with <a>, the final graph would include

<x> <a> <y> .
<x> <b> <y> .

which is perfectly o.k. But if we apply the same approach for datatypes and range, then we would get something like

<w> <q> "123", "123"^^xsd:int .

in case the range of <q> is set to xsd:int. But this is not entirely kosher; what we want is to have

<w> <q> "123"^^xsd:int .

only. In other words, the management of the vocabulary should be incorporated into the core processing of the microdata->RDF mapping, and not as some sort of an optional post-processing step, which is the RDFa @vocab case.
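
To make the difference concrete, here is a small sketch, under my own assumptions: triples are plain Python tuples, the Literal class and both function names are made up for illustration, and neither function is taken from any RDFa or microdata specification. The subproperty step only ever adds triples, whereas the range-based step has to replace the plain literal.

```python
from dataclasses import dataclass
from typing import Optional

XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


@dataclass(frozen=True)
class Literal:
    """Hypothetical literal: lexical form plus optional datatype URI."""
    lexical: str
    datatype: Optional[str] = None


def add_subproperty_triples(graph, sub_property_of):
    """@vocab-style step: purely additive, the input triples all survive."""
    extra = {(s, sub_property_of[p], o) for (s, p, o) in graph if p in sub_property_of}
    return graph | extra


def coerce_by_range(graph, ranges):
    """Datatype step: the plain literal is replaced, not accompanied, by the typed one."""
    result = set()
    for s, p, o in graph:
        if p in ranges and isinstance(o, Literal) and o.datatype is None:
            o = Literal(o.lexical, ranges[p])
        result.add((s, p, o))
    return result


print(add_subproperty_triples({("x", "a", "y")}, {"a": "b"}))
print(coerce_by_range({("w", "q", Literal("123"))}, {"q": XSD_INT}))
```

The first call yields both <x> <a> <y> and <x> <b> <y>; the second yields only the typed "123"^^xsd:int, which is the behaviour an additive post-processing step cannot give us.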

Ivan

> 
>> I am not yet sure what to do about the list handling issue, I must admit. I think I would prefer not to generate lists by default.
> 
> I'd rather it wasn't there either.
> 
>> Ok, now shoot at me:-)
>> 
>> Ivan
> 
> Gregg
> 
>>> [1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
>>> [2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.htm
> [3] http://www.w3.org/TR/2011/WD-microdata-20110525/#rdf
> 
>> ----
>> Ivan Herman, W3C Semantic Web Activity Lead
>> Home: http://www.w3.org/People/Ivan/
>> mobile: +31-641044153
>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>> 
>> 
>> 
>> 
>> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Thursday, 20 October 2011 07:22:25 UTC