htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata from HTML Data Task Force Issue Tracker on 2011-10-18 (public-html-data-tf@w3.org from October 2011)

From: HTML Data Task Force Issue Tracker <sysbot+tracker@w3.org>
Date: Tue, 18 Oct 2011 18:29:37 +0000
To: public-html-data-tf@w3.org
Message-Id: <E1RGEPx-0007C8-OX@barney.w3.org>

htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata

http://www.w3.org/2011/htmldata/track/issues/1

Raised by: Gregg Kellogg
On product:

HTML5 Microdata requires that different vocabularies be processed using normative rules defined in a specification. As Hixie points out in [1],

hixie> Incidentally, note that you can't just take, say, an RDF vocabulary, or a
hixie> Microformats vocabulary, and just use it in microdata directly. A
hixie> microdata vocabulary has to define processing rules that are often not
hixie> provided for RDF and Microformats vocabularies, and has to use the terms
hixie> defined in the HTML specification to describe how the terms work. You can
hixie> see examples of how to define vocabularies in the HTML standard:

This requirement is at odds with typical vocabulary usage in RDF, where an RDF processor is capable of handling a vocabulary without having built-in knowledge of its specific processing rules.

This relates to Microdata to RDF processing in several different ways:

* A vocabulary may define specific rules for generating property URIs from tokens. For example,
the vCard vocabulary defines properties to be specific to each type, rather than common across
the vocabulary. Thus, the _fn_ property for a vCard would have the URI
http://microformats.org/profile/hcard#fn, rather than http://microformats.org/profile/fn.
(Actually, also prefixed by http://www.w3.org/1999/xhtml/microdata# with type fragment escaped).

Furthermore, the _family-name_ property is defined to occur only inside an _n_ property. This
might be interpreted as http://microformats.org/profile/hcard#:n%20family-name.

* A vocabulary may use normal RDF property URI generation based on the vocabulary of the type.
For example, if type is http://xmlns.com/foaf/0.1/Person, the property _name_ takes the URI form
of http://xmlns.com/foaf/0.1/name. (See mfhepp discussion in [2].)

There is currently no normative way for a vocabulary to define generic property URI generation rules.
This could be done by looking at predicates that have the type as an rdfs:domain and choosing
a predicate which ends with the property _name_.

Alternatively, a vocabulary could define an owl:AnnotationProperty with a pre-defined set of rules
for generating property URIs from names and types. For example

md:propertyURIGeneration a owl:AnnotationProperty;
rdfs:range [ a owl:Class; owl:allPropertiesFrom <PropertyScheme> ] .

md:vocabularyProperty a <PropertyScheme> .

md:typeProperty a <PropertyScheme> .

Vocabulary descriptions could then contain such an annotation property as a predicate on the
owl:Ontology.
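
To make the two schemes concrete, here is a minimal sketch in Python. The function names are hypothetical (they only mirror md:typeProperty and md:vocabularyProperty above), and it assumes the item type URI carries no fragment of its own for the type-specific case. It ignores the http://www.w3.org/1999/xhtml/microdata# prefixing and the nested _n_ case.

```python
def type_property_uri(itemtype: str, name: str) -> str:
    """Type-specific scheme: the property hangs off the type URI as a fragment.

    Assumes itemtype has no fragment of its own (illustration only).
    """
    return itemtype + "#" + name


def vocabulary_property_uri(itemtype: str, name: str) -> str:
    """Vocabulary-relative scheme: swap the type's local name for the property name."""
    if "#" in itemtype:
        # Hash namespace: keep everything up to and including '#'.
        base = itemtype.split("#", 1)[0] + "#"
    else:
        # Slash namespace: keep everything up to and including the last '/'.
        base = itemtype.rsplit("/", 1)[0] + "/"
    return base + name


hcard_fn = type_property_uri("http://microformats.org/profile/hcard", "fn")
foaf_name = vocabulary_property_uri("http://xmlns.com/foaf/0.1/Person", "name")
```

With these inputs, hcard_fn is the hcard#fn form and foaf_name is the FOAF form quoted in the two bullets above.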

* Normal Microdata processing of multi-valued properties MUST maintain the order of property values.
The current Microdata to RDF specification does this by coercing multiple values into an RDF Collection.
For example, if a foaf:Group listed multiple items as members, it might be encoded as

<Group> a foaf:Group; foaf:member (<gregg> <mfhepp>) .

This is at odds with the needs of many vocabularies, such as Good Relations, which does not want
to maintain property order and cannot process a list [2].

It has also been noted that, for example, foaf:member has an rdfs:range of foaf:Agent, not rdf:List,
so a collection is an inappropriate object for the foaf:member predicate.

Vocabulary-aware processing would know not to create an RDF Collection in this case.

* Other than for the <time> element (eventually <data>, perhaps), literals in Microdata are untyped.
Good Relations requires that certain values have an RDF datatype. For example, the property
gr:hasCurrencyValue has a range of xsd:float. Vocabulary-aware processing could automatically
create a literal with the appropriate datatype.
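
The last two points can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any existing processor: the UNORDERED and DATATYPES tables are hypothetical hints that a vocabulary-aware processor would have to obtain from somewhere, triples are plain tuples, and the blank node labels are only unique within one call.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
FOAF = "http://xmlns.com/foaf/0.1/"
GR = "http://purl.org/goodrelations/v1#"

# Hypothetical vocabulary knowledge (assumed for this sketch).
UNORDERED = {FOAF + "member"}
DATATYPES = {GR + "hasCurrencyValue": XSD + "float"}


def typed(predicate, lexical):
    """Return a (lexical, datatype) literal; datatype is None when unknown."""
    return (lexical, DATATYPES.get(predicate))


def triples_for(subject, predicate, objects):
    """Emit triples for one property, using an RDF Collection only when order matters."""
    if len(objects) < 2 or predicate in UNORDERED:
        # One plain triple per value; value order is not preserved in RDF.
        return [(subject, predicate, o) for o in objects]
    # Otherwise build an rdf:first / rdf:rest chain ending in rdf:nil.
    nodes = [f"_:b{i}" for i in range(len(objects))]
    out = [(subject, predicate, nodes[0])]
    for i, (node, obj) in enumerate(zip(nodes, objects)):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        out.append((node, RDF + "first", obj))
        out.append((node, RDF + "rest", rest))
    return out


members = triples_for("<Group>", FOAF + "member", ["<gregg>", "<mfhepp>"])
price = typed(GR + "hasCurrencyValue", "9.99")
```

Here members holds two plain foaf:member triples rather than a collection, and price carries the xsd:float datatype; a predicate with no entry in either table gets the ordered, untyped treatment.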

There are different directions the specification could take:

* The specification requires that each vocabulary used within a document be documented,
and that the processor have vocabulary-specific rules for it built in. Documents
using unrecognized vocabularies fall back to a base-level processing, similar to that currently defined.
This includes vocabulary specific rules for multi-valued properties, property URI generation and
property datatype coercion.

* Processors extract the vocabulary from @itemtype and attempt to load an RDFS/OWL definition.
Property URIs are created by looking for appropriate predicates defined (or referenced) from within
this document. Values are coerced to rdf:List only if the predicate has a range of rdf:List. A value
is coerced to a datatype taken from the predicate's range: by lexical value matching if more than
one datatype is listed, or by using the specific datatype if only one is listed.

* We use a generic mapping of Microdata to RDF that does not depend on vocabulary-specific rules.
This may include using RDF Collections for multiple values and vocabulary-relative property URI naming,
or not, as we decide. The processor then may be at odds with defined property generation and
value ordering rules from HTML5 Microdata.
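
As a rough illustration of the first direction, built-in rules with a base-level fallback might look like the sketch below. Everything here is hypothetical: the VocabularyRules fields, the REGISTRY contents and the base defaults are placeholders chosen for the example, not anything the current draft defines.

```python
from dataclasses import dataclass, field

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"


@dataclass
class VocabularyRules:
    """Per-vocabulary processing rules (field names are assumptions)."""
    property_scheme: str = "vocabulary"   # or "type"
    ordered: bool = True                  # multiple values become an RDF Collection
    datatypes: dict = field(default_factory=dict)


# Base-level processing for unrecognized vocabularies (placeholder defaults).
BASE_RULES = VocabularyRules()

# Rules a processor would ship with, keyed by item type URI prefix.
REGISTRY = {
    "http://microformats.org/profile/hcard": VocabularyRules(property_scheme="type"),
    "http://purl.org/goodrelations/v1#": VocabularyRules(
        ordered=False,
        datatypes={"hasCurrencyValue": XSD_FLOAT},
    ),
}


def rules_for(itemtype: str) -> VocabularyRules:
    """Pick the rules whose prefix matches itemtype most specifically, else the base rules."""
    matches = [prefix for prefix in REGISTRY if itemtype.startswith(prefix)]
    return REGISTRY[max(matches, key=len)] if matches else BASE_RULES
```

The second direction would replace the static REGISTRY with rules derived from a loaded RDFS/OWL description, and the third would amount to always returning BASE_RULES.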

[1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
[2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.html

Received on Tuesday, 18 October 2011 18:29:42 UTC