htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata

http://www.w3.org/2011/htmldata/track/issues/1

Raised by: Gregg Kellogg
On product: 

HTML5 Microdata requires that different vocabularies be processed using normative rules defined in a specification. As Hixie points out in [1],

hixie> Incidentally, note that you can't just take, say, an RDF vocabulary, or a 
hixie> Microformats vocabulary, and just use it in microdata directly. A 
hixie> microdata vocabulary has to define processing rules that are often not 
hixie> provided for RDF and Microformats vocabularies, and has to use the terms 
hixie> defined in the HTML specification to describe how the terms work. You can 
hixie> see examples of how to define vocabularies in the HTML standard:

This requirement is at odds with typical vocabulary usage in RDF, where an RDF processor is capable of handling a vocabulary without having built-in knowledge of its specific processing rules.

This relates to Microdata to RDF processing in several different ways:

* A vocabulary may define specific rules for generating property URIs from tokens. For example,
  the vCard vocabulary defines properties to be specific to each type, rather than common across
  the vocabulary. Thus, the _fn_ property for a vCard would have the URI
  http://microformats.org/profile/hcard#fn, rather than http://microformats.org/profile/fn.
  (Actually, also prefixed by http://www.w3.org/1999/xhtml/microdata# with type fragment escaped).

  Furthermore, the _family-name_ property is defined to occur only inside an _n_ property. This
  might be interpreted as http://microformats.org/profile/hcard#:n%20family-name.

* A vocabulary may use normal RDF property URI generation based on the vocabulary of the type.
  For example, if type is http://xmlns.com/foaf/0.1/Person, the property _name_ takes the URI form
  of http://xmlns.com/foaf/0.1/name. (See mfhepp discussion in [2].)

  There is currently no normative way for a vocabulary to define generic property URI generation rules.
  This could be done by looking at predicates that have the type as an rdfs:domain and choosing
  a predicate which ends with the property _name_.

  Alternatively, a vocabulary could define an owl:AnnotationProperty with a pre-defined set of rules
  for generating property URIs from names and types. For example:

    md:propertyURIGeneration a owl:AnnotationProperty;
      rdfs:range [ a owl:Class; owl:allPropertiesFrom <PropertyScheme> ] .

    md:vocabularyProperty a <PropertyScheme> .

    md:typeProperty a <PropertyScheme> .

  Vocabulary descriptions could then contain such an annotation property as a predicate on the
  owl:Ontology.

* Normal Microdata processing of multi-valued properties MUST maintain the order of property values.
  The current Microdata to RDF specification does this by coercing multiple values into an RDF Collection.
  For example, if a foaf:Group listed multiple items as members, it might be encoded as

    <Group> a foaf:Group; foaf:member (<gregg> <mfhepp>) .

  This is at odds with the needs of many vocabularies, such as Good Relations, which does not want
  to maintain property order and cannot process a list [2].

  It has also been noted that, for example, foaf:member has an rdfs:range of foaf:Agent, not rdf:List,
  so a collection is an inappropriate object for the foaf:member predicate.

  Vocabulary-aware processing would know not to create an RDF Collection in this case.

* Other than for the <time> element (eventually <data>, perhaps), literals in Microdata are untyped.
  Good Relations requires that certain values have an RDF datatype. For example, the property
  gr:hasCurrencyValue has a range of xsd:float. Vocabulary-aware processing could automatically
  create a literal with the appropriate datatype.
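
To make the first and third points more concrete, here is a minimal Python sketch of the two property URI schemes and of the two ways to encode multiple values. It is only an illustration under stated assumptions: the function names, the blank node labels, and the use of "#" (or "%20" when the type already has a fragment) as the separator in the type-specific scheme are hypothetical and not taken from any specification.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def vocabulary_property_uri(item_type, name):
    """Vocabulary-relative scheme: replace the type's local name with the property name."""
    # Cut after '#' if the type has a fragment, otherwise after the last '/'.
    cut = item_type.rfind("#")
    if cut == -1:
        cut = item_type.rfind("/")
    return item_type[:cut + 1] + name


def type_property_uri(item_type, name):
    """Type-specific scheme: scope the property name to the item type."""
    # Assumption: '#' introduces the property; if the type already has a
    # fragment, join with an escaped space ('%20') instead.
    separator = "%20" if "#" in item_type else "#"
    return item_type + separator + name


def value_triples(subject, predicate, values, ordered):
    """Return triples for a multi-valued property.

    ordered=True builds an RDF Collection (rdf:first/rdf:rest chain);
    ordered=False emits one plain triple per value.
    """
    if not ordered:
        return [(subject, predicate, value) for value in values]
    if not values:
        return [(subject, predicate, RDF + "nil")]
    # Hypothetical blank node labels, one per list cell.
    nodes = ["_:list%d" % i for i in range(len(values))]
    triples = [(subject, predicate, nodes[0])]
    for i, value in enumerate(values):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", value))
        triples.append((nodes[i], RDF + "rest", rest))
    return triples


# The vCard and FOAF examples from the text:
hcard = "http://microformats.org/profile/hcard"
print(type_property_uri(hcard, "fn"))        # http://microformats.org/profile/hcard#fn
print(vocabulary_property_uri(hcard, "fn"))  # http://microformats.org/profile/fn
print(vocabulary_property_uri("http://xmlns.com/foaf/0.1/Person", "name"))
print(value_triples("<Group>", "foaf:member", ["<gregg>", "<mfhepp>"], ordered=False))
```

A vocabulary-aware processor would pick the scheme and the value of `ordered` per vocabulary (or per property); a generic processor has to fix one choice for everything.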

There are different directions the specification could take:

* The specification requires that each vocabulary used within a document be documented,
  and that processors have the vocabulary-specific rules built in. Documents using
  unrecognized vocabularies fall back to base-level processing, similar to that currently defined.
  The vocabulary-specific rules cover multi-valued properties, property URI generation, and
  property datatype coercion.

* Processors extract the vocabulary from @itemtype and attempt to load an RDFS/OWL definition.
  Property URIs are created by looking for appropriate predicates defined in (or referenced from)
  that definition. Multiple values are coerced to an rdf:List only if the predicate has a range of
  rdf:List. Values are coerced to the appropriate datatype: if the predicate lists a single datatype,
  that datatype is used; if it lists more than one, the datatype is chosen by lexical value matching.

* We use a generic mapping of Microdata to RDF that does not depend on vocabulary-specific rules.
  This may include using RDF Collections for multiple values and vocabulary-relative property URI naming,
  or not, as we decide. The processor may then be at odds with the property generation and
  value ordering rules defined by HTML5 Microdata.
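
For the second direction, the range-driven decisions could look roughly like the Python sketch below. It assumes the processor has already loaded the vocabulary definition and collected the rdfs:range values of a predicate into a plain list of URIs. The function names and the lexical patterns are hypothetical and deliberately simplified (for instance, the xsd:float pattern ignores INF and NaN).

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_LIST = "http://www.w3.org/1999/02/22-rdf-syntax-ns#List"

# Simplified lexical forms for a few XSD datatypes (assumption: not exhaustive).
LEXICAL_FORMS = {
    XSD + "integer": re.compile(r"[+-]?\d+"),
    XSD + "float": re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?"),
    XSD + "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def choose_datatype(value, ranges):
    """Pick a datatype URI for a literal, or None to leave it untyped."""
    datatypes = [r for r in ranges if r.startswith(XSD)]
    # Only one datatype listed: use it directly.
    if len(datatypes) == 1:
        return datatypes[0]
    # Several listed: take the first whose lexical form matches the value.
    for datatype in datatypes:
        pattern = LEXICAL_FORMS.get(datatype)
        if pattern is not None and pattern.fullmatch(value):
            return datatype
    return None


def wants_list(ranges):
    """Coerce multiple values to an RDF Collection only if the range says so."""
    return RDF_LIST in ranges


# gr:hasCurrencyValue has a single range of xsd:float, so the value is typed directly.
print(choose_datatype("19.99", [XSD + "float"]))
# foaf:member has a range of foaf:Agent, so no collection is created.
print(wants_list(["http://xmlns.com/foaf/0.1/Agent"]))
```

A processor following the first direction would hard-code the same answers per vocabulary instead of deriving them from a loaded definition.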

[1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
[2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.html

Received on Tuesday, 18 October 2011 18:29:42 UTC