Re: Multiple itemtypes in microdata from Gregg Kellogg on 2011-10-19 (public-html-data-tf@w3.org from October 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Tue, 18 Oct 2011 21:13:12 -0400
To: Ian Hickson <ian@hixie.ch>
CC: Gregg Kellogg <gregg@kellogg-assoc.com>, Bradley Allen <bradley.p.allen@gmail.com>, Stéphane Corlosquet <scorlosquet@gmail.com>, "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
Message-ID: <EBA01772-89A0-4697-BA40-D7C4514B81EF@greggkellogg.net>
On Oct 18, 2011, at 5:00 PM, Ian Hickson wrote:

> On Tue, 18 Oct 2011, Gregg Kellogg wrote:
>> 
>> Hixie, note that I raised property URI generation as ISSUE-1 [1] (along 
>> with other transformation issues). From reading the HTML/Microdata spec, 
>> it would seem that processors really need to have vocabulary-specific 
>> rules for interpreting these rules. This is important for property URI 
>> generation, but also for maintaining value order and specifying 
>> per-property literal datatypes.
> 
> Yes, all the use cases for microdata were things where it made no sense 
> for software to do anything with the data unless it knew what the data 
> meant, so the assumption is that the microdata processing software knows 
> the vocabulary.

I think there are plenty of good examples of how to define RDF-based vocabularies. Schema.org seems to have done a reasonable job using a flat namespace. This handles contact information, calendar events, and almost every other type of markup you could want.

> This is similar to how XML processors are expected to have 
> namespace-specific knowledge to be useful. Sure, you can have generic XML 
> or microdata (or JSON or...) parsers, but to do anything useful with the 
> data, you have to stick those parsers onto a frontend that knows about the 
> data itself.

Yes, an application will have to process the data in a way that is meaningful for that application. For example, the Structured Data Linter (http://linter.structured-data.org) processes a variety of data just to make snippets (intended to give authors some idea of what their markup might look like in a hypothetical result page). Music software, such as Seevl, performs music discovery using data marked up with Music Ontology and schema.org. These are examples of applications that use standardized vocabularies and RDF markup describing data relationships to do novel things. This data is published in a specific and open way, using the best vocabularies available, to enable such applications. If we were to design data markup only for a specific application, it would not be able to scale in the same way; this is the fundamental principle behind Linked Open Data.

>> The alternatives are:
>> 
>> 1) bake in support for each vocabulary into a conformant processor
> 
> This is the assumption that microdata is built around.

This doesn't scale, and it requires a processor revision for each new vocabulary. Someone needs to act as a gatekeeper. If the decision to provide specific support is left up to each processor implementer, the interpretation of the data becomes variable (and therefore useless). If it's intended that each application do its own processing from HTML, that places a burden on application providers who, IMO, would much rather leave the semantic extraction to standardized tools.

>> 2) read a vocabulary document (i.e., RDFS or OWL) and determine 
>> processing rules from rdfs:range/rdfs:domain specifications
> 
> Generally speaking, no language exists that is expressive enough to 
> actually describe vocabularies in sufficient detail to make this practical 
> for the kinds of vocabularies that microdata's use cases involve.

You say this, and yet a number of such vocabularies have, in fact, been created and are in use today. I'm unclear on what is special about the vocabularies described in HTML (vCard, vEvent, Licensing) that makes them so complicated that FOAF, schema.org, and Creative Commons haven't been able to get them right. If it's the use of this data by a specific application, then yes, it would require something more sophisticated to express both the semantics of the data and how it must be used and presented. That's why data representation is best kept separate from data interpretation and presentation.
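
To illustrate alternative 2, here is a minimal sketch of a generic processor that takes per-property literal datatypes from a vocabulary description instead of hard-coding them. The property URIs, the RANGES table, and the typed_literal helper are all hypothetical; a real processor would load the rdfs:range statements from the vocabulary document itself.

```python
# Hypothetical vocabulary description: each property maps to the datatype
# its rdfs:range would name. Property URIs are invented for illustration.
XSD = "http://www.w3.org/2001/XMLSchema#"
RANGES = {
    "http://example.org/vocab#startDate": XSD + "dateTime",
    "http://example.org/vocab#attendees": XSD + "integer",
}


def typed_literal(prop, value, ranges=RANGES):
    """Return (lexical value, datatype URI); the datatype is None for a plain literal."""
    return value, ranges.get(prop)


print(typed_literal("http://example.org/vocab#startDate", "2011-10-18T21:13:12"))
print(typed_literal("http://example.org/vocab#summary", "Task force call"))
```

Supporting a new vocabulary then means supplying a new table, not revising the processor.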

>> 3) do nothing, use a single processing algorithm that is generic across 
>> all vocabularies and leave it to post-processing to perform 
>> vocabulary-specific modifications. (Although this does not really 
>> address property URI generation variation between vocabularies defined 
>> in HTML and other RDF vocabularies).
> 
> I don't really understand what this means. What does RDF have to do with 
> microdata in this context?

In the context of a Microdata to RDF transformation, I would think that would be obvious.
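
To make that concrete, here is a minimal sketch of property URI generation. It assumes that plain concatenation of the item type and the property name stands in for the processing rules in [2], which are not reproduced here; the property_uri function is hypothetical.

```python
from urllib.parse import urlparse


def property_uri(item_type, name):
    """Build a property URI from an item type and an itemprop name (assumed rule)."""
    # A name that already has a URL scheme is used unchanged.
    if urlparse(name).scheme:
        return name
    # Assumption: a plain name is appended directly to the item type.
    return item_type + name


print(property_uri("http://microformats.org/profile/hcard#", "fn"))
print(property_uri("http://microformats.org/profile/hcard", "fn"))
```

Under this assumption, the type ending in "#" yields http://microformats.org/profile/hcard#fn, while the type as currently specified yields http://microformats.org/profile/hcardfn, which is why the trailing "#" matters.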

>> Note, that if the HTML spec specified 
>> http://microformats.org/profile/hcard# as the vCard type, instead of 
>> just http://microformats.org/profile/hcard, properties would be 
>> generated relative to the type using processing rules currently 
>> described in [2], which is intended to be compatible with 
>> schema.org and other RDF vocabularies.
> 
> The properties in the microdata vCard vocabulary aren't URLs, and it would 
> be incorrect to treat them as URLs. They are "defined property names" in 
> the sense defined in the HTML specification.
> 
> This has implications. For example, it would be invalid to treat these two 
> microdata fragments as equivalent in any way:
> 
>   <address itemscope itemtype="http://microformats.org/profile/hcard">
>    Written by
>    <span itemprop="fn">
>     <span itemprop="n" itemscope>
>      <span itemprop="given-name">Jill</span>
>      <span itemprop="family-name">Darpa</span>
>     </span>
>    </span>
>   </address>
> 
>   <address itemscope itemtype="http://microformats.org/profile/hcard">
>    Written by
>    <span itemprop="http://microformats.org/profile/hcard#fn">
>     <span itemprop="http://microformats.org/profile/hcard#n" itemscope>
>      <span itemprop="http://microformats.org/profile/hcard#n/given-name">Jill</span>
>      <span itemprop="http://microformats.org/profile/hcard#n/family-name">Darpa</span>
>     </span>
>    </span>
>   </address>

Within the context of the HTML definition of that vocabulary, you're correct. By definition, _given-name_ only has meaning within the context of _n_. With an RDFS representation of the vocabulary, even if _n_ and _given-name_ are placed in a flat namespace, the validity of applying each to a given object can be determined given appropriate RDFS domain/range definitions and type inference of the object that _n_ references. OWL 2 allows even more specificity, including the fact that you might have only a single value for _family-name_, but allow multiple values for _given-name_.
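
The domain/range argument can be sketched as follows. The DOMAIN, RANGE, and MAX_CARD tables and the class names VCard and Name are hypothetical stand-ins for what an RDFS/OWL 2 description of the vocabulary would state; this is an illustration under those assumptions, not a definition of the vCard vocabulary.

```python
# Hypothetical flat-namespace vocabulary description.
DOMAIN = {"fn": "VCard", "n": "VCard", "given-name": "Name", "family-name": "Name"}
RANGE = {"n": "Name"}            # a nested item under n is inferred to be a Name
MAX_CARD = {"family-name": 1}    # OWL 2 style cardinality limit


def check(item_type, props):
    """Return a list of problems for an item given as {name: [values]}."""
    errors = []
    for name, values in props.items():
        # The property is valid only if its domain matches the item's type.
        if name not in DOMAIN or DOMAIN[name] != item_type:
            errors.append(f"{name} is not valid on {item_type}")
            continue
        limit = MAX_CARD.get(name)
        if limit is not None and len(values) > limit:
            errors.append(f"{name} allows at most {limit} value(s)")
        for value in values:
            if isinstance(value, dict):
                # Type inference: the nested item's type comes from the range.
                errors.extend(check(RANGE.get(name), value))
    return errors


card = {"fn": ["Jill Darpa"],
        "n": [{"given-name": ["Jill"], "family-name": ["Darpa"]}]}
print(check("VCard", card))
print(check("VCard", {"given-name": ["Jill"]}))
```

Here _given-name_ is accepted inside the object that _n_ references and rejected directly on the card, even though both names live in the same flat namespace.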

> Any software that handled the above in equivalent ways (e.g. finding a 
> vCard with a name "Jill Darpa" in the second case) would be non-conforming 
> implementations of the vCard microdata vocabulary.

In that case, it could just mean that the vCard HTML vocabulary isn't compatible with the Microdata to RDF definition. That said, I'm afraid I still don't understand the specific requirements that make it so, given the ability to indicate domain and range with RDFS.

> (This is why when there was a generic HTML to RDF conversion algorithm in 
> the HTML spec, it went to some lengths to ensure that the URLs generated 
> on the RDF side could not be present in conforming microdata -- it ensured 
> that there was no way to end up in this confusing situation where two 
> different conforming property names had the same semantic.)

Yet it did so in a way that the majority of RDF consumers considered unsatisfying. That is why the HTML Data task force ended up taking this on.

Gregg

> -- 
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 19 October 2011 01:14:16 UTC