Re: Datatypes (Was: Re: Consumer guidance) from Tantek Çelik on 2011-11-23 (public-html-data-tf@w3.org from November 2011)

From: Tantek Çelik <tantek@cs.stanford.edu>
Date: Wed, 23 Nov 2011 15:19:08 +0000
To: "Jeni Tennison" <jeni@jenitennison.com>,"HTML Data Task Force WG" <public-html-data-tf@w3.org>
Cc: "Ivan Herman" <ivan@w3.org>
Message-ID: <620243451-1322061550-cardhu_decombobulator_blackberry.rim.net-631142240-@b11.c1>
Jeni, you wrote:

> RDFa -- both microdata and microformats assume that consumers will have built-in knowledge of the vocabulary that they are consuming.

Slight corrections:

* Yes, microformats 1 do require that parsers have built-in knowledge of the vocabulary that they are consuming.

* However, microformats 2 by design has no such general requirement or assumption of consumers, aside from a very small, fixed set of backward-compatibility vocabularies (smaller than, say, the number of predefined named X11/SVG colors in CSS3 Color). No further microformats 1 style vocabularies are being developed.


> We can also make a similar point on the consumption side: that generic consumers can only pick up information from RDFa.

Strictly speaking not true.

Generic consumers can absolutely pick up all necessary information from microformats 2 syntax (again by design), and at least some generic information from microdata syntax as well. E.g. an HTML5 Drag & Drop implementation can do generic parsing of microformats 2 and microdata, convert them to a standard (and interoperable) JSON data model, and incorporate them into the data being dragged/dropped.
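
To make "generic parsing" concrete, here is a rough Python sketch, not the actual microformats 2 parsing rules. It assumes that root items are marked with h-* class names and properties with p-*, u-*, dt-* or e-* class names, and it produces a JSON-friendly structure without knowing any vocabulary. The names GenericParser and parse are hypothetical, the output shape is only one plausible data model, and the sketch takes every value from text content (it ignores href, datetime and embedded markup) and does not embed nested items in their parents.

```python
from html.parser import HTMLParser

# Class-name prefixes assumed to mark properties; "h-" is assumed to mark items.
PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")
# Void elements never get an end tag, so they are kept off the element stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class GenericParser(HTMLParser):
    """Collects items and properties from class names alone (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._stack = []  # one frame per currently open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        types = [c for c in classes if c.startswith("h-")]
        item = None
        if types:
            item = {"type": types, "properties": {}}
            self.items.append(item)
        # Strip the prefix: "p-name" becomes the property name "name".
        props = [c.split("-", 1)[1] for c in classes
                 if c.startswith(PROPERTY_PREFIXES)]
        self._stack.append({"tag": tag, "item": item, "props": props, "text": []})

    def handle_data(self, data):
        # Text contributes to every open property element that contains it.
        for frame in self._stack:
            if frame["props"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if not any(frame["tag"] == tag for frame in self._stack):
            return  # stray end tag; ignore it
        while self._stack:
            frame = self._stack.pop()
            self._close(frame)
            if frame["tag"] == tag:
                break

    def _close(self, frame):
        if not frame["props"]:
            return
        # A property belongs to the nearest enclosing item still open.
        owner = next((f["item"] for f in reversed(self._stack) if f["item"]), None)
        if owner is None:
            return
        value = "".join(frame["text"]).strip()
        for name in frame["props"]:
            owner["properties"].setdefault(name, []).append(value)


def parse(html):
    """Return a JSON-serialisable dict of the items found in the markup."""
    parser = GenericParser()
    parser.feed(html)
    parser.close()
    return {"items": parser.items}
```

Calling parse() on markup such as <div class="h-card"><span class="p-name">...</span></div> and passing the result to json.dumps would give a string that a drag source could attach to the dragged data; nothing in the code mentions "card" or "name".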


Finally:

> And we can bring out the guidance on the vocabulary side about not making vocabularies where the datatype of a value can't be determined from the property and its syntax.

The double negative in that statement is confusing.

I'm not sure how this is necessary. I'd need specific examples of how this helps to understand what you're saying.

In general, microformats have worked quite well with string-like properties. The only exceptions are those we've learned from real-world experience, as illustrated by the time element (and by dt-* parsing in microformats 2). Even the e-* syntax converts embedded markup/DOM to a string of markup, which a consumer expecting markup may then treat specially.

Thanks,

Tantek


-----Original Message-----
From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 23 Nov 2011 14:53:24 
To: HTML Data Task Force WG<public-html-data-tf@w3.org>
Cc: Ivan Herman<ivan@w3.org>
Subject: Datatypes (Was: Re: Consumer guidance)


On 23 Nov 2011, at 09:48, Ivan Herman wrote:
> On Nov 22, 2011, at 22:48 , Jeni Tennison wrote:
>> As far as I can see, the only time the ability to annotate values with datatypes makes a difference is if the type of the value of a property cannot be inferred from the property and the syntax of the value. Personally, I've been convinced that vocabularies in which that's the case are hard to use and likely to lead to bad data.
> 
> If you refer to an automatic inference of type, I tend to agree with you.

No, I mean that vocabularies that have properties where the type of the value of the property cannot be automatically inferred by the syntax of the value are badly designed.

But you are right that if you do have a vocabulary like that then you need to have a syntax that enables you to label values with datatypes, which means using RDFa.

> What it means for publishers is that if the data and its consumption is dependent on datatypes (or at least would be significantly better using them) then RDFa is a better choice which provides a clear typing facility. (The only exception may be the <time> element.) 


I think that this is an area where there is some deep disagreement. 

On one hand, the argument is that if publishers are given the ability to label the datatypes of data in their pages then consumers can do something useful with it even when they don't know the vocabulary. For example, items that have some property where all the values are numeric can be sorted numerically without a processor knowing whether the items are products and the numbers are prices or the items are people and the numbers are IQs or whatever.
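
A minimal sketch of that kind of generic sort, in Python. It assumes the items have already been extracted into dicts that map each property name to a (lexical value, datatype URI) pair; that representation, the function name sort_numerically and the choice of accepted datatypes are all assumptions made for illustration.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
# Datatypes this sketch is willing to treat as numeric (an assumption).
NUMERIC_TYPES = {XSD + "integer", XSD + "decimal"}


def sort_numerically(items, prop):
    """Sort items by prop, relying only on the datatype labels on its values.

    Each item maps a property name to a (lexical value, datatype URI) pair.
    Nothing here knows whether the numbers are prices, IQs or anything else.
    """
    values = [item[prop] for item in items]
    if not all(datatype in NUMERIC_TYPES for _, datatype in values):
        raise ValueError("not every value of %r is labelled as numeric" % prop)
    return sorted(items, key=lambda item: Decimal(item[prop][0]))


# Hypothetical extracted data: a plain string sort would order these differently.
products = [
    {"label": ("Large", None), "amount": ("100", XSD + "integer")},
    {"label": ("Small", None), "amount": ("9.50", XSD + "decimal")},
    {"label": ("Medium", None), "amount": ("19.99", XSD + "decimal")},
]
ordered = sort_numerically(products, "amount")
```

If a publisher leaves the datatype off even one value, the function refuses to sort, which is the failure mode the opposing argument below is concerned with.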

On the other hand, the argument is that useful consumers always have built-in knowledge about the vocabulary that they understand, so they know what datatypes to expect for each property. Given that, relying on publishers to supply a datatype for each value is problematic because (a) they might get it wrong, by assigning an incorrect datatype or no datatype at all, so a consumer always has to fix up those mistakes anyway and (b) it gives publishers more work to do when we want to make their lives easy.

I think we can probably square this by saying quite near the beginning of the Publisher guidance something like:

  Most consumers of HTML data will only recognise particular vocabularies
  that cover the information that they are interested in. Generic
  consumers perform operations that don't require up-front knowledge of the
  vocabulary, either by using only the information available in the page
  (in particular datatype information) or by fetching a machine-readable 
  representation of the vocabulary in order to display things like labels
  or explanatory text. This second form of consumer is only supported by
  RDFa -- both microdata and microformats assume that consumers will have
  built-in knowledge of the vocabulary that they are consuming.

We can reiterate that in talking about syntax considerations.

We can also make a similar point on the consumption side: that generic consumers can only pick up information from RDFa.

And we can bring out the guidance on the vocabulary side about not making vocabularies where the datatype of a value can't be determined from the property and its syntax.

Does that sound right?

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Wednesday, 23 November 2011 15:19:44 UTC