Datatypes (Was: Re: Consumer guidance) from Jeni Tennison on 2011-11-23 (public-html-data-tf@w3.org from November 2011)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 23 Nov 2011 14:53:24 +0000
To: HTML Data Task Force WG <public-html-data-tf@w3.org>
Cc: Ivan Herman <ivan@w3.org>
Message-Id: <AD23D6AA-E812-4D47-BF61-A57591A0041C@jenitennison.com>

On 23 Nov 2011, at 09:48, Ivan Herman wrote:
> On Nov 22, 2011, at 22:48 , Jeni Tennison wrote:
>> As far as I can see, the only time the ability to annotate values with datatypes makes a difference is if the type of the value of a property cannot be inferred from the property and the syntax of the value. Personally, I've been convinced that vocabularies in which that's the case are hard to use and likely to lead to bad data.
> 
> If you refer to an automatic inference of type, I tend to agree with you.

No, I mean that vocabularies that have properties where the type of the value of the property cannot be automatically inferred from the syntax of the value are badly designed.

But you are right that if you do have a vocabulary like that then you need to have a syntax that enables you to label values with datatypes, which means using RDFa.

> What it means for publishers is that if the data and its consumption is dependent on datatypes (or at least would be significantly better using them) then RDFa is a better choice which provides a clear typing facility. (The only exception may be the <time> element.) 

I think that this is an area where there is some deep disagreement. 

On one hand, the argument is that if publishers are given the ability to label the datatypes of data in their pages then consumers can do something useful with it even when they don't know the vocabulary. For example, items that have some property where all the values are numeric can be sorted numerically without a processor knowing whether the items are products and the numbers are prices or the items are people and the numbers are IQs or whatever.
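
As a rough illustration of that first argument, here is a minimal sketch of a generic consumer that sorts items using only the datatype labels found in the page. The item representation (a dict mapping property names to (lexical value, datatype IRI) pairs), the function names and the choice of XSD numeric types are all assumptions made for the example, not anything defined by RDFa or by this thread.

```python
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
# Assumption: only these two datatypes are treated as numeric here.
NUMERIC_TYPES = {XSD + "integer", XSD + "decimal"}


def is_numeric_property(items, prop):
    """True if every item has `prop` and its value is labelled as numeric."""
    return all(
        prop in item and item[prop][1] in NUMERIC_TYPES
        for item in items
    )


def sort_by_property(items, prop):
    """Sort numerically when the labels allow it, otherwise as plain strings."""
    if is_numeric_property(items, prop):
        return sorted(items, key=lambda item: Decimal(item[prop][0]))
    # Fallback: no usable datatype information, so compare lexical forms.
    return sorted(items, key=lambda item: item.get(prop, ("", None))[0])


# Hypothetical data: the consumer does not know whether these are prices or IQs.
items = [
    {"score": ("10", XSD + "integer")},
    {"score": ("9.5", XSD + "decimal")},
    {"score": ("100", XSD + "integer")},
]
ordered = [item["score"][0] for item in sort_by_property(items, "score")]
```

Without the datatype labels the same values would fall back to string order, which puts "100" before "9.5".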

On the other hand, the argument is that useful consumers always have built-in knowledge about the vocabulary that they understand, so they know what datatypes to expect for each property. Given that, relying on publishers to supply a datatype for each value is problematic because (a) they might get it wrong, by assigning an incorrect datatype or no datatype at all, so a consumer always has to fix up those mistakes anyway and (b) it gives publishers more work to do when we want to make their lives easy.
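
The second position can be sketched in the same way: a consumer that has built-in knowledge of its vocabulary and so interprets each value according to the property, whatever datatype the publisher did or did not supply. The property names (price, releaseDate, name) and the interpret function are hypothetical, chosen only for the example.

```python
from datetime import date
from decimal import Decimal, InvalidOperation

# Built-in vocabulary knowledge: which parser applies to which property.
# These property names are illustrative, not taken from a real vocabulary.
EXPECTED = {
    "price": Decimal,
    "releaseDate": date.fromisoformat,
    "name": str,
}


def interpret(prop, lexical, declared_datatype=None):
    """Parse a value by property; the publisher's declared datatype is ignored."""
    parser = EXPECTED.get(prop)
    if parser is None:
        return None  # property is not part of the vocabulary we understand
    try:
        return parser(lexical.strip())
    except (ValueError, InvalidOperation):
        return None  # value does not have the syntax this property requires


# A mislabelled value is still read correctly, because the label is not used.
price = interpret("price", "9.99", declared_datatype="xsd:string")
released = interpret("releaseDate", "2011-11-23")
```

Since the consumer has to be able to do this anyway, a wrong or missing label from the publisher costs it nothing, which is the point of (a) above.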

I think we can probably square this by saying quite near the beginning of the Publisher guidance something like:

  Most consumers of HTML data will only recognise particular vocabularies
  that cover the information that they are interested in. Generic
  consumers perform operations that don't require up-front knowledge of the
  vocabulary, either by using only the information available in the page
  (in particular datatype information) or by fetching a machine-readable 
  representation of the vocabulary in order to display things like labels
  or explanatory text. Generic consumers are only supported by
  RDFa -- both microdata and microformats assume that consumers will have
  built-in knowledge of the vocabulary that they are consuming.

We can reiterate that in talking about syntax considerations.

We can also make a similar point on the consumption side: that generic consumers can only pick up information from RDFa.

And we can bring out the guidance on the vocabulary side about not making vocabularies where the datatype of a value can't be determined from the property and the syntax of the value.
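
To make that vocabulary guidance concrete, here is a small sketch of a property that accepts more than one datatype but where the syntax of the value is enough to tell them apart. The startDate property, the ALLOWED table and the simplified date and dateTime patterns are assumptions for the example; the patterns are looser than the real XML Schema lexical rules.

```python
import re

# Simplified lexical patterns, looser than the XML Schema definitions.
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
DATETIME = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}(:\d{2})?(Z|[+-]\d{2}:\d{2})?"
)

# Hypothetical vocabulary: datatypes each property allows, with their syntax.
ALLOWED = {
    "startDate": [("dateTime", DATETIME), ("date", DATE)],
}


def infer_datatype(prop, lexical):
    """Return the datatype name implied by the property and the value syntax."""
    for name, pattern in ALLOWED.get(prop, []):
        if pattern.fullmatch(lexical):
            return name
    return None  # unknown property, or a value that matches no allowed syntax


kind_a = infer_datatype("startDate", "2011-11-23")
kind_b = infer_datatype("startDate", "2011-11-23T14:53:24Z")
```

A vocabulary is well behaved in this sense when no value can match two of the allowed patterns for the same property.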

Does that sound right?

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Wednesday, 23 November 2011 14:53:55 UTC