Re: Datatypes (Was: Re: Consumer guidance) from Jeni Tennison on 2011-11-23 (public-html-data-tf@w3.org from November 2011)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Wed, 23 Nov 2011 17:55:36 +0000
To: tantek@cs.stanford.edu
Cc: "HTML Data Task Force WG" <public-html-data-tf@w3.org>, "Ivan Herman" <ivan@w3.org>
Message-Id: <A00DFEA5-4F6C-4550-A878-77F9B4602028@jenitennison.com>
Thanks Tantek,

On 23 Nov 2011, at 15:19, Tantek Çelik wrote:
> Generic consumers can absolutely pick up all necessary information from microformats 2 syntax (again by design), and at least some generic information from microdata syntax as well. E.g. an HTML5 Drag & Drop implementation can do generic parsing of microformats 2 and microdata, convert them to a standard (and interoperable) JSON data model, and incorporate them into the data being dragged/dropped.

OK, let's try to put together some wording for a separate section on generic consumers. Here's a start, but I'd appreciate input about what microformats-2 processors can and can't do, particularly around locating additional machine-readable information about the vocabulary:

  Microdata, RDFa and microformats-2 all use a generic syntax, which means
  that it's possible to have generic parsers operate over them to extract
  data. In the case of microdata and microformats-2, the data has a JSON
  structure; data extracted from RDFa has an RDF structure (microdata can
  also be converted into RDF).

  Generic applications can work in the browser to do things such as 
  highlighting markup that follows a particular syntax or enabling users
  to download the data embedded within a page into a separate file. These
  can also use the context in which the HTML data is found to provide
  additional features. For example, generic consumers may detect that
  each row in a table is associated with a distinct entity, and each cell
  with a particular property, and enable users to sort that table based 
  on property values. In this case, a consumer could ensure that when 
  values are marked up as dates, times or durations using the <time> 
  element, the items are sorted by date/time/duration rather than 
  alphabetically.
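
To make the sorting case concrete, here's a minimal sketch in Python. The row data and function names are hypothetical: it assumes a generic consumer has already extracted, for each row, the visible cell text and the machine-readable value from the <time> element's datetime attribute. It only handles plain dates, not times or durations.

```python
from datetime import date

# Hypothetical rows as a generic consumer might extract them from a table:
# the visible cell text plus the value of <time datetime="...">.
rows = [
    {"name": "Launch", "text": "3 March 2011", "datetime": "2011-03-03"},
    {"name": "Review", "text": "12 January 2011", "datetime": "2011-01-12"},
    {"name": "Release", "text": "23 November 2011", "datetime": "2011-11-23"},
]


def sort_alphabetically(rows):
    """Sort on the visible text, as a consumer unaware of <time> would."""
    return sorted(rows, key=lambda r: r["text"])


def sort_by_date(rows):
    """Sort on the parsed machine-readable date instead."""
    return sorted(rows, key=lambda r: date.fromisoformat(r["datetime"]))


alphabetical = [r["name"] for r in sort_alphabetically(rows)]
chronological = [r["name"] for r in sort_by_date(rows)]
```

With this data the alphabetical sort puts "12 January", "23 November", "3 March" in that order, while the date-aware sort gives January, March, November.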

  Both microformats-2 and RDFa provide additional facilities that enable 
  publishers to indicate the type of values to support generic consumers. 
  Microformats-2 properties have a prefix that can indicate when a value 
  is a URL (u-*), a date/time (dt-*), embedded HTML (e-*) or a string 
  (p-*). RDFa supports a @datatype attribute that publishers can use to
  indicate the datatype of a value, usually an XML Schema datatype such
  as xsd:integer or xsd:language. Note that once microformats-2 data is
  extracted from a page into JSON, these prefixes are no longer available,
  so a consumer of the JSON has to know the vocabulary to tell whether a 
  given value should be interpreted as a string or as HTML markup, for 
  example. In contrast, the datatypes used to annotate RDFa values are
  carried within the RDF data.
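
As a rough illustration of the point about prefixes, here's a sketch in Python. The JSON shape and the function names are assumptions made for the example, not the output of any particular microformats-2 parser; the only thing taken from the text above is the mapping from prefix to kind of value.

```python
# Prefix-to-kind mapping as described in the paragraph above.
PREFIX_KINDS = {
    "u": "URL",
    "dt": "date/time",
    "e": "HTML markup",
    "p": "string",
}


def kind_from_class(class_name):
    """Split a class such as 'dt-start' into ('start', 'date/time').

    Returns None for classes that do not carry a known prefix.
    """
    prefix, sep, name = class_name.partition("-")
    if sep and name and prefix in PREFIX_KINDS:
        return name, PREFIX_KINDS[prefix]
    return None


def to_json_properties(classed_values):
    """Hypothetical extraction step: keys keep the name, not the prefix."""
    props = {}
    for class_name, value in classed_values:
        parsed = kind_from_class(class_name)
        if parsed is not None:
            name, _kind = parsed  # the kind is known here, then discarded
            props.setdefault(name, []).append(value)
    return props


extracted = to_json_properties([
    ("p-summary", "Task force call"),
    ("e-content", "<b>Agenda</b>"),
])
```

After extraction, "summary" and "content" are both just strings in the result, so a consumer of that JSON alone can't tell which one holds markup without knowing the vocabulary.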

  RDFa also adheres to a follow-your-nose principle, whereby vocabulary 
  authors are encouraged to provide a machine-readable description of 
  classes and properties at the URL used for the class or property. This 
  can enable generic processors to automatically pick up additional 
  information about the class or property such as labels, help text, 
  superclasses, property cardinality and ranges and so on. While microdata 
  also uses URLs for types and properties, microdata consumers are not 
  permitted to dereference URLs that they do not already recognise.

>> And we can bring out the guidance on the vocabulary side about not making vocabularies where the datatype of a value can't be determined from the property and its syntax.
> 
> The double negative in that statement is confusing.
> 
> I'm not sure how this is necessary. I'd need specific examples of how this helps to understand what you're saying.

Well, as an example, there's a particular RDF vocabulary, SKOS, which states that the skos:notation property can be used to give a code for a skos:Concept. Some concepts might have codes from different coding schemes. So that vocabulary says that the RDF datatype of the skos:notation value should be used to indicate the type of the coding scheme. So you're actually encouraged in this vocabulary to end up with something like:

  <dog> a skos:Concept ;
    skos:notation "3-12"^^eg:CodingScheme1 ;
    skos:notation "7-53"^^eg:CodingScheme2 ;
    .

There's some more about this at

  http://patterns.dataincubator.org/book/custom-datatype.html

What I was trying to say is that this pattern is bad in HTML data vocabularies, because it limits what syntaxes can be used with the vocabulary (you have to use RDFa) and because it places a burden on publishers and leads to unreliable data. It should always be possible for a vocabulary-aware application to tell the type of a value based on (a) what property it's given as a value for and (b) what the syntax of the value is.
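
Here's a minimal sketch in Python of what I mean by working out the type from (a) and (b). The vocabulary, property names and syntax checks are all made up for illustration; the point is only that no separate datatype annotation is needed.

```python
import re
from datetime import date


def _is_iso_date(value):
    """True if the value parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Hypothetical syntax checks, one per value type.
SYNTAX_CHECKS = {
    "date": _is_iso_date,
    "number": lambda v: re.fullmatch(r"-?\d+(\.\d+)?", v) is not None,
    "url": lambda v: re.match(r"https?://", v) is not None,
    "string": lambda v: True,
}

# Hypothetical vocabulary: for each property, the value types it allows,
# listed from most to least specific.
VOCABULARY = {
    "startDate": ["date"],
    "price": ["number", "string"],
    "homepage": ["url"],
}


def infer_type(prop, value):
    """Pick the first type allowed for the property whose syntax matches."""
    for candidate in VOCABULARY.get(prop, []):
        if SYNTAX_CHECKS[candidate](value):
            return candidate
    return None  # unknown property, or value doesn't fit any allowed type
```

For example, infer_type("price", "12.50") gives "number" and infer_type("price", "call us") gives "string". The skos:notation pattern can't be handled this way, since "3-12" and "7-53" have the same property and similar syntax but different types.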

Cheers,

Jeni
-- 
Jeni Tennison
http://www.jenitennison.com
Received on Wednesday, 23 November 2011 17:56:13 UTC