
Inferring data type from a vocabulary or content

From: Jayson Lorenzen <Jayson.Lorenzen@businesswire.com>
Date: Thu, 20 Oct 2011 10:22:52 -0700
Message-Id: <4E9FF67C0200007E000A1A72@sfgwia1.businesswire.com>
To: "Gregg Kellogg" <gregg@kellogg-assoc.com>
Cc: "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
In the thread titled: "Re: Microdata to RDF: First Editor's Draft (ACTION-6)"

There was discussion about letting parsers and other software infer the data type of a property from the vocabulary or from the content being parsed. I would like to point out a use case where this either does not work or is misused the majority of the time. Many Schema.org properties declare their type as "Text", but since HTML is what is being marked up, the data these properties identify often contains HTML formatting as well. Some distillers/parsers do not treat this data as HTML/XML, and the formatting of the content is lost.

In the nice blog posts about converting between RDF and Microdata, Jeni has a section on the lack of a means to state that the content of an element is an http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral. It would be nice if this case were added to the use cases to be addressed. Since this is all about marking up HTML, it seems likely that HTML will quite often be captured as the content of properties.

Here are some examples of where this is happening today. 

GoodReads is probably using the "awards" property incorrectly here (each <a> should probably get its own itemprop="awards"), but even so, parsers like the RichSnip tool and any23 do not treat this data as HTML, and the formatting is lost:


<div class="infoBoxRowItem" itemprop='awards'>
   <a href="/award/show/9-hugo-award" class="award">Hugo Award for Best Novel (1985)</a>, <a href="/award/show/23-nebula-award" class="award">Nebula Award for Best Novel (1984)</a>, <a href="/award/show/326-philip-k-dick-award" class="award">Philip K. Dick Award (1984)</a>, <a href="/award/show/1403-john-w-campbell-memorial-award" class="award">John W. Campbell Memorial Award Nominee for Best Science Fiction Novel (1985)</a>
</div>
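
To make the loss concrete, here is a minimal sketch (Python, standard library only) that reads a shortened version of the GoodReads markup above in two ways: as plain text, which is what a parser produces when it takes "Text" literally, and as the inner markup, which is what an XMLLiteral-style value would keep. The names PropertyExtractor and extract are hypothetical helpers written for this illustration; this is not how the RichSnip tool or any23 work internally, and it ignores nested items and the other Microdata value rules.

```python
import html
from html.parser import HTMLParser

# Shortened copy of the GoodReads example (two awards instead of four).
SNIPPET = (
    '<div class="infoBoxRowItem" itemprop="awards">'
    '<a href="/award/show/9-hugo-award" class="award">Hugo Award for Best Novel (1985)</a>, '
    '<a href="/award/show/23-nebula-award" class="award">Nebula Award for Best Novel (1984)</a>'
    '</div>'
)

# Elements that never have an end tag, so they must not change the nesting depth.
VOID = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class PropertyExtractor(HTMLParser):
    """Hypothetical helper: collects the text and the inner markup of one itemprop."""

    def __init__(self, prop):
        super().__init__(convert_charrefs=True)
        self.prop = prop
        self.depth = 0      # 0 means we are outside the property element
        self.text = []      # plain-text view (formatting dropped)
        self.markup = []    # markup-preserving view

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Inside the property: keep the tag exactly as written in the source.
            self.markup.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif dict(attrs).get("itemprop") == self.prop:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID:
            return
        self.depth -= 1
        if self.depth:      # do not emit the property element's own end tag
            self.markup.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)
            # Character references were decoded by the parser; re-escape for markup.
            self.markup.append(html.escape(data, quote=False))


def extract(html_source, prop):
    """Return (plain_text, inner_markup) for the first element with itemprop=prop."""
    parser = PropertyExtractor(prop)
    parser.feed(html_source)
    parser.close()
    return "".join(parser.text), "".join(parser.markup)


text_value, markup_value = extract(SNIPPET, "awards")
print(text_value)
print(markup_value)
```

The first printed value is a single comma-separated string with no links; the second keeps each <a> element and its href, which is the information a consumer would need in order to tell the awards apart or follow them.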

Then, in the reviews section of the Telegraph, the review body content is HTML with formatting, but the parsers lose this formatting when extracting the reviewBody property, which Schema.org defines as Text:


<div id="mainBodyArea" itemprop="reviewBody">
  <div class="firstPar"><p>
     <strong>The Butchers Arms</strong>, Woolhope, Herefordshire HR1 4RF <br>
     <strong>Contact </strong>01432 860281; food@butchersarmswoolhope.co.uk).<br>
     <strong>Price </strong>Three courses with a couple of pints, or half a bottle of 
     wine and coffee: &pound;35-40 per head</p></div>
 <div class="secondPar"><p>
     For a man whose first career was in advertising, Stephen Bull is no huge fan 
     of the hard sell. &ldquo;It&rsquo;s all much of a muchness, really,&rdquo; 
     replied the owner of The Butchers Arms near Hereford when asked to recommend a
     couple of his dishes. &ldquo;All pretty mediocre.&rdquo;

I will be able to point to more examples on the IPTC site later, once some rNews samples are available in Microdata (I am actually working on some). The articleBody property of Schema.org/Article is probably always going to contain formatted text once news orgs start using it, since it is being used in HTML.



Jayson Lorenzen
Senior Software Engineer
B  U  S  I  N  E  S  S       W  I  R  E 
A Berkshire Hathaway Company
+1.415.986.4422, ext. 766 
+1.415.956.2609 (fax) 
Business Wire/San Francisco 
44 Montgomery St. 39th Floor
San Francisco, CA 94104

Received on Thursday, 20 October 2011 17:25:07 UTC
