Re: Inferring data type from a vocabulary or content (ISSUE-2) from Gregg Kellogg on 2011-10-20 (public-html-data-tf@w3.org from October 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Thu, 20 Oct 2011 14:43:04 -0400
To: Jayson Lorenzen <Jayson.Lorenzen@businesswire.com>
CC: Gregg Kellogg <gregg@kellogg-assoc.com>, "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
Message-ID: <5AAB8D90-DA99-4D43-9915-13E6E1959839@greggkellogg.net>
Hi Jayson, thanks for the note:

On Oct 20, 2011, at 10:22 AM, Jayson Lorenzen wrote:

> In the thread titled: "Re: Microdata to RDF: First Editor's Draft (ACTION-6)"
> 
> There was discussion about being able to have parsers and software infer the data type of a property from the vocabulary or the content being parsed.  I would like to point out a use-case where this is either not working or being misused the majority of the time. There are a lot of Schema.org properties that state their type as "Text", but since HTML is what is being marked up, there is also a lot of HTML formatting in the data these properties are identifying. Some distillers/parsers do not treat this data as HTML/XML, and the formatting of the content is lost. 
> 
> Jeni actually has a section in the nice blog posts about converting between RDF and Microdata regarding the lack of a means to state that the content of an element is  http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral.  It would be nice if this case were added to the use cases to be addressed. Since this is all about marking up HTML, it seems likely that some HTML will be caught as content for properties quite often. 
> 
> 
> Here are some examples of where this is happening today. 
> 
> GoodReads are probably using the "awards" property incorrectly here, (each <a> should probably get its own itemprop="awards") but even so, parsers like the RichSnip tool and any23 do not treat this data as HTML and the formatting is lost:
> 
> http://www.goodreads.com/book/show/14770.Neuromancer
> 
> <div class="infoBoxRowItem" itemprop='awards'>
>   <a href="/award/show/9-hugo-award" class="award">Hugo Award for Best Novel (1985)</a>, <a href="/award/show/23-nebula-award" class="award">Nebula Award for Best Novel (1984)</a>, <a href="/award/show/326-philip-k-dick-award" class="award">Philip K. Dick Award (1984)</a>, <a href="/award/show/1403-john-w-campbell-memorial-award" class="award">John W. Campbell Memorial Award Nominee for Best Science Fiction Novel (1985)</a>  
> 
> </div>

Yes, in this case, Microdata will take the text content of the div, and eliminate the markup. If they had intended to list the URL of each award, they would need to put an @itemprop on each anchor element.
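
To make the difference concrete, here is a minimal Python sketch of the two cases. It is not a conforming Microdata processor: the class ItempropValues and the helper itemprop_values are hypothetical names, item scoping and most of the Microdata value rules are ignored, and only two rules are modelled (an <a> element with @itemprop yields its resolved href; any other element yields its text content).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ItempropValues(HTMLParser):
    """Collect (name, value) pairs for elements carrying @itemprop."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.depth = 0
        self.open_props = []  # (name, depth, text parts) for open non-anchor properties
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag not in VOID:
            self.depth += 1
        name = attrs.get("itemprop")
        if name is None:
            return
        if tag == "a" and "href" in attrs:
            # An anchor's value is its URL, resolved against the page URL.
            self.values.append((name, urljoin(self.base_url, attrs["href"])))
        elif tag not in VOID:
            self.open_props.append((name, self.depth, []))

    def handle_data(self, data):
        # Text counts toward every enclosing property; the markup is dropped.
        for _, _, parts in self.open_props:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.open_props and self.open_props[-1][1] == self.depth:
            name, _, parts = self.open_props.pop()
            self.values.append((name, "".join(parts)))
        self.depth -= 1


def itemprop_values(html, base_url):
    parser = ItempropValues(base_url)
    parser.feed(html)
    parser.close()
    return parser.values


PAGE = "http://www.goodreads.com/book/show/14770.Neuromancer"

# Shortened form of the published markup: @itemprop on the wrapping div.
as_published = (
    '<div itemprop="awards">'
    '<a href="/award/show/9-hugo-award">Hugo Award for Best Novel (1985)</a>, '
    '<a href="/award/show/23-nebula-award">Nebula Award for Best Novel (1984)</a>'
    '</div>'
)

# Alternative: @itemprop on each anchor element.
per_anchor = (
    '<div>'
    '<a itemprop="awards" href="/award/show/9-hugo-award">Hugo Award for Best Novel (1985)</a>, '
    '<a itemprop="awards" href="/award/show/23-nebula-award">Nebula Award for Best Novel (1984)</a>'
    '</div>'
)

print(itemprop_values(as_published, PAGE))
print(itemprop_values(per_anchor, PAGE))
```

With the published markup the sketch yields a single "awards" value holding the comma-separated text; with @itemprop on each anchor it yields one URL per award.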

The RDFa processing rules would detect that this was a mixed element and create an XMLLiteral. To borrow an example from RDFa, we could consider the following:

<div itemscope itemtype="http://schema.org/Book">
  <h2 itemprop="name">E = mc<sup>2</sup>: The Most Urgent Problem of Our Time</h2>
</div>

Should this create an XMLLiteral and preserve the (HTML) markup, or only extract the text? The current processing rules would extract the text. I've created ISSUE-2 to track discussion.
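
The two candidate values can be compared with a short Python sketch. The names PropertyContent and text_and_markup are hypothetical, and the "markup" string below is only the re-serialized inner content of the element; a real XMLLiteral would also need canonicalization and namespace handling, which this sketch does not attempt.

```python
from html import escape
from html.parser import HTMLParser

# Elements with no end tag; they must not change the nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class PropertyContent(HTMLParser):
    """Record the text content and the inner markup of the first
    element whose @itemprop equals the requested property name."""

    def __init__(self, prop):
        super().__init__(convert_charrefs=True)
        self.prop = prop
        self.depth = 0      # 0 while outside the target element
        self.done = False
        self.text = []
        self.markup = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Nested element: keep its start tag in the markup form only.
            self.markup.append(self.get_starttag_text())
            if tag not in VOID:
                self.depth += 1
        elif (not self.done and tag not in VOID
              and dict(attrs).get("itemprop") == self.prop):
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth or tag in VOID:
            return
        self.depth -= 1
        if self.depth:
            self.markup.append(f"</{tag}>")
        else:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)
            self.markup.append(escape(data, quote=False))


def text_and_markup(html, prop):
    parser = PropertyContent(prop)
    parser.feed(html)
    parser.close()
    return "".join(parser.text), "".join(parser.markup)


book = (
    '<div itemscope itemtype="http://schema.org/Book">\n'
    '  <h2 itemprop="name">E = mc<sup>2</sup>: '
    'The Most Urgent Problem of Our Time</h2>\n'
    '</div>'
)

text, markup = text_and_markup(book, "name")
print(text)    # E = mc2: The Most Urgent Problem of Our Time
print(markup)  # E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
```

The text form silently turns the superscript into "mc2", which is the loss of information at stake in ISSUE-2.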

> Then in the reviews section of the Telegraph, the review body content is HTML with formatting, but the parsers lose this formatting when extracting the reviewBody property, which is defined at Schema.org as Text.

Arguably, for the schema.org case, extracting the text content is the appropriate thing to do, if that is the correct interpretation of Text in schema.org. This would be best taken up with the Vocabularies task force [1].

> http://www.telegraph.co.uk/foodanddrink/restaurants/8824409/The-Butchers-Arms-Woolhope-Herefordshire-restaurant-review.html
> 
> <div id="mainBodyArea" itemprop="reviewBody">
>  <div class="firstPar"><p>
>     <strong>The Butchers Arms</strong>, Woolhope, Herefordshire HR1 4RF <br>
>     <strong>Contact </strong>01432 860281; food@butchersarmswoolhope.co.uk).<br>
>     <strong>Price </strong>Three courses with a couple of pints, or half a bottle of 
>     wine and coffee: &pound;35-40 per head</p></div>
> <div class="secondPar"><p>
>     For a man whose first career was in advertising, Stephen Bull is no huge fan 
>     of the hard sell. &ldquo;It&rsquo;s all much of a muchness, really,&rdquo; 
>     replied the owner of The Butchers Arms near Hereford when asked to recommend a
>     couple of his dishes. &ldquo;All pretty mediocre.&rdquo;
> </p>
>  ...
> </div>
> 
> 
> 
> I will be able to point to more examples later on the IPTC site once there are some rNews samples available in Microdata (I am working on some actually). The articleBody property of Schema.org/Article is probably always going to get formatted text in it once news orgs start using it, since it is being used in HTML. 
> 
> thanks
> 
> j
> 

Gregg

[1] http://lists.w3.org/Archives/Public/public-vocabs/

> 
> Jayson Lorenzen
> Senior Software Engineer
> ____________________________ 
> B  U  S  I  N  E  S  S       W  I  R  E 
> A Berkshire Hathaway Company
> 
> +1.415.986.4422, ext. 766 
> +1.415.956.2609 (fax) 
> www.BusinessWire.com
> 
> Business Wire/San Francisco 
> 44 Montgomery St. 39th Floor
> San Francisco, CA 94104
> 
Received on Thursday, 20 October 2011 18:44:00 UTC