Re: Microdata itemid and src / href from Gregg Kellogg on 2011-10-24 (public-html-data-tf@w3.org from October 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Mon, 24 Oct 2011 15:35:58 -0400
To: Jayson Lorenzen <Jayson.Lorenzen@businesswire.com>
CC: Philip Jägenstedt <philipj@opera.com>, "public-html-data-tf@w3.org" <public-html-data-tf@w3.org>
Message-ID: <A75B3C7F-BDBE-4BB2-9634-68DA59DA401F@greggkellogg.net>
On Oct 24, 2011, at 8:34 AM, Jayson Lorenzen wrote:

> Couple more apologies. First, sorry to go on and on about this, with
> long examples and test output and ranting again. Second, I think what I
> meant to say before was not about Vocabulary but maybe .... just
> "Different parsers can see the same thing very differently" and maybe
> the recommendations from this group should steer newbees in the
> direction that causes the least confusion. Now, onto more rant :)
>
> In reply to Philip, actually there was a copy paste mistake (extra
> closing div) in the HTML I had used as examples before as well as the
> odd way the <a> was added as you pointed out. I found that, indeed,
> testing with one more parser (any23) the outcome was wrong. HOWEVER, a
> little change in the V1 HTML and it works in both parsers, I mean
> produces the very same RDF, but with really different HTML. The Rich
> Snip Test Tool, however, follows the HTML and produces different output.
> See the two sets of examples below. These are labeled V1 and V2, and
> each set includes the HTML, extracts from two RDF distillers and the
> Rich Snip Test tool.
>
>
>
> ************************** V1 **************************
> ************************** V1 **************************
>
>    <body>
>        <div itemscope="itemscope"
> itemtype="http://schema.org/Organization"
>            itemid="http://businesswire.com">
>            <meta itemprop="name" content="Business Wire"/>
>        </div>
>
>        <div itemscope="itemscope"
> itemtype="http://schema.org/NewsArticle"
>            itemid="http://www.example.com/news/20110415123/">
>            <div itemprop="articleBody"> NEW YORK, NY--(Test
> News)--Stocks were mixed and bond
>                yields were at their lowest level in a year
> Thursday.... </div>
>            <a itemprop="copyrightHolder"
> href="http://www.businesswire.com">&#160;</a>
>        </div>
>    </body>
>
>
> ********************* any23 ******************
>
> <http://businesswire.com> a <http://schema.org/Organization> ;
>        <http://schema.org/Organization/name> "Business Wire" .
>
> <http://www.example.com/news/20110415123/> a
> <http://schema.org/NewsArticle> ;
>        <http://schema.org/NewsArticle/copyrightHolder>
> <http://www.businesswire.com> ;
>        <http://schema.org/NewsArticle/articleBody> """ NEW YORK,
> NY--(Test News)--Stocks were mixed and bond
>                yields were at their lowest level in a year
> Thursday.... """ .

Any23 constructs property URIs differently from my own distiller, and from the method originally recommended in the HTML Microdata spec. I actually filed a bug against this some time ago [4]. Still, deciding how such URIs should be constructed is one of the core tasks of this TF.

Any23 basically constructs property URIs by appending the property as a path element, which is inconsistent with the schema.org usage (IMHO). In this case, the "name" property is appended to the http://schema.org/Organization type to create http://schema.org/Organization/name.

Given that they've categorized this bug as a defect, I believe they will go with my suggested interpretation, at some point.
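
To make the difference concrete, here is a minimal Python sketch of the two approaches. The function names are made up for illustration, and the second function is only an assumption about how a vocabulary-relative URI could be derived (it drops the last path segment of the type URI). It is not taken from either tool or from the draft, and it ignores type URIs that use a fragment (#).

```python
def type_relative_property_uri(item_type: str, prop: str) -> str:
    """Any23-like behavior as described above: append the property
    name to the item type as an extra path segment."""
    return item_type.rstrip("/") + "/" + prop


def vocabulary_relative_property_uri(item_type: str, prop: str) -> str:
    """Assumed alternative: replace the last path segment of the
    type URI with the property name, so the property lives next to
    the type in the same vocabulary."""
    vocabulary = item_type.rsplit("/", 1)[0] + "/"
    return vocabulary + prop


item_type = "http://schema.org/Organization"

print(type_relative_property_uri(item_type, "name"))
# http://schema.org/Organization/name

print(vocabulary_relative_property_uri(item_type, "name"))
# http://schema.org/name
```

With the first approach, the same schema.org property used on two different types ends up with two different URIs, which is the inconsistency noted above.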

> ********************* greggKellogg *********
>
> <http://businesswire.com> a schema:Organization;
>   schema:name "Business Wire" .
>
> <http://www.example.com/news/20110415123/> a schema:NewsArticle;
>   schema:articleBody """ NEW YORK, NY--(Test News)--Stocks were mixed
> and bond
>                yields were at their lowest level in a year
> Thursday.... """;
>   schema:copyrightHolder <http://www.businesswire.com> .

So, this is essentially the same as what's generated by any23, except for the use of PNAMEs (prefixed names) and the way property URIs are generated. The property URIs shown here follow what the current Microdata to RDF draft uses.

You did leave out the item ordering, which should be the following:

<> md:item ( <http://businesswire.com> <http://www.example.com/news/20110415123/> ) .

(This is a bit different from what the distiller generates, since the distiller hasn't yet been updated to the latest Microdata to RDF spec [5], which links the items together in the order they are listed on the page.)

I don't personally see much value in this triple. It is consistent with preserving the order of the items listed on the page, but it may be subject to further refinement by this group.

> ********************* rich snip tool *********
> Item
> Id:http://businesswire.com
> Type: http://schema.org/organization
> name = Business Wire
>
> Item
> Id:http://www.example.com/news/20110415123/
> Type: http://schema.org/newsarticle
> articlebody = NEW YORK, NY--(Test News)--Stocks were mixed and bond
> yields were at their lowest level in a year Thursday....
> copyrightholder = http://www.businesswire.com/

Given that this tool doesn't generate RDF, and so doesn't need to create URIs for properties, its output otherwise seems consistent with the other representations. The other main difference is some normalization of the URIs (to lower case), which is fine for HTTP, but actually produces different URIs as far as RDF is concerned, since RDF compares URIs character by character. Also, the text has had whitespace removed or compressed, which is consistent with how a browser would present it anyway, whereas RDF preserves the original whitespace. (I've often thought that there should be a separate content model for HTML which would allow whitespace to be canonicalized, such as removing leading and trailing whitespace and normalizing internal whitespace to a single space, but that's not how the specs treat it.)
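
As an illustration of the kind of whitespace canonicalization described above, here is a minimal Python sketch. The function name is made up, and this is only one possible reading of "canonicalized"; none of the specs or tools discussed here define it this way.

```python
def canonicalize_whitespace(text: str) -> str:
    """Strip leading/trailing whitespace and collapse every internal
    run of whitespace (spaces, tabs, newlines) to a single space."""
    return " ".join(text.split())


# The articleBody literal as an RDF extractor would preserve it.
raw = (
    "\n      NEW YORK, NY--(Test News)--Stocks were mixed and bond "
    "yields were at their lowest level\n      in a year Thursday....\n    "
)

print(canonicalize_whitespace(raw))
# NEW YORK, NY--(Test News)--Stocks were mixed and bond yields were at their lowest level in a year Thursday....

# URIs are a different matter: compared character by character,
# the lower-cased type is not the same URI as the original.
print("http://schema.org/NewsArticle" == "http://schema.org/newsarticle")
# False
```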

> ************************** V2 **************************
> ************************** V2 **************************
>
> <body itemscope="itemscope" itemtype="http://schema.org/NewsArticle"
> itemid="http://www.example.com/news/20110415123/">
>      <div itemprop="articleBody">
>        NEW YORK, NY--(Test News)--Stocks were mixed and bond yields
> were at their lowest level
>        in a year Thursday....
>      </div>
>
>    <div itemprop="copyrightHolder" itemscope="itemscope"
>      itemtype="http://schema.org/Organization"
> itemid="http://businesswire.com">
>      <meta itemprop="name" content="Business Wire"/>
>    </div>
>
>  </body>

In this case, you've made a change at the HTML item scope, but it doesn't really affect the general RDF modeling. The main difference is that http://businesswire.com is no longer a top-level item.

> ********************* any23 ******************
>
> <http://www.example.com/news/20110415123/> a
> <http://schema.org/NewsArticle> .
>
> <http://businesswire.com> a <http://schema.org/Organization> ;
>        <http://schema.org/Organization/name> "Business Wire" .
>
> <http://www.example.com/news/20110415123/>
> <http://schema.org/NewsArticle/copyrightHolder>
> <http://businesswire.com> ;
>        <http://schema.org/NewsArticle/articleBody> "\n      NEW YORK,
> NY--(Test News)--Stocks were mixed and bond yields were at their lowest
> level\n      in a year Thursday....\n    " .

Even though it looks like three different things, it's really just a Turtle serialization choice. The NewsArticle type could have been output with the rest of the data. From an RDF perspective, except for the missing md:item entry, it's the same as the distiller's.

> ********************* greggKellogg *********
>
> <http://businesswire.com> a schema:Organization;
>   schema:name "Business Wire" .
>
> <http://www.example.com/news/20110415123/> a schema:NewsArticle;
>   schema:articleBody """
>      NEW YORK, NY--(Test News)--Stocks were mixed and bond yields were
> at their lowest level
>      in a year Thursday....
>    """;
>   schema:copyrightHolder <http://businesswire.com> .

It should also have the following triple, to express that the article is a top-level item:

<> md:item <http://www.example.com/news/20110415123/> .

> ********************* rich snip tool *********
>
> Item
> Id:http://www.example.com/news/20110415123/
> Type: http://schema.org/newsarticle
> articlebody = NEW YORK, NY--(Test News)--Stocks were mixed and bond
> yields were at their lowest level in a year Thursday....
> copyrightholder = Item( 1 )
>
> Item 1
> Id:http://businesswire.com
> Type: http://schema.org/organization
> name = Business Wire

Since the rich snip tool isn't outputting RDF, it's really just showing the relative item relationships. In this case, it shows that the copyright holder is an item that has other information asserted against it. From a linked data perspective, they really say the same thing.
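
A rough way to see why the nested and the flat form "say the same thing" is to flatten both into (subject, property, value) tuples and compare the sets. The Python sketch below does that for item structures shaped like the microdata JSON quoted further down. The function names and the simplified data are assumptions for illustration: every item is assumed to have an id, property names are left as bare names instead of full URIs, and the link in the flat version is assumed to use exactly the same URI as the organization's itemid (the V1 markup above actually links to the www host, which would be a different URI).

```python
def flatten(item: dict, triples: set) -> str:
    """Add the item's statements to `triples` and return its id.
    A nested item is replaced by a reference to its id."""
    subject = item["id"]
    triples.add((subject, "type", item["type"]))
    for prop, values in item["properties"].items():
        for value in values:
            if isinstance(value, dict):  # nested item
                value = flatten(value, triples)
            triples.add((subject, prop, value))
    return subject


def document_triples(doc: dict) -> set:
    """Flatten every top-level item of a microdata JSON document."""
    triples = set()
    for item in doc["items"]:
        flatten(item, triples)
    return triples


org = {
    "type": "http://schema.org/Organization",
    "id": "http://businesswire.com/",
    "properties": {"name": ["Business Wire"]},
}

# Flat form: two top-level items, linked only by URI.
flat = {"items": [
    org,
    {
        "type": "http://schema.org/NewsArticle",
        "id": "http://www.example.com/news/20110415123/",
        "properties": {"copyrightHolder": ["http://businesswire.com/"]},
    },
]}

# Nested form: the organization is the value of copyrightHolder.
nested = {"items": [
    {
        "type": "http://schema.org/NewsArticle",
        "id": "http://www.example.com/news/20110415123/",
        "properties": {"copyrightHolder": [org]},
    },
]}

print(document_triples(flat) == document_triples(nested))
# True
```

The sketch deliberately leaves out the md:item statements, so it says nothing about which items are top-level; that is the one place where the two forms differ.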

Gregg

> Jayson Lorenzen
> Senior Software Engineer
> ____________________________
> B  U  S  I  N  E  S  S       W  I  R  E
> A Berkshire Hathaway Company
>
> +1.415.986.4422, ext. 766
> +1.415.956.2609 (fax)
> www.BusinessWire.com
>
> Business Wire/San Francisco
> 44 Montgomery St. 39th Floor
> San Francisco, CA 94104
>
>
>
>>>>
> From:   Philip Jägenstedt<philipj@opera.com>
> To:     <public-html-data-tf@w3.org>
> Date:   10/24/2011 2:42 AM
> Subject:        Re: Microdata itemid and src / href
>
> On Mon, 24 Oct 2011 11:26:47 +0200, Philip Jägenstedt
> <philipj@opera.com>
> wrote:
>
>> On Sat, 22 Oct 2011 23:04:28 +0200, Jayson Lorenzen
>> <Jayson.Lorenzen@businesswire.com> wrote:
>>
>>> Sorry to go on and on about this but I just thought (while driving,
>
>>> dangerous) that this situation is an interesting example of of a
>>> Vocabulary specific parser behaving differently than a generic
> parser
>>> (that does not know about the Vocabulary). Here is what I mean.
> Using a
>>> generic parser (like Mr. Kellog's)
>>
>> Both of the following examples are invalid microdata and they don't
>
>> represent the same things. Details inline.
>>
>>>  <div itemscope="itemscope"
> itemtype="http://schema.org/Organization"
>>> itemid="http://example.com">
>>>    <meta itemprop="name" content="Example"/>
>>>  </div>
>>>
>>>   <a itemprop="myCompany" href="http://example.com">
>>
>> Validator.nu [1] will complain about the <a> element that "The
> itemprop
>> attribute was specified, but the element is not a property of any
> item."
>> The problem is that the <a> element is not a child of the <div>, so
> it's
>> just ignored. Live Microdata [2] gives this JSON output:
>>
>> {
>>   "items":[
>>     {
>>       "type":"http://schema.org/Organization",
>>       "id":"http://example.com/",
>>       "properties":{
>>         "name":[
>>           "Example"
>>         ]
>>       }
>>     }
>>   ]
>> }
>>
>> (Note that there is no myCompany property.)
>>
>>>  <div itemprop="myCompany" itemscope="itemscope"
>>> itemtype="http://schema.org/Organization"
>>>       itemid="http://example.com">
>>>      <meta itemprop="name" content="Example"/>
>>>   </div>
>>
>> Validator.nu will complain that "The itemprop attribute was
> specified,
>> but the element is not a property of any item." The issue here is
> that
>> there is no top-level item, since the outer item has an itemprop
>> attribute. Consequently, Live Microdata gives no output, simply
> noting
>> that "No top-level items found."
>>
>>> Produce the exact same RDF in a generic parser, but completely
>>> different results in the Google Rich Snippets Test tool.
>>
>> It sounds like there are bugs in the microdata parser used. Gregg,
> can
>> you take a look at this?
>>
>> [1] http://validator.nu/
>> [2] http://foolip.org/microdatajs/live/
>> [3] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#top-level-microdata-items
[4] http://code.google.com/p/any23/issues/detail?id=173&q=property%20uri
[5] https://dvcs.w3.org/hg/htmldata/raw-file/24af1cde0da1/microdata-rdf/index.html
>
> Oops, the examples were expanded further down in the original mail:
>
> The JSON representation of those are:
>
> Version One:
>
> {
>   "items":[
>     {
>       "type":"http://schema.org/NewsArticle",
>       "id":"http://www.example.com/news/20110415123/",
>       "properties":{
>         "articleBody":[
>           "\n        NEW YORK, NY--(Test News)--Stocks were mixed and
> bond
> yields were at their lowest level\n        in a year Thursday.... "
>         ],
>         "copyrightHolder":[
>           "http://www.businesswire.com/"
>         ]
>       }
>     },
>     {
>       "type":"http://schema.org/Organization",
>       "id":"http://businesswire.com/",
>       "properties":{
>         "name":[
>           "Business Wire"
>         ]
>       }
>     }
>
> Version Two:
>
> {
>   "items":[
>     {
>       "type":"http://schema.org/NewsArticle",
>       "id":"http://www.example.com/news/20110415123/",
>       "properties":{
>         "articleBody":[
>           "\n        NEW YORK, NY--(Test News)--Stocks were mixed and
> bond
> yields were at their lowest level\n        in a year Thursday.... "
>         ],
>         "copyrightHolder":[
>           {
>             "type":"http://schema.org/Organization",
>             "id":"http://businesswire.com/",
>             "properties":{
>               "name":[
>                 "Business Wire"
>               ]
>             }
>           }
>         ]
>       }
>     }
>   ]
> }
>
> Given this, it's pretty plain to see why the RDF representation is the
>
> same: the use of itemid. I would suggest using Version Two.
>
> I'll also take the opportunity to complain (again) that the meaning of
>
> itemid is still not defined for schema.org, so strictly speaking using
> it
> is invalid.
>
> --
> Philip Jägenstedt
> Core Developer
> Opera Software
Received on Monday, 24 October 2011 19:37:56 UTC