Re: Generic Property-Value Proposal for Schema.org

Dear Niklas,

On 02 May 2014, at 20:10, Niklas Lindström <lindstream@gmail.com> wrote:

> Hi all,
> 
> I do understand the case for capturing structured values describing special properties of products. The way proposed certainly makes for less normalized, precise and reusable data than simple properties with direct values enable. But it has some merit in the fact that, as Martin says, a little data goes a long way.

Thanks!
> 
> The shape of data that this produce merits some analysis. For one, the idea to some extent resembles a mix of statement reification and structured values. More interestingly though, the pattern is similar to an effect that can be achieved by defining very specific SKOS concepts (for e.g. battery types, operating systems, screen sizes and bluetooth types), and linking to them with e.g. a productSpecification property. To me, these PropertyValue entities really look like a free-form version of such concept/topic/enum entities, with their plain text names representing their "type", rather than a generic property extension (in the RDF sense). And I see the potential in that.

Yes, one could say that my proposal is similar to "SKOS for properties" ;-)
But after the long discussion this has triggered, I would like to narrow the proposal to the very tangible application area of product properties and place properties. I am personally convinced that the pattern is of generic value, because it preserves data structure and data semantics while minimizing the effort for a data publisher. But let's treat that aspect separately.

> 
> (And although such ambiguous data can be hard to collate (and translate), it has its place, just as plain text keywords for basic website SEO have, in a primitive fashion. External enumerations (using e.g. SKOS) are often far more usable and scalable in the long term though.)
> 
Actually, I think that processing the resulting data is less difficult than many assume, since from an NLP perspective, the space of possible interpretations is much smaller, and you will have a lot of contextual information. But the consumption of the data should not be our main concern at this point.

In general, a problem in our discussion has been that the perspectives of data publication and data consumption have been mixed. Of course we all agree that the resulting data takes more effort to process than standardized properties. Compare

Ideal Version: External Property with Quantitative Value

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
...
  Operating Voltage: <div itemprop="http://acme.org/vocab/#voltage" itemscope 
       itemtype="http://schema.org/QuantitativeValue">
      <span itemprop="minValue">100</span>-
      <span itemprop="maxValue">250</span>
      <meta itemprop="unitCode" content="VLT" > V
  </div>
</div>

with this

Variant 1: Property name instead of URI

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
	  <span itemprop="name">Operating Voltage</span>
	  <span itemprop="minValue">100</span>-
	  <span itemprop="maxValue">250</span>
	  <meta itemprop="unitCode" content="VLT"> V
  </div>  
</div>

or this

Variant 2: Unit as text instead of UN/CEFACT Common Code and range as a single field


<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
	  <span itemprop="name">Operating Voltage</span>
	  <span itemprop="value">100-250</span>
	  <span itemprop="unitText">V</span>
  </div>  
</div>

or, in the worst case, this:

Variant 3: Range and Unit in a joint field

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
	  <span itemprop="name">Operating Voltage</span>
	  <span itemprop="value">100-250 V</span>
  </div>  
</div>


It is obvious that the version with a dedicated property URI and a proper http://schema.org/QuantitativeValue node is easier to process.

But from the perspective of a data provider, who typically holds product properties in very lightweight property-value structures, often with proprietary properties, even the step to Variant 1 makes data publication much, much simpler: the provider has to neither map the local property name to a standard property URI nor determine the type of the value (quantitative, qualitative, or Boolean). That mapping is VERY difficult to do from within typical Web applications, even if the back-end systems (PDM/PIM) hold this additional data.
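
To illustrate how little the publisher has to do, here is a minimal sketch in Python of the lightest form of this markup. The function name `render_property_values` and the dict of local property names are hypothetical; the point is only that local names and values pass through unchanged, with no URI mapping and no value typing.

```python
from html import escape


def render_property_values(product_name, properties):
    """Render a Product with one PropertyValue per local (name, value) pair.

    `properties` is assumed to be a plain dict as a shop system might hold it,
    e.g. {"Operating Voltage": "100-250 V"}.
    """
    lines = [
        '<div itemscope itemtype="http://schema.org/Product">',
        f'  <span itemprop="name">{escape(product_name)}</span>',
    ]
    for name, value in properties.items():
        lines += [
            '  <div itemprop="additionalProperty" itemscope '
            'itemtype="http://schema.org/PropertyValue">',
            # Name and value are copied as text; nothing is mapped or typed.
            f'    <span itemprop="name">{escape(name)}</span>',
            f'    <span itemprop="value">{escape(str(value))}</span>',
            '  </div>',
        ]
    lines.append('</div>')
    return '\n'.join(lines)


print(render_property_values("ACME Electric Anvil",
                             {"Operating Voltage": "100-250 V"}))
```

Any template that can already print a specification table could emit this with a few extra attributes.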

From a data consumer's perspective, however, even the lightest version

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
	  <span itemprop="name">Operating Voltage</span>
	  <span itemprop="value">100-250 V</span>
  </div>  
</div>

is still much easier to consume and lift than

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">ACME Electric Anvil</span>
  <div>
	  <span>Operating Voltage</span>
	  <span>100-250 V</span>
  </div>  
</div>

And I expect that most sites could easily reach the level of Variant 1 or Variant 2.
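
On the consumer side, lifting the joint field of Variant 3 into separate parts is not hard either. A rough sketch in Python, assuming values shaped like "100-250 V" or "4 in"; the regular expression and the function name `lift_value` are illustrative only and not part of the proposal:

```python
import re

# Assumed input shape: "<min>-<max> <unit>" or "<number> <unit>".
_VALUE = re.compile(
    r"^\s*(?P<min>\d+(?:\.\d+)?)"          # first number
    r"(?:\s*-\s*(?P<max>\d+(?:\.\d+)?))?"  # optional "-max" part of a range
    r"\s*(?P<unit>\S.*?)?\s*$"             # optional trailing unit text
)


def lift_value(text):
    """Split a free-text value into minValue, maxValue and unitText.

    Text that does not start with a number is returned unchanged under
    "value", so qualitative values such as "lithium-ion" are left alone.
    """
    m = _VALUE.match(text)
    if m is None:
        return {"value": text}
    low = float(m.group("min"))
    high = float(m.group("max")) if m.group("max") else low
    return {"minValue": low, "maxValue": high, "unitText": m.group("unit")}


print(lift_value("100-250 V"))    # range with unit
print(lift_value("lithium-ion"))  # not numeric, passed through
```

A real consumer would also use the property name and the product category as context, which is exactly the contextual information mentioned earlier.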

> That said, I would be *very* cautious of promoting this shape as a property extension mechanism. That would be done at the expense of using a mix of vocabularies for specialized data.

As already said, I am perfectly fine with postponing this possibility to the future and constraining this to Product and Place for the moment.
> 
> My opinion, based on experience in both consuming data and working to unify disparate descriptions, is that, in the general case of needing specific properties beyond the core or schema.org, it would be quite valuable to apply the existing mechanism of mixing vocabularies, native to RDF and the enabler of decentralized vocabulary growth. It has been there from the start and proven extremely valuable in specific data integration scenarios. Of course, it has the downside of enabling, in Richard's words, a "cacophony of multiple vocabulary choices". But by grounding basic common terms in schema.org, we have one stable core around which other things can revolve and evolve.

> RDFa has great support for this, especially in compact form through the means of prefixes. But all of RDFa, JSON-LD and microdata support using full URIs as property names, so it can be catered for in general. (Also, JSON-LD has more powerful support through both prefix and direct term definitions in a context.)
> 

Yes, I am perfectly fine with this. If site-owners are able to publish data according to external product ontologies, like the 40+ we developed for GoodRelations, or the new GPC ontology, that is great. And my proposal does not aim at stealing this opportunity.

However, note that there are THREE bottlenecks with using external vocabularies:

1. There must be a suitable vocabulary.

2. Site-owners must be able to map their local data to the external ontologies and publish respective data. In the past ten years, I have been able to convince just ONE site to use eClassOWL at broad scale. The problem is that the data supply for this is quite challenging.

3. Most people I speak to basically say that for their production sites, they do only what is specified in schema.org. Unless the sponsors of schema.org explicitly endorse the use of a certain external vocabulary, it will not see broad adoption, IMO. Adding the proposed elements to schema.org, in contrast, will make it much easier to convince owners of this valuable data to make it available to search engines and other clients.

This is not a technical issue of course, just a signal. But it will matter.

> As an example, here is what the example product table in the proposal could look like, when adding an external vocabulary (also capturing some keywords and using some external enumerations):
> 
>     <div vocab="http://schema.org/"
>         prefix="pto: http://www.productontology.org/id/
>                 unit: http://qudt.org/vocab/unit#
>                 apple: http://apple.com/def/product#">
>       <table typeof="Product pto:IPhone_5">
>         <caption>iPhone 5 Specifications</caption>
>         <tr>
>           <th>Spec</th>
>           <th>Value</th>
>           <th>Description</th></tr>
>         <tr>
>           <td>LTE Band and Mode</td>
>           <td><span property="keywords apple:cellphoneBand">4G</span>
>             <span property="keywords apple:cellphoneMode">LTE</span></td>
>           <td></td></tr>
>         <tr>
>           <td>Battery Type</td>
>           <td property="keywords apple:batteryType">lithium-ion</td>
>           <td></td></tr>
>         <tr>
>           <td><a property="apple:productFeature"
>               href="http://apple.com/def/feature/handheld#Built-In%20GPS">Built-In GPS</a></td>
>           <td>Yes</td>
>           <td></td></tr>
>         <tr>
>           <td property="apple:productFeature">Touch Screen</td>
>           <td>Yes</td>
>           <td></td></tr>
>         <tr>
>           <td>Operating System</td>
>           <td property="operatingSystem">Apple iOS 7</td>
>           <td></td></tr>
>         <tr>
>           <td>Screen Size</td>
>           <td><span property="width apple:screenSize" datatype="unit:Inch">4</span>"</td>
>           <td>Size of the screen, in inches, measured diagonally from corner to corner.
>         </td></tr>
>         <tr property="keywords">
>           <td>Bluetooth Version</td>
>           <td property="apple:bluetoothVersion">4.0</td>
>           <td></td></tr>
>         <tr>
>           <td>Keyboard Type</td>
>           <td property="keywords apple:keyboardType">Virtual QWERTY</td>
>           <td></td></tr>
>         <tr property="hasPart" typeof="Thing pto:Camera">
>           <td><span property="name">Front Facing Camera</span> MP Rating</td>
>           <td property="apple:megaPixelRating">1.2</td>
>           <td></td></tr>
>         <tr property="hasPart" typeof="Thing pto:Camera">
>           <td><span property="name">Rear Facing Camera</span> MP Rating</td>
>           <td property="apple:megaPixelRating">8</td>
>           <td></td></tr>
>       </table>
>     </div>
> 
> If Apple were to use their own properties like this, they can describe them in a page at <http://apple.com/def/product>, using:
> 
>     <body vocab="http://schema.org/">
>       <h1 property="name">Apple Product Vocabulary</h1>
>       <h2>Properties</h2>
>       <article id="batteryType" resource="#batteryType" typeof="Property">
>         <h3 property="name">Battery Type</h3>
>         <p property="description">The type of battery.</p>
>       </article>
>       ...
>     </body>

My problem with this proposal is that it is, as far as I understand, RDFa-centric. And I think our approach should be syntax neutral. 
> 
> It would be very valuable to examine what deployment problems this pattern might have encountered in the past. Perhaps understanding of it has matured in recent times? Use of embedded data in general, schema.org in particular, and the pattern of multiple types from different vocabularies has certainly increased a lot, so I would like to see if this has become more palatable. It really is quite simple:
> 
> 1. Create a page describing your special properties.
> 2. Use these terms within your product pages.

We have tried to foster a similar pattern in shop applications for GoodRelations, but this has really not worked well. And this proposal again mixes the perspectives of data consumption and data publication. It is unnecessary to put the burden of defining a local vocabulary on a Web site when the only purpose is to cross-reference it in other parts of the site. This is something a consuming client can do as well.
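
As a sketch of what "the client can do it" might mean: a consumer could derive a site-local vocabulary simply by grouping the property names it observes across a site's pages. The function names and the (page URL, property name) input format below are assumptions made for illustration.

```python
from collections import defaultdict


def normalize(name):
    """Case- and whitespace-insensitive key for a property name."""
    return " ".join(name.lower().split())


def build_site_vocabulary(observations):
    """Group observed (page_url, property_name) pairs by normalized name.

    Returns a dict mapping each normalized name to the set of pages using it,
    which serves as a cross-reference the site never had to publish.
    """
    vocab = defaultdict(set)
    for page_url, name in observations:
        vocab[normalize(name)].add(page_url)
    return dict(vocab)


seen = [
    ("http://example.com/anvil", "Operating Voltage"),
    ("http://example.com/hammer", "operating  voltage"),
    ("http://example.com/anvil", "Weight"),
]
print(build_site_vocabulary(seen))
```

The example.com URLs are placeholders. Real normalization would need more than lowercasing, but the work sits with the consumer, not with the site.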
> 
> From there, improvements can be made, such as sharing, integrating and reusing these properties across endeavours (linking them as much as possible). And of course promoting the most common of these terms for inclusion in schema.org itself (and again, linking together the "wild" sources with these new core terms).
> 
> In practise, this requires the backers (search engines) of schema.org to promote and utilize the rich potential here. By collecting valuable external properties, and eventually enabling the most common of them to be shown in e.g. rich snippets. It requires effort of course, but since the terms are more structured than plain text, it doesn't require disambiguation heuristics, full NLP and such. (Which is the very case for structured data in pages over raw scraping and powerful text analysis.)
> 
> Cheers,
> Niklas

As far as I can see, my current proposal achieves basically the same with pretty straightforward markup, available in all syntaxes, and in a way that allows various degrees of granularity. Site owners will be able to preserve all granularity (e.g. if value and unit are two fields) and data semantics (e.g. if they can serve a numerical range as min and max or have a public identifier for a property).
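
To make the "all syntaxes" point concrete, the data of Variant 1 could be serialized as JSON-LD roughly as follows. This is a sketch in Python, assuming the proposed `additionalProperty` and `PropertyValue` terms are adopted as written above; the exact layout is an illustration, not a normative example.

```python
import json

# Same data as Variant 1: property name as text, range as min/max,
# unit as a UN/CEFACT Common Code.
product = {
    "@context": "http://schema.org",
    "@type": "Product",
    "name": "ACME Electric Anvil",
    "additionalProperty": {
        "@type": "PropertyValue",
        "name": "Operating Voltage",
        "minValue": 100,
        "maxValue": 250,
        "unitCode": "VLT",
    },
}

jsonld = json.dumps(product, indent=2)
print(jsonld)
```

A site that only has the joint text field would drop `minValue`, `maxValue` and `unitCode` and publish a single `value` instead.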

Let us make it as simple as possible for sites to expose rich product data. That should be our first priority.

Martin

Received on Friday, 2 May 2014 20:38:35 UTC