Re: Generic Property-Value Proposal for Schema.org

Justin:
On 30 Apr 2014, at 23:55, Justin Boyan <jaboyan@google.com> wrote:

> Martin,
> 
> Can you give some examples of how this style of data could be used by a search engine or aggregator to drive interesting features? It seems like it's pushing too much work to the consumer side. Every different website/producer will come up with their own different terminology for the same attributes, which sort of defeats the purpose of a common vocabulary. 
> 
> Thanks,
> Justin

The problem this proposal addresses is that there is no way to achieve ex ante consensus on the tens of thousands of product feature properties. It is not feasible to standardize them in schema.org, and even if that were possible, it would give you no guarantee that the data provided is actually based on the standardized semantics. See Stuart Madnick: "Oh, so That Is What You Meant! The Interplay of Data Quality and Data Semantics", http://link.springer.com/chapter/10.1007%2F978-3-540-39648-2_2.

Thus, we should allow site owners to expose as much meta-data as they can in the given setting. This is what the new http://schema.org/PropertyValue element is all about. Keep in mind that your baseline for comparison is HTML tables with plain text.

Here are a few ideas of how search engines could be using the data from the new schema.org elements in a rather straightforward way:

I. Preprocessing

1. For a set of products of the same type, consolidate properties based on simple or advanced NLP over their names. For instance, use stemming etc. to normalize property names. Then define a similarity metric (maybe start with Hamming distance, Levenshtein,...) and use this to define a threshold for consolidation. This will likely give you a power-law distribution of property names. For the more popular ones, you can assume a certain degree of semantic equivalence (use user feedback mechanisms to improve / train your components). Basically standard stuff ;-)

2. For qualitative values, do the same - simply treat them as named entities.

3. For quantitative values, canonicalize them based on the unit information. To begin with, honor just UN/CEFACT codes and encourage webmasters to provide them. For a short list, see http://wiki.goodrelations-vocabulary.org/Documentation/UN/CEFACT_Common_Codes. Conversion factors are readily available from http://www.qudt.org/. This can be done in COBOL, if needed.

4. Cleanse values with simple regular expressions. To begin with, ignore invalid data (and maybe tell site owners so via the Google Structured Data Testing tool; this is your most effective feedback channel to the respective developers).

5. Canonicalize point values and ranges by filling both "min" and "max" with "value" if "value" exists.

6. Create an internal data structure that stores the resulting data for each entity, e.g. in the Google Knowledge Graph.
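
A minimal sketch of steps 1, 3, 4 and 5 in Python, just to show how little machinery is needed. Everything specific in it is an assumption for illustration and not part of the proposal: the tiny TO_BASE conversion table, the dict keys (value, minValue, maxValue, unitCode), the edit-distance threshold of 2, and the greedy grouping, which is only one simple way to apply a Levenshtein threshold.

```python
import re
from collections import Counter

# Hypothetical conversion table: UN/CEFACT common code -> (base unit code, factor).
# Only a few mass and length codes are listed; a real table would be far larger.
TO_BASE = {
    "GRM": ("KGM", 0.001),  # gram -> kilogram
    "KGM": ("KGM", 1.0),    # kilogram
    "MMT": ("MTR", 0.001),  # millimetre -> metre
    "CMT": ("MTR", 0.01),   # centimetre -> metre
    "MTR": ("MTR", 1.0),    # metre
}


def normalize_name(name):
    """Lowercase a property name and replace punctuation runs with one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def consolidate_names(names, max_distance=2):
    """Map each normalized name to the most frequent name within max_distance edits."""
    counts = Counter(normalize_name(n) for n in names)
    representatives = []
    canonical = {}
    # Visit popular names first so that rare spellings attach to common ones.
    for name, _ in counts.most_common():
        match = next((r for r in representatives
                      if levenshtein(name, r) <= max_distance), None)
        if match is None:
            representatives.append(name)
            match = name
        canonical[name] = match
    return canonical


def canonicalize_value(pv):
    """Return (base_unit, low, high) for a PropertyValue-like dict, or None."""
    unit = TO_BASE.get(pv.get("unitCode"))
    if unit is None:
        return None  # missing or unknown unit code: ignore the value for now
    base_unit, factor = unit
    # Step 5: a point value fills both ends of the range.
    low = pv.get("value", pv.get("minValue"))
    high = pv.get("value", pv.get("maxValue"))
    try:
        return base_unit, float(low) * factor, float(high) * factor
    except (TypeError, ValueError):
        return None  # step 4: skip values that do not parse as numbers
```

With this, consolidate_names(["Weight", "weight", "Weigth", "Fuel type", "Fuel-Type", "Fueltype"]) groups the misspelling "weigth" under "weight" and "fueltype" under "fuel type", and canonicalize_value({"value": "450", "unitCode": "GRM"}) yields a kilogram range whose lower and upper bounds are equal.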


II. Usage

1. For vertical search applications, generate facets based on the most popular properties and value ranges from part I. I assume that eBay is using similar techniques for generating the dialogs shown in the attachment. You may also want to consider the paper "Structuring E-Commerce Inventory" by K. Mauge et al. from eBay Research Labs (PDF available at http://aclweb.org/anthology//P/P12/P12-1085.pdf).

2. For relevance in general Web search, you can link the factual knowledge to search terms, but I assume that Google knows better than I how to do that. For qualitative and boolean properties, this is pretty straightforward.

A page with

<div itemscope itemtype="http://schema.org/Car">
  <img itemprop="image" src="station_waggon123.jpg" />
  <span itemprop="name">Station Waggon 123</span>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
    <span itemprop="name">Sunroof</span>
    <meta itemprop="value" content="True">
  </div>
  <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
    <span itemprop="name">Fuel type</span>
    <link itemprop="value" href="http://dbpedia.org/resource/Diesel" />Diesel
  </div>
</div>

should be ranked highly for a query for "Station Wagon Diesel Sunroof" ;-)

3. In order to expand the potential, use 
- the recognition of product model entities (e.g. from manufacturer pages) and
- strong identifiers, like GTIN13, EAN, UPC on retail sites

for applying #1 and #2 to a broader set of pages. I.e. if you know that a certain shop sells a certain commodity, and you have model information from the manufacturer's site, replicate that. Examples of rules for doing this (using the original GoodRelations namespace) are here: http://wiki.goodrelations-vocabulary.org/Axioms (Rules 1.1.1 and 1.2.1).
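
Usage items 1 and 3 can be sketched in a few lines of Python as well. This is only an illustration under assumptions of mine: products are plain dicts with a "properties" mapping of already consolidated names to values, the key "gtin13" carries the strong identifier, and the rule that a shop's own statements win over inherited ones is a design choice, not something the GoodRelations axioms prescribe.

```python
from collections import Counter


def build_facets(products, top_n=3):
    """Pick the top_n most frequent property names and count their values."""
    name_counts = Counter(name for p in products for name in p["properties"])
    facets = {}
    for name, _ in name_counts.most_common(top_n):
        values = Counter(p["properties"][name]
                         for p in products if name in p["properties"])
        facets[name] = dict(values)
    return facets


def propagate_by_gtin(manufacturer_models, offers):
    """Copy manufacturer-provided properties onto shop offers sharing a GTIN."""
    by_gtin = {m["gtin13"]: m["properties"] for m in manufacturer_models}
    enriched = []
    for offer in offers:
        inherited = by_gtin.get(offer.get("gtin13"), {})
        # Properties stated by the shop itself take precedence over inherited ones.
        merged = {**inherited, **offer.get("properties", {})}
        enriched.append({**offer, "properties": merged})
    return enriched


# Made-up example data; the GTIN strings are placeholders, not real identifiers.
models = [{"gtin13": "0000000000017",
           "properties": {"fuel type": "diesel", "sunroof": "true"}}]
offers = [{"shop": "A", "gtin13": "0000000000017"},
          {"shop": "B", "gtin13": "0000000000024",
           "properties": {"fuel type": "petrol"}}]

enriched = propagate_by_gtin(models, offers)
facets = build_facets(enriched, top_n=2)
```

Here shop A's offer inherits both properties from the manufacturer's model page, shop B's offer keeps only what it stated itself, and the facets are then computed over the enriched offers.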

III. The Unreasonable Effectiveness of Data

1. Encourage leading manufacturers of commodities like consumer electronics, cars, ... to mark up their datasheets. These companies are dying to use structured data to better articulate their value proposition to Google. Tell them to speak to you in data.

2. Wait until many have done so.

3. Watch https://www.youtube.com/watch?v=yvDCzhbjYWs or speak to Peter Norvig.

I promise you that this type of data will be very delicious input for any future processing of e-commerce information from the Web.


;-)

Martin






> On Wednesday, April 30, 2014, martin.hepp@ebusiness-unibw.org <martin.hepp@ebusiness-unibw.org> wrote:
> Dear Francois-Paul:
> On 30 Apr 2014, at 09:14, Francois-Paul Servant <francoispaulservant@gmail.com> wrote:
> 
> > Dear Martin,
> >
> > some remarks regarding your proposal.
> >
> > Regarding the motivations:
> > - I agree that there is a strong motivation for such a proposal, and you name it in your second design principle: "No Lifting and Cleansing Barrier: Do not force site owners to lift or cleanse existing data."
> > You may have very precise data describing your products in a table that you could very well publish as it is, but it is difficult to map columns and cells to external vocabularies (if such vocabularies exist). It should be possible to lift the data later.
> 
> Great, thanks! I think automotive is a really nice example - we typically have lots of relevant car features, but it will be very tiring to define a global standard for all marketing-relevant features (and their authoritative translations etc.).
> 
> >
> > - I'm less convinced by the argument "generic extension mechanism for properties at the level of schema.org". As you note, using external properties is a problem in microdata. But it is not the case in RDFa or in JSON-LD: RDF, by itself, provides a generic pattern for exposing characteristics for entities. I don't think that it is a big effort for a site owner to mint a URI for an additional property.
> >
> That is perfectly fine from my perspective. In fact, this is just a by-product of the proposal and I wanted to disclose that properly.
> However, note that e.g. for smaller sites, it is indeed a -- at least perceived -- problem to mint a URI for an additional property. Even big automotive players had external support for defining their OWL vocabularies;-)
> 
> Think of hotels for instance - if they define room features, they will often not be able to use an existing URI nor define their own.
> 
> We are in agreement that
> 
> 1. in non-microdata syntax, external properties are in principle no problem and
> 2. in general, properly defined properties with a URI will be better, if available.
> 
> The proposal is about filling that gap.
> 
> 
> > Regarding the proposal itself: in order to avoid having to define many properties in schema.org, you propose an alternative, simplified way to write s p o triples when describing a resource s, using one and only one property, schema:additionalProperty, whose range is schema:PropertyValue. Basically, PropertyValue is a pair (property,value). You describe a PropertyValue using a few properties: schema:name, schema:value, schema:unitText, etc.
> >
> > I would keep and make explicit the (property,value) pair structure, using two dedicated properties (say): schema:property and schema:object, both with domain PropertyValue
> > Why? to make it possible to easily lift data published using schema:additionalProperty, in bulk.
> 
> If I understand you correctly, you are proposing to create individual nodes for the property name part and for the value part. I have looked at the proposal, but
> 
> - I see no gain in using a dedicated property node for the property name. If you already have a URI for the property, simply use propertyID with the URI of the property and omit the schema:name. That is as simple as your proposal.
> 
> - If you already have a URI for the value (e.g. for a qualitative value), you can use schema:value directly with that URI.
> 
> I will show that in your examples:
> 
> >
> > Let's take some of your examples to explain it:
> >
> > <div itemscope itemtype="http://schema.org/Product">
> >       <img itemprop="image" src="camera123.jpg" />
> >       <span itemprop="name">Digital Camera 123</span>
> >       <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
> >               <span itemprop="name">Approx. Weight</span>
> >               <span itemprop="value">450</span>
> >               <span itemprop="unitText">gram</span>
> >       </div>
> >       <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
> >               <span itemprop="name">Interface</span>:
> >               <span itemprop="value">USB</span>
> >       </div>
> > </div>
> >
> > that is, in Turtle (for readability):
> >
> > [     a schema:Product;
> >       schema:image x:camera123.jpg;
> >       schema:name "Digital Camera 123";
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:name "Approx. Weight";
> >               schema:value "450";
> >               schema:unitText "gram"
> >       ];
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:name "Interface";
> >               schema:value "USB";
> >       ]
> > ]
> 
> 
> Yes
> 
> >
> > I suggest to write instead:
> >
> > [     a schema:Product;
> >       schema:image x:camera123.jpg;
> >       schema:name "Digital Camera 123";
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:property [
> >                       schema:name "Approx. Weight"
> >               ];
> >               schema:object [
> >                       schema:value "450";
> >                       schema:unitText "gram"
> >               ]
> >       ];
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:property [
> >                       schema:name "Interface"
> >               ];
> >               schema:object [
> >                       schema:value "USB";
> >               ]
> >       ]
> > ]
> >
> > Not really different, not more difficult to produce, arguably more blank nodes.
> 
> In Microdata, it would be more difficult to produce. Also, we would need (or should then at least have) a type for these subnodes.
> 
> Your proposal in Microdata would look as follows:
> 
> <div itemscope itemtype="http://schema.org/Product">
>         <img itemprop="image" src="camera123.jpg" />
>         <span itemprop="name">Digital Camera 123</span>
>         <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>                 <div itemprop="property" itemscope itemtype="http://schema.org/Property">
>                         <span itemprop="name">Approx. Weight</span>
>                 </div>
>                 <div itemprop="object" itemscope itemtype="http://schema.org/StructuredValue">
>                         <span itemprop="value">450</span>
>                         <span itemprop="unitText">gram</span>
>                 </div>
>         </div>
>         <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>                 <div itemprop="property" itemscope itemtype="http://schema.org/Property">
>                         <span itemprop="name">Interface</span>:
>                 </div>
>                 <div itemprop="object" itemscope itemtype="http://schema.org/QuantitativeValue">
>                         <span itemprop="value">USB</span>
>                 </div>
>         </div>
> </div>
> 
> That is 21 lines, compared with 13 lines for the initial proposal:
> 
> <div itemscope itemtype="http://schema.org/Product">
>         <img itemprop="image" src="camera123.jpg" />
>         <span itemprop="name">Digital Camera 123</span>
>         <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>                 <span itemprop="name">Approx. Weight</span>
>                 <span itemprop="value">450</span>
>                 <span itemprop="unitText">gram</span>
>         </div>
>         <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>                 <span itemprop="name">Interface</span>:
>                 <span itemprop="value">USB</span>
>         </div>
> </div>
> 
> It is doable to modify the proposal, but from a Web markup perspective, I am not convinced. My main concern is not so much the additional code as such, but the experience that each additional level of nesting makes RDFa and Microdata coding more error-prone and intellectually more challenging.
> 
> Imagine doing this in a non-trivial table in RDFa or Microdata. It will be very painful.
> 
> 
> > The point is that in many cases, you have URIs for the values, or you can easily mint them from your own codification. And you can therefore easily produce, say:
> >
> > [     a schema:Product;
> >       schema:image x:camera123.jpg;
> >       schema:name "Digital Camera 123";
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:property foo:approxWeight;
> >               schema:object [
> >                       schema:value "450";
> >                       schema:unitText "gram"
> >               ]
> >       ];
> >       schema:additionalProperty [
> >               a schema:PropertyValue;
> >               schema:property foo:interface;
> >               schema:object foo:USB
> >       ]
> > ]
> > foo:approxWeight schema:name "Approx. Weight".
> > foo:interface schema:name "Interface".
> > foo:USB schema:value "USB".
> >
> > The advantage here is that this data can be later improved, for instance stating:
> >
> > foo:approxWeight rdfs:subPropertyOf schema:weight.
> > foo:USB owl:sameAs dbpedia:USB.
> >
> > this can be done without any impact on the source systems, on the actual production of the data, or on data that are already published: you can write the statements above once and lift all corresponding records at once.
> 
> I think we should separate the issue of consuming this data in RDF worlds from the perspective of mark-up. My assumption about consuming such data in RDF worlds is that with SPARQL CONSTRUCT rules (and a few heuristics), RDF-based consumers will transform the property-value pairs into local schemas in RDFS or OWL or map the data to existing vocabularies (like http://purl.org/vso/ns).
> 
> As long as the nodes are blank nodes, you cannot add a name later on anyway, so SPARQL CONSTRUCT works as well.
> 
> It may not be obvious, but we only disagree on the tiny little bit whether future lifting and cleansing should happen on the original node (often a BNode), or in a copy of that data in the target data structure.
> 
> Note also that in pure RDF worlds, including RDFa, there is no strong need to use the new pattern. You can always use proper RDF or OWL properties. The only downside is that search engines may skip such additional properties.
> 
> If you are referring to externally defined URIs for the value or property, you can directly use those:
> 
> <div itemscope itemtype="http://schema.org/Car">
>   <img itemprop="image" src="station_waggon123.jpg" />
>   <span itemprop="name">Station Waggon 123</span>
>   <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>           <span itemprop="name">Gearbox Type</span>:
>           <link itemprop="value" href="http://purl.org/vvo/ns#GearboxDSG" />VW DSG
>           <link itemprop="propertyID" href="http://purl.org/vvo/ns#gearbox" />
>   </div>
> </div>
> 
> In RDFa and JSON-LD, you could of course directly use the equivalent of
> 
> s vvo:gearbox vvo:GearboxDSG .
> 
> But even in this borderline case I think that my proposal has advantages, since a search engine can partly process the meta-data without fully understanding the external vocabulary.
> 
> 
> >
> > A question we would then ask is the question of rules that can be linked to the use of schema:additionalProperty. Is it equivalent to state:
> > s schema:additionalProperty [
> >       schema:property p;
> >       schema:object o
> > ]
> >
> > and s p o?
> >
> In my proposal: Formally, no. But a client would likely consolidate this.
> 
> However, I would like to keep the discussion of the exact processing of such data out of this thread, for eventually the sponsors of schema.org will have to decide whether and how they will use such mark-up.
> 
> 
> > Also note that in many cases, you actually don't care about the property. An example describing cars:
> > [     a vso:Vehicle;
> >       schema:additionalProperty [
> >               schema:object [ schema:name "Sunroof" ]
> >       ],[
> >               schema:object dbpedia:Diesel
> >       ]
> > ]
> >
> > but we probably would prefer to write something like:
> > [     a vso:Vehicle;
> >       schema:feature [ schema:name "Sunroof"],
> >       schema:feature dbpedia:Diesel
> > ]
> >
> 
> I think that Sunroof: Yes and fuel type: Diesel would be better and not more difficult to produce:
> 
> <div itemscope itemtype="http://schema.org/Car">
>   <img itemprop="image" src="station_waggon123.jpg" />
>   <span itemprop="name">Station Waggon 123</span>
>   <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>           <span itemprop="name">Sunroof</span>
>           <meta itemprop="value" content="True">
>   </div>
>   <div itemprop="additionalProperty" itemscope itemtype="http://schema.org/PropertyValue">
>           <span itemprop="name">Fuel type</span>:
>           <link itemprop="value" href="http://dbpedia.org/resource/Diesel" />Diesel
>   </div>
> </div>
> 
> 
> >
> > (note BTW that your use of schema:name for the PropertyValue is a bit incorrect, as you do not use it to label the PropertyValue pair, but the property. A schema:name for the second of the examples should probably be "Interface: USB" - but Ok, that's not important)
> That is a separate issue to discuss. I thought about schema:propertyName, but then again, it is in most cases redundant, and I see little harm in overloading schema:name here. I have added it to the list of issues.
> 
> 
> 
> > Best Regards,
> >
> > fps
> >
> 
> Thanks for your substantial feedback!
> 
> Best
> 
> Martin
> 
> > On 29 Apr 2014, at 11:42, martin.hepp@ebusiness-unibw.org wrote:
> >
> > Dear all:
> >
> > I have just finalized a proposal on how to add support for generic property-value pairs to schema.org. This serves three purposes:
> >
> > 1. It will allow exposing product feature information from thousands of product detail pages from retailers and manufacturers.
> > 2. It will simplify the development of future extensions for specific types of products and services, because we no longer need to standardize and define all relevant properties in schema.org and can instead defer the interpretation to the client.
> > 3. It will serve as a clean, generic extension mechanism for properties in schema.org
> >
> > The proposal with all examples is here:
> >
> >   https://www.w3.org/wiki/WebSchemas/PropertyValuePairs
> >
> > Your feedback will be very welcome.
> >
> > Best wishes / Mit freundlichen Grüßen
> >
> > Martin Hepp
> > -----------------------------------
> > martin hepp  http://www.heppnetz.de
> > mhepp@computer.org          @mfhepp
> >
> >
> >
> >
> >
> >
> >
> 
> 

Received on Wednesday, 30 April 2014 22:56:13 UTC