- From: Christian Bizer <chris@bizer.de>
- Date: Wed, 30 Oct 2013 09:57:30 +0100
- To: "'Peter Patel-Schneider'" <pfpschneider@gmail.com>
- Cc: "'Guha'" <guha@google.com>, "'Martin Hepp'" <martin.hepp@unibw.de>, "'W3C Vocabularies'" <public-vocabs@w3.org>
- Message-ID: <061e01ced54e$10d6c8c0$32845a40$@bizer.de>
Hi Peter,

while I agree that better documentation and examples are always a plus, I think the problem lies elsewhere.

Let's take the example of JobPostings again. Schema.org defines lots of nice properties for describing job postings, including "skills", "qualifications", and "responsibilities". But these properties are not used by the data providers, which mostly (on 50% of the sites that we examined) describe job postings using only the properties "title", "jobLocation", and "description". I think the reason for this lies in the schemata used by most of today's HR databases. All of these databases are likely to have a job title and a job description field, but many won't have skills, qualifications, and responsibilities fields. Moreover, the departments of a company deliver job postings to the HR department as free text, not nicely split into different fields.

So what do you do as a webmaster in charge of publishing your company's job postings on the Web? You edit the PHP script (or whatever script produces the HTML pages) and add Schema.org markup, as in the sketch at the end of this message. This is a ten-minute job. Convincing all the departments of your company to deliver job postings to you in a different, more structured format would be a large project, and the departments are likely not to cooperate, as they don't see the benefits of the whole endeavor. So the problem is not missing documentation or that the webmaster is stupid, but that the webmaster currently cannot do anything about it.

I think the adoption path of the more specialized properties will be as follows:

1. Many websites roughly mark up their content using a minimal set of schema.org terms. This is happening now.
2. The major search engines like Google extract "skills", "qualifications", and "responsibilities" from the free text of the description field using NLP techniques and start providing sophisticated job search features (similar to the features provided by specialized job portals today).
3. The departments of our example company recognize that the search engines make errors in guessing these features from the free text, and that their job postings are thus harder to find than those of a competitor.
4. Thus, they ask the HR or IT department what to do about this, and a process is started inside the company to capture job postings in a more structured way and to extend the current HR database with the required fields.

So the major driver for getting more structured data onto the Web is mainstream applications consuming it. The rich snippets provided by search engines today are a nice start, but I honestly hope that the major search engines are already working on features such as improved job search and that such features will be deployed soon. Especially for the job market, this is beneficial for everybody. Job seekers get better market transparency, as they don't need to visit different job portals anymore but can find all job postings in a single portal (the search engine). For companies offering jobs this is also better, as their ad reaches more people and they don't need to pay portals like Monster or StepStone thousands of dollars for the ad anymore.
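To make the contrast concrete, here is a minimal sketch of the two levels of markup in HTML microdata. The type and property names come from http://schema.org/JobPosting; the job posting itself and the surrounding HTML are invented for illustration.

```html
<!-- What roughly half of the examined sites publish today: only title,
     jobLocation, and description, with the skills and qualifications
     buried in the free-text description. -->
<div itemscope itemtype="http://schema.org/JobPosting">
  <h1 itemprop="title">PHP Developer</h1>
  <div itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
    <span itemprop="address">Mannheim, Germany</span>
  </div>
  <p itemprop="description">We are looking for a developer with several
    years of PHP and MySQL experience to maintain our web portal ...</p>
</div>

<!-- The richer markup schema.org already supports, which the webmaster
     could emit once the HR database captures these fields separately. -->
<div itemscope itemtype="http://schema.org/JobPosting">
  <h1 itemprop="title">PHP Developer</h1>
  <span itemprop="skills">PHP, MySQL, JavaScript</span>
  <span itemprop="qualifications">Several years of web development experience</span>
  <span itemprop="responsibilities">Maintain and extend the company web portal</span>
</div>
```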
Cheers,
Chris

From: Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
Sent: Wednesday, 30 October 2013 00:23
To: Christian Bizer
Cc: Guha; Martin Hepp; W3C Vocabularies
Subject: Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

The first kind of behaviour described below is, perhaps, a flaw, which can be fairly easily fixed by turning the text into a simple item. The main wrinkle is whether the text becomes the name of the item or a description of the item. The second kind of behaviour is not a flaw at all, in that there is good data on the pages. Of course, one might want to do better by analyzing the text on the page, but that isn't required. And why stop at the text in the schema.org property values? As suggested elsewhere, another way to improve the situation here is to have better examples, so that content providers can more easily produce better data.

The biggest issue with determining what works and what doesn't in schema.org is that there is no real description of either the data model of schema.org or the meaning (informal or formal) of data in this model. Hopefully this will be forthcoming shortly. (I have a short document on what I would hope the result looks like.)

peter

On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:

Hi Peter,

if you want two concrete examples illustrating the "quality of the understanding between the many minds (developers) in this eco-system" that Martin is talking about, here they are:

1. http://schema.org/JobPosting clearly states that the values of the property "hiringOrganization" should be of type "Organization". Nevertheless, 40% of the Schema.org JobPosting instances that we found on the Web contained a simple string as the value of this property.
2. http://schema.org/Product defines over 20 properties that can be used to describe products. Nevertheless, out of the 14,000 websites that we found to use this class, around 50% use only three properties to describe products: name, description, and image.

The first flaw can easily be fixed when pre-processing Schema.org data; the markup sketch after this message shows the repair. Fixing the second problem requires some more sophisticated NLP techniques to guess product features from the free text in the name and description fields, for instance if you want to do identity resolution in order to find out which of the 14,000 websites is offering a specific iPhone. No magic and no non-deterministic algorithms, just the normal dirty stuff that makes data integration successful, with the small but relevant difference that you know, because of the markup, that you are looking at product descriptions and not at arbitrary web pages.

If you (or anybody else) want to find out more about how schema.org is used in the wild, you can download 1.4 billion quads of Schema.org data originating from 140,000 websites from http://webdatacommons.org/ or take a look at https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf which gives you some basic statistics about commonly used classes and properties. So no need to be a major search engine to explore this space and get an understanding of the kind of knowledge modeling that is understood by average webmasters ;-)

Cheers,
Chris
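To illustrate the first flaw and the easy fix Peter describes (turning the text into a simple item, with the string becoming the item's name), here is a microdata sketch; the organization name "ACME GmbH" is invented for illustration.

```html
<!-- The flaw found on ~40% of the crawled JobPosting instances:
     a plain string where schema.org expects an Organization item. -->
<span itemprop="hiringOrganization">ACME GmbH</span>

<!-- The repair: the text is turned into a simple item, and the
     string becomes the item's name. -->
<span itemprop="hiringOrganization" itemscope
      itemtype="http://schema.org/Organization">
  <span itemprop="name">ACME GmbH</span>
</span>
```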
Received on Wednesday, 30 October 2013 08:57:57 UTC