RE: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org from LeVan,Ralph on 2013-10-30 (public-vocabs@w3.org from October 2013)

From: LeVan,Ralph <levan@oclc.org>
Date: Wed, 30 Oct 2013 18:57:41 +0000
To: Peter Patel-Schneider <pfpschneider@gmail.com>, Christian Bizer <chris@bizer.de>
CC: Guha <guha@google.com>, Martin Hepp <martin.hepp@unibw.de>, "W3C Vocabularies" <public-vocabs@w3.org>
Message-ID: <7d35f01306324f63869b87160991a8ff@BY2PR06MB091.namprd06.prod.outlook.com>

"Consume" is such a slippery word. If all you want to do is read a lot of possibly malformed HTML, then you can do that with a lot of tools. But the more data you hope to intelligently extract from that HTML, the more intellectual resources you're going to have to throw at interpreting the vast amount of garbage you get. So the big boys are likely to do better than you or me at the vast majority of the cruft out there. But we can do pretty well either with low expectations or by getting our data from domains that map well to our own models/business needs.

Your mileage may vary.
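
For the "low expectations" end of that range, a minimal sketch in Python might look like the following. It uses the standard library's lenient html.parser to pull itemprop values out of sloppy HTML. The class name ItemPropCollector and the sample markup are made up for illustration, and the approach (take the first text after each itemprop attribute) is a deliberate simplification, not a real microdata parser.

```python
from html.parser import HTMLParser


class ItemPropCollector(HTMLParser):
    """Collect (itemprop, first following text) pairs; hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.pending = None   # itemprop name still waiting for its text
        self.found = []       # list of (property, text) tuples

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("itemprop")
        if prop:
            self.pending = prop

    def handle_data(self, data):
        text = data.strip()
        if self.pending and text:
            self.found.append((self.pending, text))
            self.pending = None


# Invented sample: unquoted attribute, unclosed <h1> and <p>.
sample = (
    '<div itemscope itemtype="http://schema.org/JobPosting">'
    '<h1 itemprop=title>Data Engineer'
    '<p itemprop="jobLocation">Mannheim</div>'
)

parser = ItemPropCollector()
parser.feed(sample)
parser.close()
print(parser.found)
```

Nested items, itemprop values carried in attributes such as content or href, and multiple text nodes are all ignored here, which is exactly where the extra effort starts.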

Ralph

From: Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
Sent: Wednesday, October 30, 2013 2:50 PM
To: Christian Bizer
Cc: Guha; Martin Hepp; W3C Vocabularies
Subject: Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

Well, sure, getting more information into easy-to-consume form is a great idea, and there are paths towards this goal.

However, my question was whether consuming the data that is already in schema.org fields requires the resources of a major search company. I would certainly hope not, but some posts here seemed to point that way.
peter

On Wed, Oct 30, 2013 at 1:57 AM, Christian Bizer <chris@bizer.de> wrote:
Hi Peter,

while I agree that better documentation and examples are always a plus, I think the problem lies elsewhere.

Let's take the example of JobPostings again. Schema.org defines lots of nice properties for describing job postings, including "skills", "qualifications", and "responsibilities". But the data providers do not use these properties; they mostly (50% of the sites that we examined) describe job postings using the properties "title", "jobLocation", and "description".

I think that the reason for this is the schemata used by most of today's HR databases. All of these databases are likely to have a job title and a job description field, but many won't have skills, qualifications, and responsibilities fields. In addition, the departments of the companies deliver job postings to the HR department as free text, not nicely split into different fields.

So what do you do as a webmaster in charge of publishing your company's job postings on the Web?

You edit the PHP script or other script that produces the HTML pages and add Schema.org markup. This is a 10-minute job.
Convincing all the departments of your company to deliver job postings to you in a different, more structured format would be a large project and the departments are likely not to cooperate as they don't see the benefits of the whole endeavor.
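
To give a rough idea of how small that edit is, here is a sketch of such a page-producing function. It is written in Python rather than PHP, and the function name job_posting_html and the sample values are invented for illustration. It wraps the three commonly used properties in microdata attributes. Note that schema.org defines jobLocation as a Place; plain text is used here only to keep the sketch short.

```python
from html import escape


def job_posting_html(title: str, location: str, description: str) -> str:
    """Render one job posting as HTML with schema.org microdata attributes."""
    return (
        '<div itemscope itemtype="http://schema.org/JobPosting">\n'
        f'  <h1 itemprop="title">{escape(title)}</h1>\n'
        # Simplification: schema.org expects a nested Place here.
        f'  <p itemprop="jobLocation">{escape(location)}</p>\n'
        # The whole free-text posting goes into a single field.
        f'  <div itemprop="description">{escape(description)}</div>\n'
        '</div>'
    )


print(job_posting_html(
    "Data Engineer",
    "Mannheim",
    "Builds & maintains data pipelines. SQL and Python required.",
))
```

The skills and qualifications are still buried in the description string, because that is all the upstream database provides.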

So the problem is not missing documentation or that the webmaster is stupid, but that the webmaster currently cannot do anything about it.

I think the adoption path of the more specialized properties will be as follows:

1. Many websites roughly mark up their content using a minimal set of schema.org terms. This is happening now.

2. The major search engines like Google extract "skills", "qualifications", and "responsibilities" from the free text of the description field using NLP techniques and start providing sophisticated job search features (similar to the features provided by specialized job portals today).

3. The departments of our example company recognize that the search engines make errors in guessing the features from the free text and that their job postings are thus harder to find than the job postings of a competitor.

4. Thus, they ask the HR or IT department what to do about this, and a process is started inside the company to capture job postings in a more structured way and to extend the current HR database with the required fields.
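
As a toy illustration of the guessing in step 2 (and of why it goes wrong in step 3), the sketch below matches a small hand-picked vocabulary against the free-text description. The vocabulary KNOWN_SKILLS and the function guess_skills are hypothetical, and real NLP pipelines are of course far more elaborate than keyword matching.

```python
import re
from typing import List

# Hypothetical, hand-picked vocabulary; a real system would need far more.
KNOWN_SKILLS = ["python", "sql", "java", "project management"]


def guess_skills(description: str) -> List[str]:
    """Return the known skills that appear as whole words in the text."""
    text = description.lower()
    return [
        skill
        for skill in KNOWN_SKILLS
        if re.search(r"\b" + re.escape(skill) + r"\b", text)
    ]


print(guess_skills("We need strong SQL and Python; JavaScript is a plus."))
print(guess_skills("You will manage projects and talk to databases."))
```

The second call finds nothing, although a human reader would infer project management and probably SQL. That kind of miss is what would make a posting harder to find than a competitor's explicitly marked-up one.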

So the major driver for getting more structured data onto the Web is mainstream applications consuming it. The rich snippets provided by search engines today are a nice start, but I honestly hope that the major search engines are already working on features such as improved job search and that such features will be deployed soon.

Especially for the job market, this is beneficial for everybody. Job seekers get better market transparency, as they don't need to visit different job portals anymore but can find all job postings in a single portal (the search engine). For companies offering jobs this is also better, as their ad reaches more people and they don't need to pay portals like Monster or StepStone thousands of dollars for the ad anymore.

Cheers,

Chris

Received on Wednesday, 30 October 2013 18:58:13 UTC