- From: Christian Bizer <chris@bizer.de>
- Date: Wed, 30 Oct 2013 09:57:30 +0100
- To: "'Peter Patel-Schneider'" <pfpschneider@gmail.com>
- Cc: "'Guha'" <guha@google.com>, "'Martin Hepp'" <martin.hepp@unibw.de>, "'W3C Vocabularies'" <public-vocabs@w3.org>
- Message-ID: <061e01ced54e$10d6c8c0$32845a40$@bizer.de>
Hi Peter,

while I agree that better documentation and examples are always a plus, I think the problem lies elsewhere.

Let's take the example of JobPostings again. Schema.org defines lots of nice properties for describing job postings, including "skills", "qualifications", and "responsibilities". But these properties are not used by the data providers, which mostly (on 50% of the sites that we examined) describe job postings using only the properties "title", "jobLocation", and "description". I think the reason for this lies in the schemata used by most of today's HR databases. All of these databases are likely to have a job title and a job description field, but many won't have skills, qualifications, and responsibilities fields. Moreover, the departments of a company deliver job postings to the HR department as free text, not nicely split into different fields.

So what do you do as a webmaster in charge of publishing your company's job postings on the Web? You edit the PHP script (or whatever script produces the HTML pages) and add Schema.org markup, as in the sketch at the end of this message. This is a ten-minute job. Convincing all the departments of your company to deliver job postings to you in a different, more structured format would be a large project, and the departments are likely not to cooperate, as they don't see the benefits of the whole endeavor. So the problem is not missing documentation or that the webmaster is stupid, but that the webmaster currently cannot do anything about it.

I think the adoption path of the more specialized properties will be as follows:

1. Many websites roughly mark up their content using a minimal set of schema.org terms. This is happening now.
2. The major search engines like Google extract "skills", "qualifications", and "responsibilities" from the free text of the description field using NLP techniques and start providing sophisticated job search features (similar to the features provided by specialized job portals today).
3. The departments of our example company recognize that the search engines make errors in guessing these features from the free text, and that their job postings are thus harder to find than those of a competitor.
4. Thus, they ask the HR or IT department what to do about this, and a process is started inside the company to capture job postings in a more structured way and to extend the current HR database with the required fields.

So the major driver for getting more structured data onto the Web is mainstream applications consuming it. The rich snippets provided by search engines today are a nice start, but I honestly hope that the major search engines are already working on features such as improved job search and that such features will be deployed soon. Especially for the job market, this is beneficial for everybody. Job seekers get better market transparency, as they don't need to visit different job portals anymore but can find all job postings in a single portal (the search engine). For companies offering jobs this is also better, as their ad reaches more people and they don't need to pay portals like Monster or StepStone thousands of dollars for the ad anymore.
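To make the contrast concrete, here is a minimal sketch of the two levels of markup in HTML microdata. The type and property names come from http://schema.org/JobPosting; the job posting itself and the surrounding HTML are invented for illustration.

```html
<!-- What roughly half of the examined sites publish today: only title,
     jobLocation, and description, with the skills and qualifications
     buried in the free-text description. -->
<div itemscope itemtype="http://schema.org/JobPosting">
  <h1 itemprop="title">PHP Developer</h1>
  <div itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">
    <span itemprop="address">Mannheim, Germany</span>
  </div>
  <p itemprop="description">We are looking for a developer with several
    years of PHP and MySQL experience to maintain our web portal ...</p>
</div>

<!-- The richer markup schema.org already supports, which the webmaster
     could emit once the HR database captures these fields separately. -->
<div itemscope itemtype="http://schema.org/JobPosting">
  <h1 itemprop="title">PHP Developer</h1>
  <span itemprop="skills">PHP, MySQL, JavaScript</span>
  <span itemprop="qualifications">Several years of web development experience</span>
  <span itemprop="responsibilities">Maintain and extend the company web portal</span>
</div>
```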
Cheers,
Chris

From: Peter Patel-Schneider [mailto:pfpschneider@gmail.com]
Sent: Wednesday, 30 October 2013 00:23
To: Christian Bizer
Cc: Guha; Martin Hepp; W3C Vocabularies
Subject: Re: schema.org and proto-data, was Re: schema.org as reconstructed from the human-readable information at schema.org

The first kind of behaviour described below is, perhaps, a flaw, which can be fairly easily fixed by turning the text into a simple item. The main wrinkle is whether the text becomes the name of the item or a description of the item. The second kind of behaviour is not a flaw at all, in that there is good data on the pages. Of course, one might want to do better by analyzing the text on the page, but that isn't required. And why stop at the text in the schema.org property values? As suggested elsewhere, another way to improve the situation here is to have better examples, so that content providers can more easily produce better data.

The biggest issue with determining what works and what doesn't in schema.org is that there is no real description of either the data model of schema.org or the meaning (informal or formal) of data in this model. Hopefully this will be forthcoming shortly. (I have a short document on what I would hope the result looks like.)

peter

On Tue, Oct 29, 2013 at 12:57 PM, Christian Bizer <chris@bizer.de> wrote:

Hi Peter,

if you want two concrete examples illustrating the "quality of the understanding between the many minds (developers) in this eco-system" that Martin is talking about, here they are:

1. http://schema.org/JobPosting clearly states that the values of the property "hiringOrganization" should be of type "Organization". Nevertheless, 40% of the Schema.org JobPosting instances that we found on the Web contained a simple string as the value of this property.
2. http://schema.org/Product defines over 20 properties that can be used to describe products. Nevertheless, out of the 14,000 websites that we found to use this class, around 50% use only three properties to describe products: name, description, and image.

The first flaw can easily be fixed when pre-processing Schema.org data; the markup sketch after this message shows the repair. Fixing the second problem requires some more sophisticated NLP techniques to guess product features from the free text in the name and description fields, for instance if you want to do identity resolution in order to find out which of the 14,000 websites is offering a specific iPhone. No magic and no non-deterministic algorithms, just the normal dirty stuff that makes data integration successful, with the small but relevant difference that you know, because of the markup, that you are looking at product descriptions and not at arbitrary web pages.

If you (or anybody else) want to find out more about how schema.org is used in the wild, you can download 1.4 billion quads of Schema.org data originating from 140,000 websites from http://webdatacommons.org/ or take a look at https://github.com/lidingpku/iswc-archive/raw/master/paper/iswc-2013/82190017-deployment-of-rdfa-microdata-and-microformats-on-the-web-a-quantitative-analysis.pdf which gives you some basic statistics about commonly used classes and properties. So no need to be a major search engine to explore this space and get an understanding of the kind of knowledge modeling that is understood by average webmasters ;-)

Cheers,
Chris
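To illustrate the first flaw and the easy fix Peter describes (turning the text into a simple item, with the string becoming the item's name), here is a microdata sketch; the organization name "ACME GmbH" is invented for illustration.

```html
<!-- The flaw found on ~40% of the crawled JobPosting instances:
     a plain string where schema.org expects an Organization item. -->
<span itemprop="hiringOrganization">ACME GmbH</span>

<!-- The repair: the text is turned into a simple item, and the
     string becomes the item's name. -->
<span itemprop="hiringOrganization" itemscope
      itemtype="http://schema.org/Organization">
  <span itemprop="name">ACME GmbH</span>
</span>
```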
Received on Wednesday, 30 October 2013 08:57:57 UTC