Re: Testing Google's Rich Snippets RDFa support

Othar Hansson wrote on 2009-09-12:
> Thanks for the bug reports.

Thanks for your response! (Doesn't seem to have reached the CC'd 
public-rdf-in-xhtml-tf list yet, so I'll quote it in full here.)

> We should be clearer about the purpose of the preview tool.  It's to
> give webmasters a preview  of the rich snippet that we would produce
> based on the data found on the page.  As an aid to debugging, we show
> what we parsed from the page.  We should point people elsewhere if
> they want full RDFa validation.

That purpose is fine - I don't expect it to be a full RDFa processor 
displaying output triples or anything like that. But I expect the data 
it extracts from a page (as used when generating the snippet preview, 
and shown in the debugging output) to be 'compatible' with a conforming 
RDFa processor, in the sense that the same data can (in theory) be 
derived entirely by applying some transformation to the RDF triples 
generated by a conforming RDFa processor.

Firstly, if Google's processor extracts data from a page that is not 
extracted by a real RDFa processor, then people will write pages with 
incorrect/invalid RDFa (e.g. they might use a wrong namespace URI, like 
Google's own documentation did when it was first released), test it in 
Google's tool, see that the output is correct, and think that everything 
is fine and that they're supporting the RDFa standard. The rest of the 
RDF community, using real RDFa processors, will be unable to parse and 
use that incorrectly marked-up data.
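For example (a minimal sketch; the URIs here are purely illustrative, 
not necessarily the ones from Google's documentation), a page that 
declares a prefix with a slightly wrong vocabulary URI can still look 
plausible to a lenient tool, but a conforming RDFa processor will 
generate triples with the wrong predicate URIs:

   <!-- wrong: the prefix URI is missing its trailing "/#", so a
        conforming processor expands v:name to
        http://rdf.data-vocabulary.orgname -->
   <div xmlns:v="http://rdf.data-vocabulary.org">
     <span property="v:name">John Smith</span>
   </div>

   <!-- correct declaration for the same markup -->
   <div xmlns:v="http://rdf.data-vocabulary.org/#">
     <span property="v:name">John Smith</span>
   </div>

If the preview tool accepts the first form, the author never finds out 
that consumers following the spec will see the wrong predicates.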

As I see it, the purpose of a standard like RDFa is to ensure 
interoperability between producers and consumers in order to maximise 
the amount of data that can be extracted from the web, and this is 
compromised if some people break interoperability by doing things 
differently (particularly if it's someone prominent like Google with 
significant influence over content producers).

Secondly, if Google's processor extracts data from one page but fails on 
another page, when those pages are equivalent from the perspective of 
RDFa (generating the same set of RDF triples), then it may work for 
people who copy-and-paste examples from the documentation but it will 
mislead and confuse producers who actually understand RDFa. They will 
write something that works, modify it in a way that the RDFa 
documentation and RDFa tools say should make no difference (e.g. moving 
some text into a @content attribute, or using the <a xmlns:http="http:" 
rel="http://www..."> trick to make CURIEs that look like full URIs), and 
it will unexpectedly break in Google's processor.
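For example (a minimal sketch; the vocabulary prefix v: and the URLs 
are illustrative), a conforming RDFa processor generates exactly the 
same triple from the first two spans below, and the <a> element shows 
the CURIE trick mentioned above, which is equally valid RDFa:

   <!-- equivalent: @content overrides the element's text content as
        the literal value, so both spans produce the same v:name
        triple -->
   <span property="v:name">John Smith</span>
   <span property="v:name" content="John Smith">J. Smith</span>

   <!-- the xmlns:http trick: with the prefix "http" bound to "http:",
        the @rel value below is a valid CURIE that expands back to the
        full URI http://www.example.org/knows -->
   <a xmlns:http="http:" rel="http://www.example.org/knows"
      href="http://example.org/jane">Jane Doe</a>

A processor that handles one of these forms but not the others is not 
consuming RDFa as specified.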

Instead they'll have to learn a new (and currently undocumented) syntax 
that is the intersection of what RDFa and Google support, making it much 
more complex and more restrictive than if Google supported RDFa in a 
compatible way.


I'm not personally a proponent of RDFa, and I have no strong feelings 
against Google using proprietary or non-RDFa markup (or proper RDFa) 
for this kind of thing. I just don't like it being promoted (by Google 
and by RDFa supporters) as "RDFa" when it suffers from these problems 
through disregarding the standard, and it seems to me (after looking 
into the details) that it will hurt the RDFa community if the problems 
are not resolved.

At least with proprietary markup, someone could write a tool that parses 
the data into RDF alongside a normal RDFa parser, and run both parsers 
over arbitrary web pages (which would be a bit of a pain but would be 
possible). As it is now, it's impossible to write a tool that extracts 
the same data as Google without violating the RDFa spec and generating 
incorrect output from some valid RDFa pages.

> We surely have errors in our parsing (thanks for finding several:
> we'll look into these on Monday).  But we will also deviate from the
> standard in some cases to be forgiving of webmaster errors.  For
> example, we expect that some webmasters will forget the xmlns
> attribute entirely.

"we will [...] deviate from the standard" makes me believe that the 
above problems are an unavoidable consequence of Google's intentions, 
rather than just unintentional transient fixable bugs, and therefore are 
a serious concern (which is why I'm writing about it like this rather 
than just listing bugs).

Are you going to propose (or have you already proposed) these deviations 
as an update to the RDFa Recommendation (or as a new competing standard, 
or at least as a Google-hosted specification so it's documented 
somewhere)? (I'm mostly just a bystander, not an active participant in 
anything RDFa-related, so I might have missed some existing discussions 
about this.)

If not, it would seem like an unexpected disregard for standards and 
interoperability, so I'm hoping that's not the case!

> --Othar
> (@google)

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Tuesday, 15 September 2009 16:09:49 UTC