
[whatwg] Trying to work out the problems solved by RDFa

From: Charles McCathieNevile <chaals@opera.com>
Date: Sun, 04 Jan 2009 12:49:09 +1100
Message-ID: <op.um7lz7ufwxe0ny@widsithpro.lan>
On Sun, 04 Jan 2009 03:51:53 +1100, Calogero Alex Baldacchino  
<alex.baldacchino at email.it> wrote:

> Charles McCathieNevile ha scritto:
> ... it shouldn't be too difficult to create a custom parser, conforming  
> to the RDFa spec and making use of data-* attributes...
>
> That is, since RDFa can be "emulated" somehow in HTML5 and tested  
> without changing the current specification, perhaps there isn't a strong  
> need for early adoption of the former; instead, an "emulated"  
> integration might be tested first within the current timeline.

In principle this is possible. But the data-* attributes are designed for  
private usage, and introducing a public usage creates a risk of clashes  
that would pollute RDFa data gathered this way. In other words, this is  
indeed feasible, but one would expect it to show that the data generated  
was unreliable (unless nobody happens to use basic terms like "about"  
privately). Results like that have been used to argue that poorly  
implemented features should be dropped, but this hypothetical case  
suggests to me that the argument is wrong: if people use these features  
even in the face of clear reasons why the resulting data would be bad,  
one might expect better usage from formalising the status of such  
features and getting decent implementations.
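To make the clash risk concrete, here is a tiny Python sketch of what harvesting RDFa-like triples out of data-* attributes might look like. The attribute names (data-about, data-property, data-content) are purely illustrative, not any proposed convention; any page already using those names privately would be silently misread as metadata.

```python
from html.parser import HTMLParser

class DataAttrTripleParser(HTMLParser):
    """Collect (subject, property, value) triples from data-* attributes.

    Illustrative only: a page that already uses e.g. data-about for its
    own private scripting would pollute the harvested data -- exactly
    the clash described above.
    """
    def __init__(self):
        super().__init__()
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-about" in a and "data-property" in a:
            self.triples.append(
                (a["data-about"], a["data-property"], a.get("data-content", "")))

p = DataAttrTripleParser()
p.feed('<span data-about="#me" data-property="foaf:name" '
       'data-content="Charles McCathieNevile"></span>')
print(p.triples)
```

Nothing stops two sites from using the same data-* names for unrelated purposes, which is the point: without a formalised status for the attributes, a harvester cannot tell intended metadata from private state.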

>>> What is the cost of having different data use specialised formats?
>>
>> If the data model, or a part of it, is not explicit as in RDF but is  
>> implicit in the code made to process it (as is the case when scripts  
>> process things stored in arbitrarily named data-* attributes, and also  
>> when undocumented or semi-documented XML formats are used), it requires  
>> people to understand the code as well as the data model in order to use  
>> the data. In a corporate situation where hundreds or tens of thousands  
>> of people are required to work with the same data, this makes the data  
>> model very fragile.
>>
>
> I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (XML)  
> properties and attributes (in the form of CURIEs) to RDF concepts,  
> modelling a certain kind of relationship, while relying on external  
> schemata to define those properties. Any undocumented or semi-documented  
> XML format may lead to misuses and, thus, to unreliably modelled data,
...

> I think the same applies to data-* attributes, because _they_ describe  
> data (and data semantics) in a custom model and thus _they_ need to be  
> documented for others to be able to manipulate them; the use of a custom  
> script rather than a built-in parser does not change much from this  
> point of view.

RDFa binds data to RDF. RDF provides a well-known schema language with  
machine-processable definitions of vocabularies and of how to merge  
information between them. In other words, if you get the underlying model  
for your data right enough, people will be able to use it without needing  
to know what you do internally.

Naturally not everyone will get their data model right, and naturally not  
all information will be reliable anyway. However, it would seem to me that  
making it harder to merge the data in the first place does not assist in  
determining whether it is useful. On the other hand, certain forms of RDF  
data such as POWDER, FOAF, Dublin Core and the like have been very  
carefully modelled, and are relatively well-known and re-used in other  
data models. Making it easy to parse and merge this data according to  
the existing, well-developed models seems valuable.
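The merging step itself is mechanical once the vocabularies are shared. A minimal sketch, with invented example URIs and the familiar dc:/foaf: prefixes used purely for illustration: two sources that know nothing about each other's code can still be combined by subject URI.

```python
from collections import defaultdict

# Triples from two independent sources, using shared vocabularies
# (Dublin Core and FOAF prefixes here; URIs are invented examples).
source_a = [("http://example.org/doc", "dc:title", "RDFa notes")]
source_b = [("http://example.org/doc", "foaf:maker", "http://example.org/#chaals")]

# Merge by subject: no knowledge of either source's internals is needed,
# only the shared data model.
merged = defaultdict(dict)
for subject, prop, value in source_a + source_b:
    merged[subject][prop] = value

print(merged["http://example.org/doc"])
```

The point of well-modelled vocabularies like FOAF and Dublin Core is exactly that this merge is safe: both sources mean the same thing by the same property names.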


>> Ian wrote:
>>> For search engines, I am not convinced. Google's experience is that
>>> natural language processing of the actual information seen by the  
>>> actual end user is far, far more reliable than any source of metadata.
>>> Thus from Google's perspective, investing in RDFa seems like a poorer
>>> investment than investing in natural language processing.
>>
>> Indeed. But Google is something of an edge case, since they can afford  
>> to run a huge organisation with massive computing power and many  
>> engineers to address a problem where a "near-enough" solution brings  
>> them the users who are, in turn, the product they sell to advertisers.  
>> There are many other use cases where a small group of people want a way  
>> to reliably search trusted data.
>>
>
> I think the point with general-purpose search engines is another one:  
> natural language processing, while expensive, gives a far more accurate  
> solution than RDFa and/or any other kind of metadata can bring to a  
> problem where data must never need to be trusted (and where, instead, a  
> data processor must be able to determine the data's level of trust  
> without any external aid).

No, I don't think so. Google searches based on analysis of the open web  
are *not* generally more reliable than faceted searches over a reliable  
dataset, and in some instances are less reliable.

The point is that only a few people can afford to invest in being a  
general-purpose search engine, whereas many can afford to run a  
metadata-based search system over a chosen dataset that responds to their  
needs (and doesn't require either publishing their data or paying Google  
to index it).
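What such a metadata-based search looks like in miniature: exact matching over declared facets of a small, trusted dataset, with no natural language processing at all. The field names and records below are invented for illustration.

```python
# A trusted, locally chosen dataset described with explicit metadata
# (records and property names are invented examples).
records = [
    {"uri": "a", "dc:creator": "Alice", "dc:type": "report", "dc:date": "2008"},
    {"uri": "b", "dc:creator": "Bob",   "dc:type": "report", "dc:date": "2007"},
    {"uri": "c", "dc:creator": "Alice", "dc:type": "memo",   "dc:date": "2008"},
]

def faceted_search(records, **facets):
    """Return records whose metadata matches every requested facet exactly.

    Keyword names use _ for : (dc_creator -> dc:creator), since : is not
    legal in a Python identifier.
    """
    return [r for r in records
            if all(r.get(k.replace("_", ":")) == v for k, v in facets.items())]

result = faceted_search(records, dc_creator="Alice", dc_type="report")
print([r["uri"] for r in result])
```

Because the dataset is chosen and trusted, the results are exact; there is no relevance guessing, which is why this can be more reliable than open-web search for the data it covers.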

> Since there is no "direct" relationship between the semantics expressed  
> by RDFa and the real semantics of a web page's content, relying on RDFa  
> metadata would lead to widespread cheating, as happened when the  
> keywords meta tag was introduced.

Sure. There would also be many, many cases of organisations using decent  
metadata, as with existing approaches. My point was that I don't expect  
Google to naively trust metadata it finds on the open web, and in the  
general case probably not even to look at it. However, Google is not the  
measure of the Web; it is a company that sells advertising based on  
information it has gleaned about users by offering them services.

So the fact that some things on the Web are not directly beneficial to  
Google isn't that important. I do not see how the presence of explicit  
metadata threatens Google any more than the presence of plain text (which  
can also be misleading).

> Thus, a trust chain/evaluation mechanism (such as the use of signatures)  
> would be needed,

Indeed such a thing is needed for a general-purpose search engine. But  
there are many cases where an alternative is fine. For example, T-Mobile  
publishes POWDER data about web pages. Opera doesn't need to believe all  
the POWDER data it finds on the Web in order to improve its offerings  
based on T-Mobile's data, if we can decide how to read that specific  
data. That can be done by deciding that we trust a particular set of URIs  
more than others. No signature is necessary, beyond the already  
ubiquitous TLS and the idea that we trust people we have a relationship  
with and whose domains we know.
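The "trust a particular set of URIs" idea amounts to a filter on where a statement came from, not on what it says. A sketch in Python, with invented hostnames standing in for the sources:

```python
from urllib.parse import urlparse

# Hosts we have an existing relationship with (invented example names).
TRUSTED_HOSTS = {"metadata.t-mobile.example"}

# (source URI, subject page, claim) -- claims found on the open Web,
# some from trusted sources and some not.
statements = [
    ("https://metadata.t-mobile.example/powder.xml", "page1", "mobileOK"),
    ("https://random-site.example/powder.xml",       "page2", "mobileOK"),
]

# Keep only statements whose source host is on the trusted list; the
# claim itself is never inspected, only its provenance.
trusted = [(src, subj, claim) for src, subj, claim in statements
           if urlparse(src).hostname in TRUSTED_HOSTS]
print(trusted)
```

This is the whole mechanism: a whitelist of sources plus transport security, with no signature infrastructure required.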

> My concern is that any data model requiring some level of trust to  
> achieve well-working interoperability may address only very small (and  
> niche) use cases, and even if many such niche use cases could be grouped  
> into a whole category consistently addressed by RDFa (perhaps alongside  
> other models), the result might not be a significant enough use case to  
> fit the actual specification guidelines (which are somehow hostile to  
> (XML) extensibility, as far as I've understood them) -- though they  
> might be changed when and if really needed.

A concern of mine is that it is unclear what the required level of  
usefulness is. The "google highlight" element (once called m, but I think  
it changed its name again) is currently in the spec; the longdesc  
attribute currently isn't. I presume these facts boil down to judgement  
calls by the editor while the spec is still an early draft, but it is not  
easy to understand what information would determine whether something is  
"sufficiently important". That makes it hard to determine whether it is  
worth the considerable investment of discussing it in this group, or  
easier to just go through the W3C process of objecting later on.

cheers

Chaals

-- 
Charles McCathieNevile  Opera Software, Standards Group
     je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals       Try Opera: http://www.opera.com
Received on Saturday, 3 January 2009 17:49:09 UTC
