[whatwg] Trying to work out the problems solved by RDFa from Calogero Alex Baldacchino on 2009-01-03 (public-whatwg-archive@w3.org from January 2009)

From: Calogero Alex Baldacchino <alex.baldacchino@email.it>
Date: Sat, 03 Jan 2009 17:51:53 +0100
Message-ID: <495F97A9.1080602@email.it>
Charles McCathieNevile wrote:
>>> The results of the first set of Microformats efforts were some pretty
>>> cool applications, like the following one demonstrating how a web
>>> browser could forward event information from your PC web browser to
>>> your phone via Bluetooth:
>>>
>>> http://www.youtube.com/watch?v=azoNnLoJi-4
>>
>> It's a technically very interesting application. What has the adoption
>> rate been like? How does it compare to other solutions to the problem,
>> like CalDav, iCal, or Microsoft Exchange? Do people publish calendar
>> events much? There are a lot of Web-based calendar systems, like
>> MobileMe or WebCalendar. Do people expose data on their Web page that
>> can be used to import calendar data to these systems?
>
> In some cases this data is indeed exposed to Webpages. However, 
> anecdotal evidence (which unfortunately is all that is available when 
> trying to study the enormous collections of data in private intranets) 
> suggests that this is significantly more valuable when it can be done 
> within a restricted access website.
>
> ...
>>> In short, RDFa addresses the problem of a lack of a standardized
>>> semantics expression mechanism in HTML family languages.
>>
>> A standardized semantics expression mechanism is a solution. The lack 
>> of a solution isn't a problem description. What's the problem that a
>> standardized semantics expression mechanism solves?
>
> There are many many small problems involving encoding arbitrary data 
> in pages - apparently at least enough to convince you that the data-* 
> attributes are worth incorporating.
>
> There are many cases where being able to extract that data with a 
> simple toolkit from someone else's content, or using someone else's 
> toolkit without having to tell them about your data model, solves a 
> local problem. The data-* attributes, because they do not represent a 
> formal model that can be manipulated, are insufficient to enable 
> sharing of tools which can extract arbitrary modelled data.
>

That's because the data-* attributes are meant to create custom models
for custom use cases that do not (necessarily) involve interchange or
(let me say) "agnostic extraction" of data. However, data-* attributes
might be used to "emulate" support for RDFa attributes, so that each
RDFa attribute maps to, let's say, a "data-rdfa-<attribute>" one and
vice versa. I don't think "data-rdfa-about" vs "about" would make a
great difference, at least in a test phase, since it wouldn't be much
different from "rdfa:about", which might be used to embed RDFa
attributes in some XML language (e.g. an "external" markup embedded in
an XHTML document through the extension mechanism).
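
A minimal sketch of such a name mapping, in Python. The "data-rdfa-"
prefix is only the convention suggested above, not part of any spec, and
the set below lists just a subset of the RDFa attribute names; the
function names are hypothetical.

```python
# Hypothetical convention: "data-rdfa-<attribute>" mirrors RDFa's "<attribute>".
DATA_PREFIX = "data-rdfa-"

# A subset of RDFa attribute names, used here only for illustration.
RDFA_ATTRIBUTES = {"about", "property", "content", "datatype", "resource", "typeof"}


def to_data_attribute(name):
    """Map an RDFa attribute name to its emulated data-* name."""
    if name not in RDFA_ATTRIBUTES:
        raise ValueError("not a known RDFa attribute: " + name)
    return DATA_PREFIX + name


def from_data_attribute(name):
    """Map an emulated data-* name back to the RDFa name, or return None."""
    if name.startswith(DATA_PREFIX):
        candidate = name[len(DATA_PREFIX):]
        if candidate in RDFA_ATTRIBUTES:
            return candidate
    return None
```

With this, "about" becomes "data-rdfa-about" and back again, while an
unrelated attribute such as "data-foo" is simply left alone.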

It seems there are several problems that RDFa may address (alongside
other, more custom models) for organization-wide internal use and
intranet publication, without any explicit requirement of external
interchange. So, when both HTML5-specific features and RDFa attributes
are felt to be necessary, it shouldn't be too difficult to create a
custom parser, conforming to the RDFa spec and making use of data-*
attributes, to be plugged into a browser supporting HTML5 (and data-*).
It could be tested internally first, then exposed to the community, so
that HTML5+RDFa can be tested on a wider scale (especially once similar
parsers are provided for all main browsers). Widespread adoption would
then point out an effective need to merge RDFa into the HTML5 spec (or
to standardize an approach based on data-* attributes).
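
To give a rough idea of the kind of parser meant here, the following is
a deliberately simplified sketch in Python, not a conforming RDFa
processor. It assumes the hypothetical "data-rdfa-" naming, handles only
emulated about/property/content attributes, expects well-nested markup,
and does no CURIE expansion; the class and function names are made up
for illustration.

```python
from html.parser import HTMLParser

PREFIX = "data-rdfa-"  # hypothetical convention, not part of any spec
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class DataRdfaExtractor(HTMLParser):
    """Collect (subject, property, value) triples from data-rdfa-* attributes."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subjects = [""]  # subject in scope per open element; "" = the page
        self._open = []        # literals still collecting text

    def handle_starttag(self, tag, attrs):
        rdfa = {name[len(PREFIX):]: (value or "")
                for name, value in attrs if name.startswith(PREFIX)}
        # An emulated "about" sets a new subject; otherwise inherit the parent's.
        subject = rdfa.get("about", self._subjects[-1])
        prop = rdfa.get("property")
        if prop is not None and "content" in rdfa:
            # An explicit content attribute wins over the element text.
            self.triples.append((subject, prop, rdfa["content"]))
            prop = None
        if tag in VOID:
            return  # void elements have no end tag, so nothing to track
        self._subjects.append(subject)
        if prop is not None:
            self._open.append([len(self._subjects), subject, prop, []])

    def handle_data(self, data):
        for entry in self._open:
            entry[3].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or len(self._subjects) == 1:
            return
        if self._open and self._open[-1][0] == len(self._subjects):
            _, subject, prop, parts = self._open.pop()
            self.triples.append((subject, prop, "".join(parts).strip()))
        self._subjects.pop()


def extract_triples(markup):
    """Parse a markup string and return the list of collected triples."""
    parser = DataRdfaExtractor()
    parser.feed(markup)
    parser.close()
    return parser.triples
```

For example, feeding it a div carrying data-rdfa-about with a nested
span carrying data-rdfa-property yields one triple whose value is the
text of the span.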

That is, since RDFa can be "emulated" somehow in HTML5 and tested
without changing the current specification, perhaps there isn't a strong
need for an early adoption of the former; instead, an "emulated"
merge might be tested first within the current timeline.

>> What is the cost of having different data use specialised formats?
>
> If the data model, or a part of it, is not explicit as in RDF but is
> implicit in code made to treat it (as is the case with using scripts
> to process things stored in arbitrarily named data-* attributes, and
> is also the case in using undocumented or semi-documented XML formats),
> it requires people to understand the code as well as the data model in
> order to use the data. In a corporate situation where hundreds or tens
> of thousands of people are required to work with the same data, this
> makes the data model very fragile.
>

I'm not sure RDF(a) solves such a problem. AIUI, RDFa just binds (XML)
properties and attributes (in the form of CURIEs) to RDF concepts,
modelling a certain kind of relationship, whereas it relies on external
schemata to define such properties. Any undocumented or semi-documented
XML format may lead to misuse and, thus, to unreliably modelled data,
and it is not clear to me how just creating an explicit relationship
between properties is enough to ensure that a property really represents
a subject and not a predicate or an object (in its wrongly documented
schema), if the problem is the correct definition of the properties
themselves. Perhaps it is enough to parse them, and perhaps it can
"inspire" a better definition of the external schemata (if the RDFa
"vision" of data as triples suits the actual data to be modelled), but
if the problem is the right understanding of "what represents what"
because of a lack of documentation, I think that's something RDF/RDFa
can't solve.
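
As a small illustration of what that binding amounts to, here is a
sketch of CURIE expansion in Python. The function name and the prefix
table are assumptions made for the example; the "dc" mapping is the
usual Dublin Core elements namespace.

```python
def expand_curie(curie, prefixes):
    """Expand "prefix:reference" into a full IRI using a prefix table."""
    prefix, separator, reference = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError("cannot expand CURIE: " + curie)
    return prefixes[prefix] + reference


# Example prefix table (an assumption for this sketch).
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}
```

Expanding "dc:title" this way gives a precise identifier for the
property, but what that property means still has to come from the
documentation of the external vocabulary.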

I think the same applies to data-* attributes, because _they_ describe 
data (and data semantics) in a custom model and thus _they_ need to be 
documented for others to be able to manipulate them; the use of a custom 
script rather than a built-in parser does not change much from this 
point of view.


> [not clear what the context was here, so citing as it was]
>>> > I don't think more metadata is going to improve search engines. In
>>> > practice, metadata is so highly gamed that it cannot be relied upon.
>>> > In fact, search engines probably already "understand" pages with far
>>> > more accuracy than most authors will ever be able to express.
>>>
>>> You are correct, more erroneous metadata is not going to improve search
>>> engines. More /accurate/ metadata, however, IS going to improve search
>>> engines. Nobody is going to argue that the system could not be gamed. I
>>> can guarantee that it will be gamed.
>>>
>>> However, that's the reality that we have to live with when introducing
>>> any new web-based technology. It will be mis-used, abused and
>>> corrupted. The question is, will it do more good than harm? In the
>>> case of RDFa /and/ Microformats, we do think it will do more good than
>>> harm.
>>
>> For search engines, I am not convinced. Google's experience is that
>> natural language processing of the actual information seen by the actual
>> end user is far, far more reliable than any source of metadata. Thus
>> from Google's perspective, investing in RDFa seems like a poorer
>> investment than investing in natural language processing.
>
> Indeed. But Google is something of an edge case, since they can afford 
> to run a huge organisation with massive computer power and many 
> engineers to address a problem where a "near-enough" solution brings 
> them the users who are in turn the product they sell to advertisers.
> There are many other use cases where a small group of people want a 
> way to reliably search trusted data.
>

I think the point with general purpose search engines is another one:
natural language processing, while expensive, gives a far more accurate
solution than RDFa and/or any other kind of metadata can bring to a
problem where data can never be assumed to be trustworthy (and where,
instead, a data processor must be able to determine the data's level of
trust without any external aid). Since there is no "direct" relationship
between the semantics expressed by RDFa and the real semantics of a web
page's content, relying on RDFa metadata would lead to widespread
cheating, as happened when the keywords meta tag was introduced. Thus, a
trust chain/evaluation mechanism (such as the use of signatures) would
be needed, and so a general purpose search engine relying on RDFa would
seem to work more like a search directory, where human beings analyse
content to classify pages, giving more accurate results, but also a
smaller and very slowly growing database of classified sites (since
obviously there will always be far more sites that don't care about
metadata and/or about making their metadata trusted than sites using
trusted RDFa metadata).

(The same reasoning may apply to a local search made by a browser in its
local history: results are only as reliable as the expressed semantics,
that is, only as far as its source is reasonably trusted, which may not
be true in general. Without a trust evaluation mechanism, misuse and
deliberate abuse would be the most common case, and such a mechanism
would, in turn, restrict the number of pages where the presence of
RDF(a) metadata is really helpful.)

My concern is that any data model requiring some level of trust to
achieve well-working interoperability may address only very small (and
niche) use cases. Even if a lot of such niche use cases might be grouped
into a whole category consistently addressed by RDFa (perhaps alongside
other models), the result might not be a significant enough use case to
fit the current specification guidelines (which are somewhat hostile to
(XML) extensibility, as far as I've understood them) -- though they
might be changed when and if really needed.

Best regards,
Alex
 
 
Received on Saturday, 3 January 2009 08:51:53 UTC