Re: Identification of RDFa content from Ivan Herman on 2006-11-25 (public-rdf-in-xhtml-tf@w3.org from November 2006)

From: Ivan Herman <ivan@w3.org>
Date: Sat, 25 Nov 2006 10:15:39 +0100
To: mark.birbeck@x-port.net
CC: Ben Adida <ben@mit.edu>, public-rdf-in-xhtml task force <public-rdf-in-xhtml-tf@w3.org>
Message-ID: <456809BB.3010108@w3.org>

Hi Mark,

your points are well taken and there is no disagreement anywhere. Let us
say: we are in wild agreement on the fundamental approach, i.e., that an
(X)HTML document *is* RDFa!

For me the issue is really a question of efficiency rather than anything
else. The use case below may be a bit artificial, but nevertheless. Let
us say I have a foaf file (in good old RDF/XML format), which includes
the following statement:

<rdfs:seeAlso rdf:resource="http://www.w3.org/People/Ivan/"/>

And I feed this into one of those programs that try to include all
seeAlso-s into their final triple store. Tabulator is an example or, say,
the server set up by Chris Bizer in Berlin is another one (I will use
'tabulator' as a generic example).

What would a tabulator do? It would then

- look at that URI, see from the return header that it is not
application/rdf+xml or n3 (I am not sure there is a mime type for that
one...). It sees that it is some form of HTML (let us not go into the
details of which one)

- It has then the choice of:

    1. look at the start of the document (essentially, sniff it) to see
if there is a GRDDL profile; in this case it will GRDDL it and add the
result to its triple store
    2. *parse* the whole document with an RDFa parser to, possibly, get
an empty triple store
    3. if both of these steps yield an empty result, then it will give
up extending its triple store and add an external link to its user
interface.
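
To make the steps above concrete, here is a rough sketch in Python. It is
only an illustration under assumptions: the function names (sniff_profiles,
extend_store) are made up, the GRDDL transformation and the RDFa parser are
passed in as stand-in callables rather than real libraries, and the profile
URI is the GRDDL data-view one as far as I recall. What it tries to show is
that step 1 only needs the <head> start tag, whereas step 2 has to go through
the whole document.

```python
from html.parser import HTMLParser
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

# Assumption: the GRDDL data-view profile URI; check it against the GRDDL spec.
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


class _StopSniffing(Exception):
    """Raised to abandon parsing once <head> (or <body>) has been seen."""


class _ProfileSniffer(HTMLParser):
    """Reads only far enough to collect the profile URIs on <head>."""

    def __init__(self) -> None:
        super().__init__()
        self.profiles: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # The profile attribute is a whitespace-separated list of URIs.
            self.profiles = (dict(attrs).get("profile") or "").split()
            raise _StopSniffing
        if tag == "body":
            # No <head> seen before <body>: nothing to sniff.
            raise _StopSniffing


def sniff_profiles(html_text: str, limit: int = 2048) -> List[str]:
    """Step 1: look only at the start of the document.

    A <head> that starts beyond `limit` characters is simply not found.
    """
    sniffer = _ProfileSniffer()
    try:
        sniffer.feed(html_text[:limit])
    except _StopSniffing:
        pass
    return sniffer.profiles


def extend_store(
    html_text: str,
    grddl: Callable[[str], List[Triple]],       # stand-in for a GRDDL run
    rdfa_parse: Callable[[str], List[Triple]],  # stand-in for an RDFa parser
) -> Tuple[str, List[Triple]]:
    """Return which route produced triples, and the triples themselves."""
    # Step 1: cheap sniff; only run GRDDL if the profile is declared.
    if GRDDL_PROFILE in sniff_profiles(html_text):
        triples = grddl(html_text)
        if triples:
            return "grddl", triples
    # Step 2: expensive; parses the whole document, possibly for nothing.
    triples = rdfa_parse(html_text)
    if triples:
        return "rdfa", triples
    # Step 3: give up and just show an external link in the user interface.
    return "external-link", []
```

With stand-ins such as `lambda text: []` for both callables, a plain HTML page
ends up as ("external-link", []), but only after step 2 has gone through the
whole text, which is exactly the cost I am worried about.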

My issue is simply the *efficiency* of step #2. A tool like tabulator
already suffers from efficiency problems (o.k., it is a student project at
MIT, and stuff like that, but we should not underestimate the issue). If it
wants to be a tool recognizing RDFa content, I do not think it can do
anything else than #2, but that may lead to a large amount of
unnecessary parsing [there is a whole discussion on whether seeAlso is
the best tool for such effects, but let us leave that aside].

That is why I think a non-obligatory 'marker' is or may be useful. At
the moment, I do not see any other solution than a profile.
Alternatively, people will use the GRDDL mechanism to retrieve RDFa and
that is all that a tabulator would accept...

*I see the danger*, do not get me wrong! That would mean that, though
the marker is optional, people would feel it necessary in order to have a
tool like tabulator work properly. But one could also have an optimization
flag in tabulator, i.e., by default it would do #2 above, but with the flag
set it would switch to sniffing the profile first.
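
A small sketch of how I read such a flag, in Python. Everything here is an
assumption for illustration: the marker URI is hypothetical (no such profile
has been agreed), and so are the function and parameter names. The function
only decides whether the expensive full parse is worth running, given the
profile URIs already sniffed from the <head>.

```python
from typing import Iterable

# Hypothetical marker URI, for illustration only; nothing like it is defined.
RDFA_MARKER = "http://example.org/hypothetical-rdfa-profile"


def needs_full_rdfa_parse(profiles: Iterable[str], optimize: bool = False) -> bool:
    """Decide whether to run the full RDFa parse (step #2).

    profiles: profile URIs already sniffed from the start of the document.
    optimize: the hypothetical flag; off means "always parse".
    """
    if not optimize:
        # Default behaviour: every (X)HTML document is treated as RDFa.
        return True
    # Optimized behaviour: parse only documents that carry the marker.
    return RDFA_MARKER in set(profiles)
```

With the flag off, unmarked documents are still parsed, so the marker stays
optional. With the flag on, RDFa in unmarked documents is missed, which is
the danger mentioned above.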

Clearly, we are brainstorming here, and I am trying to find a solution.
My immediate concern, as I guess is yours, too, is to provide for a
rapid mechanism for the usage of RDFa...

Cheers

Ivan

Mark Birbeck wrote:
> Hi Ivan,
> 
> This is an interesting issue, and the problem has nothing to do with
> modules, XHTML 1.2, XHTML 2, or anything like that. The key difficulty
> is that RDFa has been specifically designed to beef up the metadata
> features that HTML already has, and as a consequence, all HTML
> documents are already RDFa-compliant.
> 
> Take something like this (in HTML):
> 
>  <head>
>    <title>My site</title>
>    <link rel="next" href="...">
>    <link rel="previous" href="...">
>  </head>
> 
> This tells us that the current document has 'next' and 'previous'
> documents, and is simple, standard, HTML. Now, there's no reason at
> all why some processing software shouldn't store the following
> information about that document:
> 
>  <> h:next <...> .
>  <> h:previous <...> .
> 
> Now we can use SPARQL to find all documents that refer to some other
> document, and even documents that are the last in a chain. The fact
> that this document 'contains' RDFa is down to the processor, and not
> down to the author--it's in the eye of the beholder :).
> 
> I've said this before, and it's generally been met with the claim that
> we're 'hijacking' people's data. Hopefully, an example like this shows
> that we're certainly not 'forcing' documents to be RDFa when people
> don't want them to be; what we're doing is saying that HTML documents
> already have metadata, and RDFa defines some rules about how to treat
> that metadata from an RDF standpoint. The fact that these rules are
> entry-level RDFa is of course fortuitous, since it means that you can
> also add far richer metadata later on.
> 
> So, what I'm interested to hear is a use case for something that
> indicates the presence of RDFa in a document. It's pointless having it
> *in* the document, since as I've shown, an RDFa parser is not going to
> 'fail' if it processes an HTML document with limited metadata, so you
> could just process all documents that way.
> 
> But you could say that we don't want to process such documents, and
> therefore the indicator would need to be outside, since you want to
> save the cost of retrieval. (If you have to retrieve it to find out
> whether to process it, as Ben says you might as well just go ahead and
> process it.) But then you're into the problem that FoaF has--how do
> you bootstrap the whole thing? Do we maintain a list of
> RDFa-conformant documents?
> 
> To put this another way--indicating that a document 'is' RDFa is
> pointless, since all documents 'just are', which means that any
> indicator we devise is only playing the role of pointing out that some
> document was intended to be part of some community of
> specially-prepared documents that have been crafted to contain useful
> metadata. That may or may not be useful--I couldn't say, but I just
> wanted to clarify that some stamp of approval is not the same as
> indicating the class of the document (the latter being unnecessary).
> 
> All the best,
> 
> Mark
> 
> 
> On 23/11/06, Ivan Herman <ivan@w3.org> wrote:
> 
>> Hm. If we want a quick usage and spread of RDFa, then this may not be
>> fully satisfactory at least in my view. Nobody knows when XHTML 1.2 will
>> be published as a Rec, let alone XHTML 2.0 (the group's charter has just
>> been sent to the AC, i.e., there is no group yet!). What happens in the
>> meantime?
>>
>> My hope is that the XHTML1.x RDFa module, as well as the final technical
>> spec, will be published way before the full XHTML1.2, and that we can
>> start using RDFa big time and quickly. Using a (possibly optional)
>> profile tag might help that.
>>
>> Of course, we could rely on GRDDL and, say, Fabien's XSLT script [as an
>> aside: we should have a clear test set; Fabien's script, for example,
>> does not produce the same result as Elias' one, I think there are
>> missing features...]. However, if I take an environment like Redland,
>> that means that it would have to go and execute an 'outsider' script
>> every time it wants to retrieve RDFa content (which also means that it
>> would not work off-line) whereas if it knew via a profile that this is
>> RDFa, it could parse the file right away and locally.
>>
>> Bottom line: I am still not convinced:-(; and I do not see harm in
>> declaring a separate profile...
>>
>> Ivan
>>
>> Ben Adida wrote:
>> > Ivan,
>> >
>> > Sorry for the delayed response here.
>> >
>> > RDFa is meant to be a natural part of XHTML. In other words, declaring a
>> > document to be XHTML 1.2 or 2.0 is enough to make a parser look for
>> > RDFa. This may be done by specifying a GRDDL profile in the XHTML 1.2
>> > and 2.0 namespace documents.
>> >
>> > Of course, parsers may choose to be more promiscuous than that and look
>> > inside XHTML 1.1 and 1.0 if they so choose...
>> >
>> > -Ben
>> >
>> > Ivan Herman wrote:
>> >
>> >>This may have been discussed before, in which case apologies. I have not
>> >>seen a reference to it in the latest draft.
>> >>
>> >>The question: how does one discover that an XHTML file is 'RDFa-d'? The
>> >>issue struck me as a result of some discussions lately around the
>> >>Tabulator[1] and Chris Bizer's announcement[2]. In both cases one can
>> >>see engines that are able to make an indirect step, so to say; i.e., they
>> >>get a URI to a traditional site, but they can deduce the presence of
>> >>corresponding RDF data which they can add to the graph they build and
>> >>explore. Examples are the <link> references to RDF data, or the GRDDL
>> >>profile.
>> >>
>> >>Hence the question again: how does an automatic procedure 'know' that an
>> >>XHTML file contains RDFa encoded extra RDF data? Of course, a processor
>> >>could RDFa process *all* XHTML files it gets hold of, but it may be worth
>> >>adding some standard notification. Also, if such identification was
>> >>around, the same URI could be used both for human consumption and for an
>> >>RDFa-aware RDF environment.
>> >>
>> >>One could think of a profile attribute or of some sort of special,
>> >>predefined <link>... whichever. Something would be good.
>> >>
>> >>Any thoughts?
>> >>
>> >>Ivan
>> >>
>> >>
>> >>[1] http://dig.csail.mit.edu/breadcrumbs/node/165
>> >>[2] http://lists.w3.org/Archives/Public/semantic-web/2006Oct/0065.html
>> >>
>> >
>> >
>>
>> -- 
>>
>> Ivan Herman, W3C Semantic Web Activity Lead
>> URL: http://www.w3.org/People/Ivan/
>> PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>
>>
>>
> 
> 

-- 

Ivan Herman, W3C Semantic Web Activity Lead
URL: http://www.w3.org/People/Ivan/
PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Saturday, 25 November 2006 09:15:56 UTC