Re: Modified proposal for 'provenance triple', ISSUE-110 from Ivan Herman on 2011-09-01 (public-rdfa-wg@w3.org from September 2011)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 1 Sep 2011 14:17:11 +0200
To: Niklas Lindström <lindstream@gmail.com>
Cc: Gregg Kellogg <gregg@kellogg-assoc.com>, W3C RDFWA WG <public-rdfa-wg@w3.org>
Message-Id: <966406DD-EFFF-4EC6-AFAA-92E8F33472E7@w3.org>
Niklas,

forget about URIRef, named graphs, quads, etc....

As far as I know, if you have a file http://www.example.org/bla.ttl containing the following triple:

<> a:b <http://example.org> .

Then <> stands for the base of the containing Turtle file, i.e., it stands for <http://www.example.org/bla.ttl>. That Turtle file may be generated from the RDFa file http://www.example.org/bla.html, but the base URI of that RDFa file is different...

Now... I realize that if all this is done through a service returning some Turtle content, it is not clear to me what the base URI of the Turtle serialization is. But one thing is sure: it is _not_ the same as the base URI of the original RDFa file (unless, of course, a @base is put into the Turtle file explicitly)
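
To make the point concrete, here is a minimal sketch in Python. It uses the standard library's urljoin as a stand-in for the reference resolution a Turtle parser performs (an assumption: a real parser follows the IRI rules of its own spec, and the helper name resolve is made up for this illustration). The two locations are the ones from the example above.

```python
from urllib.parse import urljoin


def resolve(reference: str, base: str) -> str:
    """Resolve a (possibly empty) relative reference against a base URI."""
    return urljoin(base, reference)


# Locations taken from the example above.
RDFA_SOURCE = "http://www.example.org/bla.html"
TURTLE_OUTPUT = "http://www.example.org/bla.ttl"

# <> read from the Turtle file when it has no @base directive:
# the base is the location of the Turtle file itself.
subject_without_base = resolve("", TURTLE_OUTPUT)

# <> read from the same file when it starts with
# "@base <http://www.example.org/bla.html> ." (explicit base).
subject_with_base = resolve("", RDFA_SOURCE)

print(subject_without_base)
print(subject_with_base)
```

The empty reference only denotes the RDFa source in the second case, i.e., when the serialization carries the base explicitly.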

Ivan


On Sep 1, 2011, at 14:06 , Niklas Lindström wrote:

> Ivan, Gregg,
> 
> I'm quite sure that Gregg is correct. Ivan, you say "URI referring to
> the processor graph". But there is no predefined means of determining
> the IRI for a graph *within* a graph. Any RDF format (including RDFa)
> which only deals with triples has no means to even express what the
> "containing graph" is (in the quad sense). You may of course express
> information about the document (base) URI though.
> 
> Correct me if I'm wrong, but since the conceptual RDF model doesn't
> include quads (only reification), it isn't even currently clear what
> "graphs of graphs" are, apart from the instrumental approach taken by
> e.g. SPARQL to express how you can store and query different contexts.
> 
> Anyway, there is no special meaning in RDF/XML to rdf:about="", in
> Turtle to <>, nor in RDFa to about="" (or href="", resource=""). They
> are syntactic mechanisms of expressing an empty relative IRI, which by
> a processor turning this syntax into triples *must* (AFAIK) resolve
> against the document base to produce an absolute IRI. All these
> syntaxes have optional means of supplying this base, and processors
> should by default use the URL (commonly a http or file URI), System ID
> or similar, and also provide a means to programmatically supply the
> base URI.
> 
> So I'm a bit lost here I'm afraid, as to what you mean with <>, Ivan,
> if you *don't* mean the base URI.
> 
> .. The fact that RDFLib actually preserves URIRef("") as a kind of
> "absolute relative reference" seems like a bug, or at most an esoteric
> feature to preserve a syntactic form which doesn't represent any valid
> RDF concept.
> 
> Now, I'm not saying that the topic itself is unimportant. I've dealt
> with it a lot when storing data in quad stores -- regularly creating
> named graphs based on input document URIs, and relating the named
> graph IRI to this input source (with e.g. dc:source or
> foaf:primaryTopic). In this way, a user of an RDFa processor may store
> the resulting triples into a named graph within e.g. a quad store. And
> if an RDF API supports named graphs (and graphs of graphs), the
> resulting graph from an RDFa document can reasonably be named (with an
> IRI) and a triple be added relating this named graph to the source
> document IRI. But this mechanism of minting graph IRIs and adding data
> about them (e.g. relating them to the source document(s)) is beyond
> what RDFa should specify.
> 
> (It's not uncommon AFAIK to use the actual document IRI for this in
> SPARQL, albeit this is logically conflating the document and the
> graph.)
> 
> In any case, the RDFa syntax is a syntax for RDF triples, and not
> quads, so it cannot express facts about the relationship (if any)
> between a named graph and any of the resources described therein.
> Neither should it. Named graphs and provenance are orthogonal to all
> triple syntaxes, and should be kept separate from these.
> 
> Best regards,
> Niklas
> 
> 
> 
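
The named-graph bookkeeping described above can be sketched as follows, in plain Python with no RDF library. Everything here is an assumption made for illustration: the Quad type, the function name store_in_named_graph, the graph IRI, the expansion of dc: to the Dublin Core terms namespace, and the choice to keep the provenance triple inside the named graph itself. It is not something RDFa or any particular quad store prescribes.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: str


# Assumed expansion of the dc: prefix used in the thread.
DC_SOURCE = "http://purl.org/dc/terms/source"
DC_TITLE = "http://purl.org/dc/terms/title"


def store_in_named_graph(
    triples: Iterable[Tuple[str, str, str]],
    graph_iri: str,
    source_iri: str,
) -> Set[Quad]:
    """Put the triples into one named graph and relate that graph to its source."""
    quads = {Quad(s, p, o, graph_iri) for (s, p, o) in triples}
    # Application-level statement about the graph, outside what RDFa defines.
    quads.add(Quad(graph_iri, DC_SOURCE, source_iri, graph_iri))
    return quads


source = "http://example.org/original"        # document IRI from the thread
graph = "http://example.org/graphs/original"  # hypothetical minted graph IRI

quads = store_in_named_graph(
    [(source, DC_TITLE, "Document Title")], graph, source
)
for quad in sorted(quads):
    print(quad)
```

Minting a separate graph IRI, rather than reusing the document IRI, keeps the document and the graph distinct, which is the conflation mentioned in the parenthetical remark above.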
> On Tue, Aug 30, 2011 at 8:58 AM, Ivan Herman <ivan@w3.org> wrote:
>> 
>> On Aug 30, 2011, at 07:25 , Gregg Kellogg wrote:
>> 
>>> On Aug 29, 2011, at 5:56 AM, Ivan Herman wrote:
>>> 
>>>> After our discussion and the last telco, and subsequent emails, I would like to modify the proposal.
>>>> 
>>>> Proposal: for each RDFa source, the processor graph should contain one triple of the sort
>>>> 
>>>> - subject: URI referring to the processor graph (typically <> in Turtle, or rdf:about="" in RDF/XML, though an implementation MAY define a specific URI for that purpose)
>>>> - predicate: http://www.w3.org/ns/rdfa#hasSource (see also discussion below)
>>>> - object: the initial value of the base URI, as defined in 7.2 of the RDFa Core document
>>> 
>>> Processor Graph? I thought we had discussed placing it in the default graph.
>> 
>> I am very sorry. Yes, I meant the default graph...
>> 
>>> 
>>> As I discussed before, <> or @about="" end up resolving to the document's IRI or html>head>base, as they describe relative IRIs. It seems that what we need is an empty IRI output, so that another processor encountering a serialization of the original document will see that the document at a new IRI continues to describe the original location. Consider the following:
>>> 
>>> <html>
>>>   <head>
>>>     <base href="http://example.org/original"/>
>>>   </head>
>>>   <body about="">
>>>     <p property="dc:title">Document Title</p>
>>>   </body>
>>> </html>
>>> 
>>> This will generate the following:
>>> 
>>> @base <http://example.org/original> .
>>> <> dc:title "Document Title" ; rdfa:hasSource <> .
>> 
>> Well... if this is the way you generate it, then of course there is an issue. But that is a serialization problem. On the RDF concept level there is no such thing as a relative URI, only absolute ones. Without the @base Turtle directive, this code reads
>> 
>> <http://example.org/original> dc:title "Document Title" ; rdfa:hasSource <http://example.org/original> .
>> 
>> which is of course not what you would generate but, instead
>> 
>> <http://example.org/original> dc:title "Document Title" .
>> <> rdfa:hasSource <http://example.org/original> .
>> 
>> This just shows that the usage of @base _in the serialization_ might indeed be misleading.
>> 
>> 
>> 
>>> 
>>> What you might want instead would be the following:
>>> 
>>> <> rdfa:hasSource <http://example.org/original> .
>>> <http://example.org/original> dc:title "Document Title" .
>>> 
>>> The problem is that, as soon as the document is parsed, <> is given an actual URI (the base of the document being parsed), so I don't quite see how we accomplish this.
>>> 
>>>> I have chosen the simplest possible way for the predicate URI, namely to define one for ourselves, which may not be the best. Ideas that came up during the discussion
>>>> 
>>>> - powder:describedby: but is it correct that the RDF content 'describes' the HTML content? That may not necessarily be the case; it may give additional data that is not in the HTML
>>>> 
>>>> - foaf:primaryTopic (Virtuoso seems to use that): "property relates a document to the main thing that the document is about.", says the foaf spec; this is, in my view, closer than powder:describedby
>>> 
>>> I think this is most appropriate.
>> 
>> As I said, I am not 100% happy with this, but I can live with it:-)
>> 
>> 
>> Cheers
>> 
>> Ivan
>> 
>> 
>>> 
>>>> - dcterms has a provenance property, but its range is defined as a 'ProvenanceStatement', which would then create (via RDFS) an extra type information on the original data, and I do not think that is fine
>>>> 
>>>> - The provenance vocabulary (http://purl.org/net/provenance/ns#) also has some predicates but, just like dcterms, it contains a number of range specifications that yield extra types on the original base URI. I am not sure that is o.k. If we disregard that, then prv:accessedResource is probably the best one[1]; it generates type information of 'internet Resource'[2], which is fairly harmless. The problem is whether prv is stable enough for a Rec, though.
>>>> 
>>>> - The draft of the provenance model of the Prov WG[3] seems to have a hasOriginalSource predicate (in section 6.4), but I am not sure whether this is stable.
>>>> 
>>>> 
>>>> The stable thing is to use our own predicate, and maybe define a sub-property relationship later when the provenance WG's terms gel. Alternatively, we can ask the Prov WG for their advice. I can live with primaryTopic, but it does not feel _really_ right either.
>>>> 
>>>> Ivan
>>>> 
>>>> 
>>>> 
>>>> [1] http://trdf.sourceforge.net/provenance/ns.html#accessedResource
>>>> [2] http://ontologydesignpatterns.org/ont/web/irw.owl#WebResource
>>>> [3] http://dvcs.w3.org/hg/prov/raw-file/default/model/ProvenanceModel.html
>>>> 
>>>> ----
>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>> Home: http://www.w3.org/People/Ivan/
>>>> mobile: +31-641044153
>>>> PGP Key: http://www.ivan-herman.net/pgpkey.html
>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>> 
>>> 
>> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Thursday, 1 September 2011 12:17:37 UTC