Re: dataset syntax metadata from Sandro Hawke on 2012-09-26 (public-rdf-wg@w3.org from September 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Wed, 26 Sep 2012 14:05:22 -0400
To: Lee Feigenbaum <lee@thefigtrees.net>
CC: RDF WG <public-rdf-wg@w3.org>
Message-ID: <506343E2.2070309@w3.org>

On 09/26/2012 01:54 PM, Lee Feigenbaum wrote:
> On 9/26/2012 1:48 PM, Sandro Hawke wrote:
>> On 09/26/2012 12:50 PM, Lee Feigenbaum wrote:
>>> I'm not sure if this is at all helpful input, but here's how we handle
>>> metadata -- in general -- in Anzo. Pat, you may avert your eyes
>>> because the semantics are inconsistent at best.
>>>
>>
>> :-)  Thanks for the details...
>>
>>> A couple of "regular" named graphs
>>>
>>> <p1> { <p1> a ex:Person ; foaf:name "Lee" ...  }
>>> <p2> { <p2> a ex:Person ; foaf:name "Lynn" ... }
>>>
>>> Named graphs have corresponding "metadata" graphs
>>>
>>> <mdg1> { <mdg1> a anzo:MetadataGraph . <p1> a anzo:NamedGraph ;
>>> anzo:hasMetadataGraph <mdg1> ; anzo:createdBy ... ;
>>> anzo:lastModifiedBy ... ; anzo:lastModifiedAt ... ; ... }
>>> <mdg2> { <mdg2> a anzo:MetadataGraph . <p2> a anzo:NamedGraph ;
>>> anzo:hasMetadataGraph <mdg2> ; anzo:createdBy ... ;
>>> anzo:lastModifiedBy ... ; anzo:lastModifiedAt ... ; ... }
>>>
>>> We also have first-class datasets, that are represented roughly like:
>>>
>>> <ds1> { <ds1> a anzo:Dataset ; anzo:hasDefaultGraph <p1> ;
>>> anzo:hasNamedGraph <p1>, <p2> }
>>>
>>> Of course, <ds1> is also a regular named graph, so there's a
>>> corresponding metadata graph with metadata about the dataset:
>>>
>>> <mdg3> { <mdg3> a anzo:MetadataGraph . <ds1> a anzo:NamedGraph ;
>>> anzo:hasMetadataGraph <mdg3> ; anzo:createdBy ... ;
>>> anzo:lastModifiedBy ... ; anzo:lastModifiedAt ... ; ... }
>>>
>>> Among other things, we use these datasets directly within SPARQL by
>>> extending SPARQL with a FROM DATASET clause:
>>>
>>> SELECT ...
>>> FROM DATASET <ds1>
>>> WHERE { ... }
>>>
>>> ...which would be equivalent in this example to
>>>
>>> SELECT ...
>>> FROM <p1>
>>> FROM NAMED <p1>
>>> FROM NAMED <p2>
>>> WHERE { ... }
>>>
>>> When we import TriG, we generally are just doing either a replace or
>>> an add on the data in the named graphs in the TriG file. We generally
>>> don't automatically create anzo:Dataset's based on the contents of a
>>> particular TriG file. Instead, if we were exporting and then importing
>>> a dataset, we'd just include the <ds1> graph in our export so we'd
>>> have it back again in an import in the future.
>>>
>>> Regarding your question (a), Sandro, you can always find the metadata
>>> graph for a particular graph (including a dataset graph) simply by
>>> querying for the anzo:hasMetadataGraph triple.
>>>
>>
>> What if I put some anzo:hasMetadataGraph triples in my [other-vendor]
>> SPARQL system, then told Anzo to incorporate that data into my corporate
>> processing system.   That could really confuse the system, right?
>
> It could. I think we block that and some other system-managed 
> predicates at import. But really, unless it's malicious, there's no 
> cause for someone to do that. (So we protect against the malicious 
> case, and don't concern ourselves with the incidental case that is 
> highly unlikely.)
>
>> In your commercial environment I guess that's not a big problem -- you
>> can just say "well, don't do that!".    Or do you  support the idea of
>> from-the-wild data feeds, which are then filtered and queried? What if
>> some of those accidentally or maliciously had hasMetadataGraph triples
>> in them?
>
> I guess I already answered this -- we protect against the malicious 
> case and the accidental case just... doesn't happen. It's the social 
> benefits of naming things with URIs -- you can be pretty sure that if 
> two people are using the same URI in good faith that they mean the 
> same thing.
>

Right, but...

> Lee
>
>> I suppose you could block those on import, but that
>> wouldn't work for other use cases, where you're trying to exchange
>> datasets with metadata.
>

For instance, exchanging the results of a web crawl.    That might quite 
reasonably, non-maliciously contain hasMetadataGraph triples inside the 
graphs; meanwhile, the crawler needs to communicate metadata to the 
client.      This is where we need @meta or use-the-default-graph or 
something, yes?
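
To make the collision concrete, here is a rough sketch in Python of a
block-at-import filter applied to crawl output. It assumes quads are plain
(subject, predicate, object, graph) tuples. The predicate URI, the example.org
names, and import_quads are made up for illustration; this is not a
description of what Anzo actually does.

```python
# Hypothetical URI: the thread only gives the prefixed name anzo:hasMetadataGraph.
HAS_METADATA_GRAPH = "http://example.org/anzo#hasMetadataGraph"
SYSTEM_PREDICATES = {HAS_METADATA_GRAPH}


def import_quads(quads, blocked=SYSTEM_PREDICATES):
    """Split incoming (s, p, o, g) quads into (kept, dropped) lists.

    A quad is dropped when its predicate is in the blocked set, which is
    the "block system-managed predicates at import" idea from the thread.
    """
    kept, dropped = [], []
    for quad in quads:
        _, predicate, _, _ = quad
        (dropped if predicate in blocked else kept).append(quad)
    return kept, dropped


crawl = [
    # Found in the wild: a crawled graph that happens to use the predicate.
    ("http://example.org/p1", HAS_METADATA_GRAPH,
     "http://example.org/mdg1", "http://example.org/crawled/1"),
    ("http://example.org/p1", "http://xmlns.com/foaf/0.1/name",
     "Lee", "http://example.org/crawled/1"),
    # Asserted by the crawler itself, about the graph it fetched.
    ("http://example.org/crawled/1", HAS_METADATA_GRAPH,
     "http://example.org/crawl-meta", "http://example.org/crawl-meta"),
]

kept, dropped = import_quads(crawl)
```

The filter drops both hasMetadataGraph quads, because looking at the
predicate alone cannot tell the crawled data from the crawler's own metadata.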

     -- Sandro

>>
>>> Anyway, for what it's worth.
>>>
>>
>> It is nice to be grounded in reality.   Plus, Anzo is cool.
>>
>>        - s
>>
>>> Lee
>>>
>>> On 9/26/2012 8:53 AM, Sandro Hawke wrote:
>>>> I'm surprised at some of the responses about the metadata questions
>>>> in my "Dataset Syntax - checking for consensus" email [1].
>>>>
>>>> When people publish RDF for real, don't they usually put some triples
>>>> in it which indicate who created it, when it was created, and maybe
>>>> why?   Maybe some folks don't do this, but many people consider this
>>>> an essential practice.   My sense is that every computer format
>>>> either has a metadata mechanism built into it, or one somehow gets
>>>> hacked in later (like the javadoc conventions). In a few cases (like
>>>> the Adobe formats) that metadata is expressed in RDF.
>>>>
>>>> When people publish an RDF dataset, aren't they going to want to do
>>>> the same thing?
>>>>
>>>> Yes, sometimes you can just throw that metadata into a named graph,
>>>> but what if (a) you don't get a chance to tell the consumer which
>>>> named graph you put it in, and (b) some named graphs are
>>>> opaque/untrusted, perhaps because they contain old information or
>>>> information from other sources (e.g. a web crawl).    (While these might
>>>> not be the cases you work with, it seems to me they'll be quite
>>>> common if this syntax ever catches on.)
>>>>
>>>> Folks who are not convinced we need a metadata mechanism -- how do
>>>> you imagine solving this problem?  How can someone reading a
>>>> serialized dataset figure out which triples are the metadata?
>>>>
>>>>       -- Sandro
>>>>
>>>>
>>>>
>>>> [1] http://lists.w3.org/Archives/Public/public-rdf-wg/2012Sep/0249.html
>>>>
>>>>
>>>
>>>
>>
>>
>
Received on Wednesday, 26 September 2012 18:05:31 UTC