Re: Fwd: Plans for SEIS and Reportnet 2010 - version 1.0 from Søren Roug on 2009-10-16 (public-egov-ig@w3.org from October 2009)

From: Søren Roug <soren.roug@eea.europa.eu>
Date: Fri, 16 Oct 2009 08:56:58 +0200
To: Thomas Bandholtz <thomas.bandholtz@innoq.com>
CC: Antoine Isaac <aisaac@few.vu.nl>, Michael Lutz <michael.lutz@jrc.ec.europa.eu>, Stefan Jensen <Stefan.Jensen@eea.europa.eu>, eGovIG IG <public-egov-ig@w3.org>, "johannes.peterseil@umweltbundesamt.at" <johannes.peterseil@umweltbundesamt.at>
Message-ID: <4AD8193A.9090803@eea.europa.eu>
Hello Thomas,

You are reading the example too literally. That's not a problem for the 
intended audience, who don't know in detail how the semantic web works. 
I had to remove the beginning of the URLs to fit the table on the page. 
All the subjects, predicates, objects and sources are /full/ URLs. The 
problem that I'm trying to describe is that the Austrian 
agency posts updates to the station data every once in a while. Maybe 
they move it, or they upgrade its capabilities. The point is that we 
have to be able to deal with conflicting information for just about 
every resource we have in the system. When we import a resource into CR 
we don't know if it is obsolete or not, so we import everything. We need 
named graphs to keep track of the mess. (In the meantime I've read up on 
named graphs.)
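
To illustrate the bookkeeping described above, here is a minimal Python 
sketch of a store that keeps the source next to every triple, so that 
conflicting statements can coexist and a reharvest can replace exactly 
what one source contributed. The class name QuadStore, the example.org 
URLs and the altitude predicate are all invented for illustration; this 
is not how CR is actually implemented.

```python
from collections import defaultdict


class QuadStore:
    """Toy store of (subject, predicate, object) triples grouped by source URL."""

    def __init__(self):
        # source URL -> set of (subject, predicate, object) from that source
        self._by_source = defaultdict(set)

    def harvest(self, source, triples):
        """(Re)harvest a source: discard its old triples, keep only the new ones."""
        self._by_source[source] = set(triples)

    def values(self, subject, predicate):
        """Return {object: sorted list of sources asserting it}."""
        found = defaultdict(list)
        for source, triples in self._by_source.items():
            for s, p, o in triples:
                if s == subject and p == predicate:
                    found[o].append(source)
        return {o: sorted(srcs) for o, srcs in found.items()}


# Invented URLs; real ones would come from the reporting agency.
STATION = "http://example.org/airbase/stations/322301"
ALTITUDE = "http://example.org/schema#altitude"
OLD = "http://example.org/2008/stations.xml"
NEW = "http://example.org/2009/stations.xml"

store = QuadStore()
store.harvest(OLD, [(STATION, ALTITUDE, "210")])
store.harvest(NEW, [(STATION, ALTITUDE, "215")])
before = store.values(STATION, ALTITUDE)   # both values, each with its source

store.harvest(OLD, [])                     # the old file was withdrawn
after = store.values(STATION, ALTITUDE)    # only the 2009 value remains
```

Grouping by source is what makes the reharvest cheap: the store never 
has to decide which statement is obsolete, it only has to know where 
each one came from.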



On 15-10-2009 23:30, Thomas Bandholtz wrote:
> Hi Søren & Antoine,
>
> I do not think this needs named graphs, and it isn't really a 
> quadruple (at least logically).
> Sure, Søren's database table example has /four/ columns, but two of 
> them (Source *stations.xml* and Subject *#322301*) together try to 
> describe a resource.
>
> Why not identify this resource as "*stations.xml#322301*" in a single 
> column?
> This comes closer to a URI reference of the Subject.
>
> But what kind of source is identified by "stations.xml"?
> There may be millions of files named "stations.xml" all over the world 
> (and how many of them on Søren's file server?)
> In Søren's document [1] we can read on page 7 as an example, that 
> "Austria reports two new Airbase stations. .. They could store the 
> information in an XML file called stations.xml".
> I guess this is some /agency/ from Austria, so the resource may be 
> described as something like "*someAgency.at/airBaseStations.xml#322301*".
>
> Furthermore, they report /new/ stations, so there might be multiple 
> historical versions of stations.xml.
> Resource now becomes "*someAgency.at/2009/airBaseStations.xml#322301*".
>
> Well, this is not yet a URI, the protocol is missing. So I just name it:
> "*http://someAgency.at/2009/airBaseStations.xml#322301*".
>
> (For some reasons I would prefer to name it 
> "http://someAgency.at/2009/airBaseStations/322301", but this is not 
> important here).
>
> Well, this does not resolve, I just invented it.
> Actually, a URI reference of an RDF resource does not need to resolve.
> This is something we expect from linking open data, but not from RDF 
> in general.
>
> As Søren is writing about /plans/ for 2010, we might say that this 
> Austrian agency should make plans to publish such stations using 
> resolving URIs in 2010.
>
> You may say this plan is not realistic, but we should express it.
> Otherwise we get stuck in the inherited architecture.
> And maybe it /is/ realistic; I discussed it with some people from 
> umweltbundesamt.at just one week ago ...
>
> What I want to say is: if you identify the resource by a single URI 
> reference then the table example would not need four columns, but only 
> three.
> And now the Subject identifier even includes the providing agency and 
> a version, just like any URI reference in the Semantic Web should.
>
> The named graph pattern in this case makes everything more complicated 
> than it is.
>
> Something related about EEA-GEMET-JRC:
> Søren says they established API and replication two years ago because 
> they did not know better at that time.
> Today Søren knows better. Why then do you make plans based on this 
> deprecated architecture?
> Such plans will get you embroiled ever deeper in redundancy and 
> replication. What a mess!
> Why don't you make plans to establish a linked data architecture for a 
> federated vocabulary EEA-GEMET-JRC?
>
> Best wishes & regards,
> Thomas
>
> [1] 
> http://svn.eionet.europa.eu/repositories/Reportnet/docs/Plans%20for%20SEIS%20and%20Reportnet%202010.odt 
>
>
> Antoine Isaac schrieb:
>> Hello Søren,
>>
>>
>>> I like these questions. They force me to sharpen my arguments, and 
>>> they give ideas to improve our plans.
>>
>> Great! Thanks for the answers, it's very interesting to hear.
>>
>>
>>> Yes, that's correct. We have a source on every triple. They serve 
>>> two purposes. We know which triples to throw out when we do a 
>>> reharvesting of a source. And they can be used to determine 
>>> trustworthiness of a statement the same way a user with a web browser 
>>> looks at the webpage's URL. I'm not very pleased with the last 
>>> purpose, but it seems to be the only mechanism people can't lie 
>>> about. I've heard about named graphs, but haven't figured out their 
>>> purpose yet.
>>
>> Well, they're kind of multi-purpose. Practically, they transform 
>> triples into quadruples, and thus make it possible to track the source 
>> of statements.
>> The problem is that they're not part of the official set of semantic 
>> web standards, even though they are mentioned in SPARQL [1] and are 
>> implemented in one form or another in almost all RDF stores.
>>
>> About the more general provenance issue, you might be interested in 
>> following the work of the W3C provenance group, which has just been 
>> created [2] (nothing there, yet). I'd expect them to formalize some 
>> interesting practices on those aspects...
>>
>>
>>> We have our own database structure, we add inferred triples and as 
>>> long as we only have subPropertyOf and subClassOf, it is manageable. 
>>> We don't get an explosion of triples. But we know we're getting to 
>>> the end of its capacities and yesterday we launched a study of 
>>> Virtuoso and Jena. If they don't work, we'll look at some more.
>>
>> I hope you'll find a good one! By the way, if you've not found them 
>> yet, there are some benchmarks available, like [3]. They are 
>> very context-specific, though; I'd be curious to know whether such 
>> stuff is actually useful to real implementors.
>>
>>
>>>
>>> I would actually prefer to be able to launch a distributed SPARQL 
>>> query that automatically understood sameAs statements across 
>>> servers, but I've not seen anybody advertising SPARQL as being able 
>>> to do that. 
>>
>>
>> Yes, for the moment it would be up to SPARQL endpoints to perform the 
>> appropriate distribution and aggregation of results. Needless to say, 
>> the efforts in that field (I know that [4] are working on this) are at 
>> a very early stage.
>>
>>
>>> Additionally, some of our member organisations are writing their RDF 
>>> in notepad. I don't think I could get them to set up a SPARQL 
>>> service. We're also aware of the principle of following a resource 
>>> URL with your web browser to see a factsheet of the resource, but it 
>>> doesn't really work because we're mainly interested in bulk 
>>> operations on resources.
>>
>> OK!
>>
>> Best,
>>
>> Antoine
>>
>> [1] http://www.w3.org/TR/rdf-sparql-query/#rdfDataset
>> [2] http://www.w3.org/2005/Incubator/prov/
>> [3] http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/
>> [4] http://www.openrdf.org/
>
>
> -- 
> Thomas Bandholtz,thomas.bandholtz@innoq.com,http://www.innoq.com
> innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany
> Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491
>
Received on Friday, 16 October 2009 08:39:32 UTC