RE: Fwd: Plans for SEIS and Reportnet 2010 - version 1.0 from Søren Roug on 2009-10-19 (public-egov-ig@w3.org from October 2009)

From: Søren Roug <Soren.Roug@eea.europa.eu>
Date: Mon, 19 Oct 2009 11:02:50 +0200
To: Thomas Bandholtz <thomas.bandholtz@innoq.com>
CC: Antoine Isaac <aisaac@few.vu.nl>, Michael Lutz <michael.lutz@jrc.ec.europa.eu>, Stefan Jensen <Stefan.Jensen@eea.europa.eu>, eGovIG IG <public-egov-ig@w3.org>, "johannes.peterseil@umweltbundesamt.at" <johannes.peterseil@umweltbundesamt.at>, Chris Bizer <chris@bizer.de>
Message-ID: <ECDBB484CB145E429AAB0661F935E4E301E679D38330@SLOTHMAIL.eea.eu.int>
Hello Thomas,

I have already begun to develop very similar practices for data that have a reference object, like the stations example below. I found out that just heaping new data on top of old made a mess. In my newer practice the list of stations is maintained by a central authority that publishes the list with the rdf:type :Station. The national station records are then of the type :StationDeclaration and each declaration has an "invented" id. If the station (e.g. 32301) is known on the central list, the declaration will have a :declarationFor http://air-climate.eionet.europa.eu/stations.rdf#AT32301.
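
A minimal sketch of this pattern in Python, with triples held as plain tuples. Only :Station, :StationDeclaration, :declarationFor and the station URL come from the text above; the declaration id `ex:decl-0001`, the :name value and the helper `declarations_for` are hypothetical.

```python
# Triples as (subject, predicate, object) tuples; no RDF library is needed.
CENTRAL = "http://air-climate.eionet.europa.eu/stations.rdf#"

triples = {
    # Entry on the central list, published by the central authority.
    (CENTRAL + "AT32301", "rdf:type", ":Station"),
    # National declaration with an "invented" id (hypothetical value).
    ("ex:decl-0001", "rdf:type", ":StationDeclaration"),
    ("ex:decl-0001", ":declarationFor", CENTRAL + "AT32301"),
    ("ex:decl-0001", ":name", "Karlsplatz"),
}


def declarations_for(station, graph):
    """Return the ids of all declarations that point at `station`."""
    return sorted(s for (s, p, o) in graph
                  if p == ":declarationFor" and o == station)


print(declarations_for(CENTRAL + "AT32301", triples))
```

Because every declaration keeps its own id, a later declaration for the same station is simply one more entry in this list and never overwrites an earlier one.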

Whether I can use these principles for every dataflow, I still have to investigate. In my opinion, my next step should be to write a best practices tutorial for the type of data we handle.

--
Sincerely yours / Med venlig hilsen, Søren Roug <soren.roug@eea.europa.eu>
European Environment Agency, Kongens Nytorv 6, DK-1050 Copenhagen K
Tel: +45 2368 3660 Jabber: roug@jabber.eea.europa.eu
This email was delivered using 100% recycled electrons. Please try to keep it that way.



From: Thomas Bandholtz [mailto:thomas.bandholtz@innoq.com]
Sent: 18. oktober 2009 12:19
To: Søren Roug
Cc: Antoine Isaac; Michael Lutz; Stefan Jensen; eGovIG IG; johannes.peterseil@umweltbundesamt.at; Chris Bizer
Subject: Re: Fwd: Plans for SEIS and Reportnet 2010 - version 1.0

Søren & Antoine,

(all this refers to http://svn.eionet.europa.eu/repositories/Reportnet/docs/Plans%20for%20SEIS%20and%20Reportnet%202010.odt )

[Søren:]

You are reading the example too literally. That's not a problem with the intended audience, who doesn't know how the semantic web works in detail. I had to remove the beginning of the URLs to fit the table on the page. All the subjects, predicates, objects and sources are full URLs.
Ok, understood. I still don't know what URIs you are using. My intention was to encode the "Source" in this URI so you would not need a fourth column. But now I prefer a different approach. I think the problem arises from the fact that you relate the dataset of a source to a "Subject" (which is kind of a ReportNet reference ID) too early. See this in detail below.

The problem that I'm trying to describe is that the Austrian agency posts updates to the station data every once in a while. Maybe they move it, or they upgrade its capabilities. The point is that we have to be able to deal with conflicting information for just about every resource we have in the system. When we import a resource into CR we don't know if it is obsolete or not, so we import everything. We need named graphs to keep track of the mess. (In the meantime I've read up on named graphs.)
I see the point. Let me derive some use cases from this (and from my own experience with measurement data):

 1.  Station data is reported "every once in a while". A station from a previous report ("Stations.xml") may simply not be mentioned in a subsequent report, and you do not know whether it has been closed or they just forgot it.
 2.  In a subsequent report the station has apparently been "moved". In your data example, a kind of location is provided only in the "name" attribute, which contains fragments of an address. I guess they also report some lat/lon indicating the location. From my experience, whenever a station is moved, it physically remains the same station, but from the statistical point of view it becomes a new station, as it now measures in a different place and so there is no longer any statistical continuity for this physical station. For this reason it is crucial to distinguish physical ids from statistical ids.
 3.  Many physical stations contain multiple sensors, each of them measuring different things. These are attributes of the physical station regardless of the location of the station. Sensors may be updated with new methods, they may be removed, and people may add new sensors.
 4.  There may be conflicting information about the same station (or sensor) coming from different agencies.
One may add more cases, but I think even this list is hard enough to manage. The most important aspect is keeping track of where each statement about a station came from, and this is what you try to solve with your fourth column "Source" (the one which extends the triple to a quad).

You can solve this using named graphs by collecting all statements from a single source in one graph or similar. But this is not the only way.
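
As a rough illustration of the named-graph variant, the sketch below stores quads as Python tuples, with the graph name recording the source. The subject id `:32301` and the helper `values_by_source` are assumptions made for this example; the source names and station names are the ones used in the examples further down.

```python
from collections import defaultdict

# Each quad is (graph, subject, predicate, object); the graph name
# records which source a statement came from.
quads = [
    (":stations2006", ":32301", ":name", "Karlsplatz"),
    (":stations2008", ":32301", ":name", "St. Pölten - Eybnerstraße"),
]


def values_by_source(subject, predicate, quads):
    """Group the values asserted for (subject, predicate) by source graph."""
    result = defaultdict(set)
    for graph, s, p, o in quads:
        if s == subject and p == predicate:
            result[graph].add(o)
    return dict(result)


names = values_by_source(":32301", ":name", quads)
for source in sorted(names):
    print(source, sorted(names[source]))
```

Both conflicting names survive, each attributed to its source, but the attribution lives in the fourth element of the quad rather than in the statements themselves.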

As you are making plans for 2010 (which is less than three months ahead), and you are looking for some stable and scalable solution ("commercial solution providers" on p 17), you might run into problems, just as Antoine has mentioned:

[Antoine about named graphs:]
"The problem is that they're not part of the official set of semantic web standards, even they are mentioned in SPARQL and are implemented in one form or the other in almost all RDF stores."
As far as I know, named graphs were proposed in 2003/2004 as a solution to the triple/quad discussion, but several ambiguities in the proposal have not been clarified since. Work on implementations has continued, such as NG4J by Chris Bizer, but even http://www4.wiwiss.fu-berlin.de/bizer/ng4 refers to a W3C web page showing a timeline that ends in November 2004, and to the "Named Graphs, Provenance and Trust" paper from 2005 (which I like).
From my point of view, it is debatable whether something in this state of standardisation should become a foundation of the ReportNet approach to the Semantic Web in 2010.

You may use named graphs anyway, but as far as I understand your case this can also be solved using RDF triples only.

I will try to give some examples based on these four principles:

 *   give every dataset found in any source a single ID
 *   link from this ID to the source so you can track back where it came from
 *   add mappings between such datasets to clarify whether they mean the same, something related, or are in conflict.
 *   distinguish between ReportNet reference stations and stations described in any of the datasets.

Say, in 2006 you receive a stations.xml file describing a station with local code 32301 for the first time.
You may add to the registry (using Turtle syntax here):

:0001 rdf:type :station;
    :fromSource :stations2006;
    :localCode "32301";
    :name "Karlsplatz"@de-at;
    :observes gemet:636.

Note that I created a new id for this dataset (:0001, not :32301).
I added a statement pointing to the source (which may be described on its own using :stations2006 as the Subject later).
I also added a GEMET reference to indicate what is observed by this station.
GEMET might not be appropriate in this role, just take it as an example of some "reference data" (see p 5 of your document).

You then decide that this is a new station and it should be added to the ReportNet reference stations:

reportNet:4711 rdf:type :refStation;
   :reportedIn :0001.

From now on you are expecting continuous updates about this station, but you do not bind it to the local code from the report. A reference station needs an ID which is globally unique in ReportNet, and even the local code may be changed without any change in the station characteristics.

Some time later you receive a new version of stations.xml.
It contains unchanged information about the station with local ID #32301, so you may write:

:0002 rdf:type :station;
    :fromSource :stations2007;
    owl:sameAs :0001.

Note that I gave this set a different ID, :0002. As nothing has changed, owl:sameAs might be appropriate here. This will merge the statements of the two instances, so by inference this will add:

:0001 :fromSource :stations2007.

If you want to make this more explicit, you might add:

reportNet:4711 :reportedIn :0002.

Next time you receive a stations.xml where only the "location" (name) field has changed:

:0003 rdf:type :station;
    :fromSource :stations2008;
    :localCode "32301";
    :name "St. Pölten - Eybnerstraße"@de-at;
    :observes gemet:636.

The value of :name had been "Karlsplatz" before. Comparing these two sets (which should also contain a lat/lon) you may decide that the same station has been moved from "Karlsplatz" to "St. Pölten - Eybnerstraße".

In this case you may add:

:0001 :movedIn :0003.

As :0001 is owl:sameAs :0002, it can be inferred from this that :0002 :movedIn :0003.
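
To make the owl:sameAs step concrete, here is a naive sketch in plain Python that copies statements between resources linked by owl:sameAs. It is only an illustration of the inference described above, not how a real OWL reasoner is implemented; the function name `same_as_closure` is hypothetical.

```python
def same_as_closure(triples):
    """Naively copy statements between resources linked by owl:sameAs."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = {(s, o) for (s, p, o) in inferred if p == "owl:sameAs"}
        pairs |= {(o, s) for (s, o) in pairs}  # owl:sameAs is symmetric
        new = set()
        for a, b in pairs:
            for (s, p, o) in inferred:
                if s == a:
                    new.add((b, p, o))  # substitute in subject position
                if o == a:
                    new.add((s, p, b))  # substitute in object position
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred


triples = {
    (":0001", ":fromSource", ":stations2006"),
    (":0002", ":fromSource", ":stations2007"),
    (":0002", "owl:sameAs", ":0001"),
    (":0001", ":movedIn", ":0003"),
}
closure = same_as_closure(triples)
print((":0002", ":movedIn", ":0003") in closure)
print((":0001", ":fromSource", ":stations2007") in closure)
```

The loop repeats until no new statement appears, so it terminates once every statement has been copied to both ids.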

You may also decide that "moving" a station needs some clarification, so you raise a ticket about this:

:0003 :raisesTicket :8888.

:8888 rdf:type :ticket;
    :reason :hasMoved;
    :comment "Austrian station #32301 has moved from Karlsplatz to St. Pölten - Eybnerstraße".

Of course the tickets might be stored in some non-RDF issue tracker such as Jira, but you may link to this ticket anyway.

Coming back to the reference station:

reportNet:4711 :hasSomeConflictIn :0003.

Note that I raised the ticket from the reported dataset, not from the reference station, as there might be something wrong with the known dataset.

Next time you get a stations.xml that does not mention #32301 at all. You express this by saying:

reportNet:4711 :omittedIn :stations2009.

Declare :omittedIn owl:inverseOf :hasOmmittedStation, so you receive

:stations2009 :hasOmmittedStation reportNet:4711.

and you add

:stations2009 :raisesTicket  :9999.

This time the source raises the ticket, as a non-existing dataset cannot raise a ticket itself.
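
The owl:inverseOf step above can be sketched the same way in plain Python. The function name `apply_inverse_of` is hypothetical; the property and resource names are the ones used in this example.

```python
def apply_inverse_of(triples):
    """Add the statements implied by owl:inverseOf declarations."""
    inverses = {}
    for s, p, o in triples:
        if p == "owl:inverseOf":
            inverses[s] = o
            inverses[o] = s  # owl:inverseOf is symmetric
    inferred = set(triples)
    for s, p, o in triples:
        if p in inverses:
            # Swap subject and object and use the inverse property.
            inferred.add((o, inverses[p], s))
    return inferred


triples = {
    (":omittedIn", "owl:inverseOf", ":hasOmmittedStation"),
    ("reportNet:4711", ":omittedIn", ":stations2009"),
}
result = apply_inverse_of(triples)
print((":stations2009", ":hasOmmittedStation", "reportNet:4711") in result)
```

A single pass is enough here, because the inverse of an inferred statement is always a statement that was already present.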

Occasionally you find some different source from another agency in an Excel file named sensors.xls.
You can extract the following:

:0004 rdf:type :sensor;
    :fromSource :sensors2009;
    :localCode "32301";
    :name "St. Pölten - Eybnerstraße"@de-at;
    :observes gemet:51.

This looks rather similar to our example station, but it is not called a station but a sensor, and it observes a different thing.
Using GEMET, you find that gemet:636 (atmospheric pollution) is the skos:broader concept of gemet:51 (acid rain).

So you might infer that this sensor is part of the station, as both are using the same localCode.
If so, you can express this as

:0004 skos:broader :0003.

(skos:broader means "has broader") or more precisely:

:0003 :hasSensor :0004.
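
The matching rule used here (same localCode, and the sensor observes the same concept as the station or a narrower one) could be sketched as follows. The record layout, the `BROADER` lookup table and the function `candidate_sensor_links` are assumptions for illustration; the GEMET relation is taken from the text above and not checked against GEMET itself.

```python
# Concept -> its broader concept, as stated in the text above (not verified).
BROADER = {"gemet:51": "gemet:636"}

records = {
    ":0003": {"type": ":station", "localCode": "32301",
              "observes": "gemet:636"},
    ":0004": {"type": ":sensor", "localCode": "32301",
              "observes": "gemet:51"},
}


def candidate_sensor_links(records, broader):
    """Propose (station, :hasSensor, sensor) triples for matching records."""
    links = []
    for sensor_id, sensor in records.items():
        if sensor["type"] != ":sensor":
            continue
        for station_id, station in records.items():
            if station["type"] != ":station":
                continue
            same_code = sensor["localCode"] == station["localCode"]
            related = (sensor["observes"] == station["observes"]
                       or broader.get(sensor["observes"]) == station["observes"])
            if same_code and related:
                links.append((station_id, ":hasSensor", sensor_id))
    return sorted(links)


print(candidate_sensor_links(records, BROADER))
```

The result is only a list of candidates; whether a sensor really belongs to a station remains a decision for whoever maintains the reference data.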

We might vary or extend such examples for hours, but I think it is enough so far to illustrate this approach, which uses triples only, not quads or named graphs.

You may say this looks quite complicated, but to paraphrase Einstein: make everything as simple as possible, but not simpler!

What we have now:

 *   Each reported dataset is linked to its source.
 *   There are statements about (dis-)continuity and conflicts in the reporting sequence.
 *   There is linkage between reference stations and datasets from reports.
 *   There are statements about datasets from different sources that seem to refer to the same reference station.

Furthermore, you can add statements about sources and reporting agencies.
Referring to the above one may write something like:

:someAgency :providerOf (:stations2006 :stations2007 :stations2008 :stations2009);
    :isTrusted :true;
    :inGoodStanding :medium.

:anotherAgency :providerOf (:sensors2009);
    :isTrusted :undecided;
    :inGoodStanding :newbie.

Add more statements about sources:

:sensors2009 :hasFormat "XLS";
    :submissionDate "2009-10-10".

and so on.

Maybe my examples do not map exactly to your case, but hopefully I have shown some patterns that you can vary to meet your case more precisely.

For a better understanding:
What else do you need that cannot be expressed using triples?
How would you apply named graphs to express the given complexity?

Best regards,
Thomas


--

Thomas Bandholtz, thomas.bandholtz@innoq.com, http://www.innoq.com

innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany

Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491
Received on Monday, 19 October 2009 14:47:59 UTC