Søren & Antoine,
(all this refers to http://svn.eionet.europa.eu/repositories/Reportnet/docs/Plans%20for%20SEIS%20and%20Reportnet%202010.odt)
You are reading the example too literally. That's not a problem with
the intended audience, who doesn't know how the semantic web works in
detail. I had to remove the beginning of the URLs to fit the table on
the page. All the subjects, predicates, objects and sources are full
Ok, understood. I still don't know what URIs you are using. My
intention was to encode the "Source" in this URI so you would not need
a fourth column. But now I prefer a different approach. I think the
problem arises from the fact that you relate the dataset of a source
to a "Subject" (which is kind of a ReportNet reference ID) too early.
See this in detail below.
The problem that I'm trying to describe is that the Austrian
agency posts updates to the station data every once in a while. Maybe
they move it, or they upgrade its capabilities. The point is that we
have to be able to deal with conflicting information for just about
every resource we have in the system. When we import a resource into CR
we don't know if it is obsolete or not, so we import everything. We
need named graphs to keep track of the mess. (In the meantime I've read
up on named graphs.)
I see the point. Let me derive some use cases from this (and from my
own experience with measurement data ...) :
- Station data is reported "every once in a while". This may
mean that a station from a previous report ("Stations.xml") is simply
not mentioned in a subsequent report, and you do not know whether it has
been closed or they just forgot it.
- In a subsequent report the station has apparently been "moved".
In your data example, a kind of location is provided only in the "name"
attribute, which contains fragments of an address. I guess they also
report some lat/lon indicating the location. In my experience,
whenever a station is moved, it physically remains the same station,
but from the statistical point of view it becomes a new
station, as it now measures in a different place and there is no
statistical continuity of this physical station any more. For this
reason it is crucial to distinguish physical IDs from statistical IDs.
- Many physical stations contain multiple sensors, each of them
measuring different things. These are attributes of the physical station
regardless of the location of the station. Sensors may be updated with
new methods, they may be removed, and people may add new sensors.
- There may be conflicting information about the same
station (or sensor) coming from different agencies.
One may add more cases, but I think even this list is hard enough to
manage. The most important aspect is keeping track of where each
statement about a station came from, and this is what you try to solve
with your fourth column "Source" (the one which extends the triple to a
quad).
You can solve this using named graphs by collecting all statements from
a single source in one graph or similar. But this is not the only way.
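To make the contrast concrete, here is a minimal sketch of the named-graph idea in plain Python, with no RDF library involved. Each statement carries a fourth element naming its source, and the statements are grouped into one set per source. The subject ex:station32301 and the assignment of the two names to :stations2006 and :stations2008 are assumptions made for illustration only.

```python
from collections import defaultdict

# Quads: (subject, predicate, object, graph). The graph names the source.
# Subject and source assignment are hypothetical, for illustration only.
quads = [
    ("ex:station32301", ":name", "Karlsplatz", ":stations2006"),
    ("ex:station32301", ":name", "St. Pölten - Eybnerstraße", ":stations2008"),
]

def by_graph(quads):
    """Group the triples of each source into its own named graph."""
    graphs = defaultdict(set)
    for s, p, o, g in quads:
        graphs[g].add((s, p, o))
    return dict(graphs)

graphs = by_graph(quads)
```

The two conflicting names then live in separate graphs, so neither overwrites the other.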
As you are making plans for 2010 (which is less than three months
ahead), and you are looking for some stable and scalable solution
("commercial solution providers" on p 17), you might run into problems,
just as Antoine has mentioned:
[Antoine about named graphs:]
"The problem is that they're not part of the official set of
web standards, even if they are mentioned in SPARQL and are
implemented in one form or the other in almost all RDF stores."
As far as I know, named graphs were proposed in 2003/2004 as a
solution for the triple/quad discussion, but since then several
ambiguities of this proposal have not been clarified. There has been
continuous work on implementations, such as NG4J by Chris Bizer, but
also http://www4.wiwiss.fu-berlin.de/bizer/ng4 refers to a W3C Webpage
showing a timeline which ends in November 2004 and to the "Named
Graphs, Provenance and Trust" paper from 2005 (which I like).
From my point of view, it is arguable whether such a state of
standardisation should become a foundation of the ReportNet approach to
Semantic Web in 2010.
You may use named graphs anyway, but as far as I understand your case
this can also be solved using RDF triples only.
I try to give some examples based on these four principles:
- give every dataset found in any source a single ID
- link from this ID to the source so you can track back where it
- add mappings between such datasets to clarify whether they mean
the same, something related, or are in conflict.
- distinguish between ReportNet reference stations and stations
described in any of the datasets.
Say, in 2006 you receive a stations.xml file describing a station with
local code 32301 for the first time.
You may add to the registry (using Turtle syntax here):
:0001 rdf:type :station;
Note that I created a new id for this dataset (:0001, not :32301).
I added a statement pointing to the source (which may be described on
its own using :stations2006 as the Subject later).
I also added a GEMET reference to indicate what is observed by this
GEMET might not be appropriate in this role, just take it as an example
of some "reference data" (see p 5 of your document).
You then decide that this is a new station and it should be added to
the ReportNet reference stations:
reportNet:4711 rdf:type :refStation;
From now on you are expecting continuous updates about this station, but
you do not bind it to the local code from the report. A reference
station needs an ID which is globally unique in ReportNet, and even the
local code may be changed without changes of the station
Some time later you receive a new version of stations.xml.
It contains unchanged information about the station with local ID
#32301, so you may write:
:0002 rdf:type :station;
Note that I gave this set a different ID :0002. As
nothing has changed, owl:sameAs might be appropriate here. This will
merge the statements of the two instances, so by inference this will
:0001 :fromSource :stations2007.
If you want to make this more explicit, you might add:
reportNet:4711 :reportedIn :0002.
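The effect of owl:sameAs described above can be sketched in a few lines of plain Python. This is only an illustration with triples held as tuples, not a real OWL reasoner. It assumes that :0001 points to :stations2006 and :0002 to :stations2007 via :fromSource, and it copies statements across the sameAs pair in both directions.

```python
# Triples as (subject, predicate, object) tuples.
triples = {
    (":0001", ":fromSource", ":stations2006"),
    (":0002", ":fromSource", ":stations2007"),
    (":0001", "owl:sameAs", ":0002"),
}

def same_as_closure(triples):
    """Copy statements across owl:sameAs pairs until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        pairs = {(s, o) for s, p, o in inferred if p == "owl:sameAs"}
        pairs |= {(o, s) for s, o in pairs}  # owl:sameAs is symmetric
        for a, b in pairs:
            for s, p, o in list(inferred):
                if p == "owl:sameAs":
                    continue
                # Whatever is said about a also holds for b.
                if s == a and (b, p, o) not in inferred:
                    inferred.add((b, p, o))
                    changed = True
    return inferred

merged = same_as_closure(triples)
```

After the merge, both :0001 and :0002 carry both :fromSource statements, which is exactly why the explicit :reportedIn link can be useful when you still want to tell the reports apart.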
Next time you receive a stations.xml where only the "location" (name)
field has changed:
:0003 rdf:type :station;
:name "St. Pölten - Eybnerstraße"@de-at;
The value of :name had been "Karlsplatz" before. Comparing
these two sets (which should also contain a lat/lon) you may decide
that the same station has been moved from "Karlsplatz" to "St. Pölten -
In this case you may add:
:0001 :movedIn :0003.
As :0001 is owl:sameAs :0002, from this will be inferred :0002
You may also decide that "moving" a station needs some
clarification, so you raise a ticket about this:
:0003 :raisesTicket :8888.
:8888 rdf:type :ticket;
:comment "Austrian station #32301 has moved from Karlsplatz to St.
Pölten - Eybnerstraße".
Of course the tickets might be stored in some non-RDF issue tracker
such as Jira, but you may link to this ticket anyway.
Coming back to the reference station:
reportNet:4711 :hasSomeConflictIn :0003.
Note that I raised the ticket from the reported dataset not from the
reference station, as there might be something wrong with the known
Next time you get a stations.xml that does not mention #32301 at all.
You express this saying:
reportNet:4711 :omittedIn :stations2009.
Declare :omittedIn owl:inverseOf :hasOmmittedStation, so
:stations2009 :hasOmmittedStation reportNet:4711.
and you add
:stations2009 :raisesTicket :9999.
This time the source raises the ticket, as a non-existing dataset
cannot raise a ticket itself.
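The owl:inverseOf declaration above can be illustrated with a small sketch in plain Python. It is not a reasoner: it just adds the mirrored statement for every predicate that has a declared inverse. The helper name add_inverses is made up for this example.

```python
# Declared inverse pair, spelled as in the example above.
inverse_of = {":omittedIn": ":hasOmmittedStation"}

def add_inverses(triples, inverse_of):
    """Return the triples plus the mirrored statement for each inverse pair."""
    both = dict(inverse_of)
    both.update({v: k for k, v in inverse_of.items()})  # works in both directions
    mirrored = {(o, both[p], s) for s, p, o in triples if p in both}
    return set(triples) | mirrored

stated = {("reportNet:4711", ":omittedIn", ":stations2009")}
closed = add_inverses(stated, inverse_of)
```

With the inverse in place, the source :stations2009 becomes a subject in its own right, so it can carry the :raisesTicket statement.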
Occasionally you find some different source from another agency in an
Excel file named sensors.xls.
You can extract the following:
:0004 rdf:type :sensor;
:name "St. Pölten - Eybnerstraße"@de-at;
This looks rather similar to our example station, but it is not
called a station but a sensor, and it observes a different thing.
Using GEMET, you find that gemet:636 (atmospheric pollution)
is skos:broader of gemet:51 (acid rain).
So you might infer that this sensor is part of the station, as both are
using the same localCode.
If so, you can express this as
:0004 skos:broader :0003;
(skos:broader means "has broader") or more precisely:
:0003 :hasSensor :0004.
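The reasoning that leads to :hasSensor can be sketched as a simple rule in plain Python: propose the link when a station and a sensor share a local code and the sensor observes a concept whose skos:broader is what the station observes. The predicate names :localCode and :observes are assumptions, as is the assignment of gemet:636 to the station and gemet:51 to the sensor.

```python
# Triples as tuples; :localCode and :observes are hypothetical predicate names.
triples = {
    (":0003", "rdf:type", ":station"),
    (":0003", ":localCode", "32301"),
    (":0003", ":observes", "gemet:636"),
    (":0004", "rdf:type", ":sensor"),
    (":0004", ":localCode", "32301"),
    (":0004", ":observes", "gemet:51"),
    ("gemet:51", "skos:broader", "gemet:636"),
}

def objects(triples, subject, predicate):
    """All objects stated for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

def sensor_candidates(triples):
    """Propose :hasSensor links for matching station/sensor pairs."""
    stations = {s for s, p, o in triples if p == "rdf:type" and o == ":station"}
    sensors = {s for s, p, o in triples if p == "rdf:type" and o == ":sensor"}
    found = set()
    for st in stations:
        for se in sensors:
            same_code = objects(triples, st, ":localCode") & objects(triples, se, ":localCode")
            narrower = any(
                (c_se, "skos:broader", c_st) in triples
                for c_se in objects(triples, se, ":observes")
                for c_st in objects(triples, st, ":observes")
            )
            if same_code and narrower:
                found.add((st, ":hasSensor", se))
    return found

candidates = sensor_candidates(triples)
```

Such a rule only yields candidates; a human (or a ticket) should still confirm the link before it goes into the registry.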
We might vary or extend such examples for hours, but I think it is
enough so far to illustrate this approach, which uses triples only, not
quads or named graphs.
You may say this looks quite complicated, but to paraphrase Einstein:
make everything as simple as possible, but not simpler!
What we have now:
- Each reported dataset is linked to its source.
- There are statements about (dis-)continuity and conflicts in the
- There is linkage between reference stations and datasets from
- There are statements about datasets from different sources that
seem to refer to the same reference station.
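To show that this linkage is enough to trace provenance without a fourth column, here is a minimal sketch in plain Python. It follows :reportedIn from a reference station to a dataset, then :fromSource to the report, then :providerOf back to the agency. For simplicity it assumes one :providerOf triple per source rather than an RDF list, and the function name provenance is made up for this example.

```python
# Triples as tuples, using identifiers from the examples above.
triples = {
    ("reportNet:4711", ":reportedIn", ":0002"),
    (":0002", ":fromSource", ":stations2007"),
    (":someAgency", ":providerOf", ":stations2007"),
}

def provenance(triples, ref_station):
    """List (dataset, source, providers) for every report of a reference station."""
    rows = []
    for s, p, dataset in triples:
        if s != ref_station or p != ":reportedIn":
            continue
        for s2, p2, source in triples:
            if s2 == dataset and p2 == ":fromSource":
                providers = sorted(
                    a for a, p3, o in triples
                    if p3 == ":providerOf" and o == source
                )
                rows.append((dataset, source, providers))
    return sorted(rows)

trace = provenance(triples, "reportNet:4711")
```

In a real store the same walk would be a short SPARQL query over plain triples.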
Furthermore you can add statements about sources and reporting
Referring to the above one may write something like:
:someAgency :providerOf (:stations2006 :stations2007 :stations2008
:anotherAgency :providerOf (:sensors2009);
Add more statements about sources:
:sensors2009 :hasFormat "XLS";
and so on.
Maybe my examples do not exactly map to your case, but hopefully I
showed some patterns that you can vary to meet your case more precisely.
For a better understanding:
What else do you need that cannot be expressed using triples?
How would you apply named graphs to express the given complexity?
Thomas Bandholtz, email@example.com, http://www.innoq.com
innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany
Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491