Re: Wikidata export in RDF

Hi Denny,

I haven't followed the thread completely, but I see a problem when using 
IRIs instead of blank nodes for n-ary relations, namely deciding which 
triples to include when publishing Linked Data.

Consider an example I (partially) made up:

 w:Berlin s:Population Berlin:Statement1 .
 Berlin:Statement1 rdf:type o:Statement .
 Berlin:Statement1 v:Population "3499879"^^xsd:integer .
 w:Berlin s:CapitalOf w:Germany .
 w:Germany s:Population Germany:Statement1 .

A Linked Data lookup on w:Berlin should return the first four triples, 
including the triples about Berlin:Statement1, but should exclude the 
last triple. However, it is not straightforward to tell these cases 
apart, since Berlin:Statement1 and w:Germany are both plain IRIs in the 
object position.
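
To make this concrete, here is a minimal sketch in Python with rdflib 
(an assumption on my side; the namespace IRIs are also invented for 
illustration and are not the actual Wikidata namespaces). A naive 
per-subject lookup finds the links to Berlin:Statement1 and to 
w:Germany, but nothing in the graph structure says which of the two 
object IRIs should be expanded further:

from rdflib import Graph, URIRef

# The IRI variant of the example above; all namespace IRIs are made up.
iri_version = """
@prefix w:       <http://example.org/entity/> .
@prefix s:       <http://example.org/statementProp/> .
@prefix v:       <http://example.org/valueProp/> .
@prefix o:       <http://example.org/ontology/> .
@prefix Berlin:  <http://example.org/statement/Berlin/> .
@prefix Germany: <http://example.org/statement/Germany/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

w:Berlin s:Population Berlin:Statement1 .
Berlin:Statement1 a o:Statement .
Berlin:Statement1 v:Population "3499879"^^xsd:integer .
w:Berlin s:CapitalOf w:Germany .
w:Germany s:Population Germany:Statement1 .
"""

g = Graph().parse(data=iri_version, format="turtle")
berlin = URIRef("http://example.org/entity/Berlin")

# Naive Linked Data lookup: every triple whose subject is w:Berlin.
for triple in g.triples((berlin, None, None)):
    print(triple)
# Berlin:Statement1 and w:Germany both show up as plain object IRIs;
# the graph alone does not say "expand the first, stop at the second",
# so the triples describing the statement itself are left out.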

Using blank nodes, this distinction would become clear:

 w:Berlin s:Population _:x .
 _:x rdf:type o:Statement .
 _:x v:Population "3499879"^^xsd:integer .
 w:Berlin s:CapitalOf w:Germany .
 w:Germany s:Population Germany:Statement1 .

This idea is used in the concept of the concise bounded description 
(CBD) [1] of an RDF resource, which is defined as follows:

1. Include in the subgraph all statements in the source graph where the 
subject of the statement is the starting node;
2. Recursively, for all statements identified in the subgraph thus far 
having a blank node object, include in the subgraph all statements in 
the source graph where the subject of the statement is the blank node in 
question and which are not already included in the subgraph.

I think that at least Virtuoso directly supports returning CBDs for 
Linked Data lookups.
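
To illustrate, here is a small sketch of these two steps in Python with 
rdflib (again an assumption on my side, with invented namespace IRIs). 
Applied to the blank-node variant of the example, it returns exactly 
the four intended triples and stops at w:Germany, because w:Germany is 
an IRI and not a blank node:

from rdflib import Graph, URIRef, BNode

def cbd(graph, start):
    """Concise bounded description, following the two steps above:
    take all triples whose subject is the starting node, then
    recursively expand objects that are blank nodes."""
    result = Graph()
    todo, seen = [start], set()
    while todo:
        subject = todo.pop()
        if subject in seen:
            continue
        seen.add(subject)
        for s, p, o in graph.triples((subject, None, None)):
            result.add((s, p, o))
            if isinstance(o, BNode):
                todo.append(o)
    return result

# Blank-node variant of the example; all namespace IRIs are made up.
blank_node_version = """
@prefix w:       <http://example.org/entity/> .
@prefix s:       <http://example.org/statementProp/> .
@prefix v:       <http://example.org/valueProp/> .
@prefix o:       <http://example.org/ontology/> .
@prefix Germany: <http://example.org/statement/Germany/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

w:Berlin s:Population _:x .
_:x a o:Statement .
_:x v:Population "3499879"^^xsd:integer .
w:Berlin s:CapitalOf w:Germany .
w:Germany s:Population Germany:Statement1 .
"""

g = Graph().parse(data=blank_node_version, format="turtle")
description = cbd(g, URIRef("http://example.org/entity/Berlin"))
print(description.serialize(format="turtle"))
# Exactly the first four triples: the traversal follows the blank node
# _:x but stops at w:Germany, which is an IRI.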

I don't know how much of a problem this is for the Wikidata software 
itself, as it might know which statements to include, but it is 
certainly an issue for third parties that only use the RDF export.

Cheers,
Günter

[1] http://www.w3.org/Submission/CBD/

On 09.08.12 10:21, Denny Vrandečić wrote:
> Hi Daniel,
>
> thank you for the comments. This further validates the approach we
> have selected. I am also happy to see the relevant Provenance ontology
> properties listed for easier reference.
>
> I dislike blank nodes for several reasons, and I do not see any
> advantage for consumers or reusers of data when blank nodes are used.
> I see a minor advantage for authors, as they can omit the work of
> creating an IRI. If someone from the outside wanted to address a
> statement from Wikidata, e.g. to state that they like it, or that they
> consider it not true, etc., a blank node would not allow them to do
> so. IRIs seem a more natural choice for a web that wants to further
> interconnection and reuse.
>
> Cheers,
> Denny
>
>
>
> 2012/8/8 Daniel Garijo <dgarijo@fi.upm.es>:
>> Hi Denny,
>> sorry for jumping in a bit late in the thread.
>> In the Ontology Engineering Group we published last year a whole
>> provenance dataset [1] relying on the Open Provenance Model [2], which
>> also uses the n-ary pattern to qualify some properties (in a very
>> similar way to PROV). Although we are moving towards PROV, it may show
>> you how to publish and exploit your data, with a lot of examples :)
>>
>> If you are planning to add any provenance information (looking at the
>> wiki, the properties that may be useful for you are prov:wasDerivedFrom,
>> prov:wasRevisionOf, prov:wasInfluencedBy or prov:hadPrimarySource, as Jun
>> suggested), I would like to encourage you to align your approach with
>> PROV's; it will make your records more interoperable.
>>
>> Finally (and just for the record) you don't need to create ids for the
>> qualified statements when you want to add extra information. Sometimes
>> creating a blank node is enough. For example, the qualified ground
>> triple could be represented as:
>> w:Berlin s:Population [
>>          rdf:type o:Statement ;
>>          v:Population "3499879"^^xsd:integer ;
>>          q:As_of "2011-11-30"^^xsd:date ;
>>          q:Method w:Extrapolation ;
>>          rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en
>>      ] .
>> The approach you have followed is also valid.
>> I hope this helps.
>>
>> Cheers,
>> Daniel
>>
>> [1] http://webenemasuno.linkeddata.es/index_en.html (SPARQL with examples
>> at http://webenemasuno.linkeddata.es/sparql_en.html)
>> [2] http://openprovenance.org/
>>
>> 2012/8/8 Denny Vrandečić <denny.vrandecic@wikimedia.de>
>>>
>>> Hi Hugh,
>>>
>>> thank you for the pointer. I had heard about CIDOC CRM, but I had not
>>> realized how close it is to what we are doing. My trouble is that
>>> there are now at least 200 pages of specification for CIDOC CRM, and I
>>> tried to take a look at it, but I do not have the time to become an
>>> expert in CIDOC CRM myself.
>>>
>>> I would like to invite someone either to create a draft of how our data
>>> model interplays with CIDOC CRM (E17 specifically, it seems) and what
>>> effect this has on our export (my assumption is that some of the URIs we
>>> use can actually be replaced by CIDOC URIs), or to have a discussion with
>>> me to see how they fit together.
>>>
>>> But thank you very much; this indeed seems to be far closer to
>>> Wikidata than I expected.
>>>
>>> Cheers,
>>> Denny
>>>
>>>
>>> 2012/8/8 Hugh Glaser <hg@ecs.soton.ac.uk>:
>>>> Hi Denny,
>>>> Great stuff.
>>>> I've been watching the discussion, and am puzzled a bit about what you
>>>> are modelling.
>>>> This message gets me to ask :-)
>>>> What you are doing looks very much like what the cultural heritage
>>>> community (museums, libraries, archaeologists, etc.) does.
>>>> A model that seems to work very well for all this is CIDOC/CRM, which is
>>>> in active use or under consideration by a wide range of cultural
>>>> heritage organisations.
>>>> http://www.cidoc-crm.org/official_release_cidoc.html
>>>>
>>>> It is what we used for some work at the British Museum, and OCLC,
>>>> Europeana, and various archeological places, for example.
>>>>
>>>> CIDOC/CRM is an event-based model.
>>>> This means it looks slightly strange to people who are used to making
>>>> statements about things and thinking of them as true rather than as
>>>> opinions, but it comes very naturally once the power is understood; and
>>>> it is not hard to query.
>>>>
>>>> But it copes with statements made at different times, and even
>>>> conflicting ones.
>>>>
>>>> Just wondering if it is a closely related application area, and if you
>>>> had considered it.
>>>> Best
>>>> Hugh
>>>>
>>>>
>>>> On 8 Aug 2012, at 13:05, Denny Vrandečić <denny.vrandecic@wikimedia.de>
>>>>   wrote:
>>>>
>>>>> Hi Jun,
>>>>>
>>>>> thank you for taking the time to look into the document and comment it.
>>>>>
>>>>> I expect that Wikipedia will almost never be the source for a
>>>>> statement expressed in Wikidata, as Wikipedia will probably not be
>>>>> regarded as a reliable source.
>>>>>
>>>>> In the example you link to, there are two statements:
>>>>>
>>>>> 1) Berlin has a population of 3,499,879 as of Nov 30, 2011, the method
>>>>> for deriving this was an extrapolation
>>>>>
>>>>> 2) Berlin has a population of 8,000 as of the 15th century
>>>>>
>>>>> Statement 1 has in the example no sources, but a good source would be
>>>>> the statistical yearbook of Germany, 2012 edition.
>>>>>
>>>>> Statement 2 has one source in the example, and this could be, e.g., a
>>>>> scientific paper about the development of the population of European
>>>>> cities in medieval times.
>>>>>
>>>>> The method and the time are not provenance information, but qualifiers
>>>>> of the statement and thus part of the statement itself.
>>>>>
>>>>> Every statement has an IRI. And the source will also have an IRI
>>>>> describing it (i.e. an IRI for the statistical yearbook, an IRI for
>>>>> the mentioned paper).
>>>>>
>>>>> What I did not figure out is: which property from the provenance
>>>>> ontology can I use to connect the statement IRI to the source IRI?
>>>>>
>>>>> Thank you for your help!
>>>>>
>>>>> Cheers,
>>>>> Denny
>>>>>
>>>>>
>>>>>
>>>>> 2012/8/8 Jun Zhao <jun.zhao@zoo.ox.ac.uk>:
>>>>>> Hi Denny,
>>>>>>
>>>>>> I have been looking for the motivation of this work on your page [1]. I
>>>>>> guessed that your main goal was trying to express facts about the same
>>>>>> entity but coming from different perspectives and sources? Did you get
>>>>>> all of these diverse facts from Wikipedia? It would be nice to have
>>>>>> provenance statements that go one step further than just saying "Method
>>>>>> Extrapolation" or "as of the 15th century".
>>>>>>
>>>>>> The patterns you used here are highly related to PROV [2,3],
>>>>>> particularly the bundle and qualification structure of the latest PROV
>>>>>> data model. Please do not hesitate to ping us if you find any
>>>>>> impracticalities or even problems in the current model. We would really
>>>>>> appreciate your feedback!
>>>>>>
>>>>>> [1] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
>>>>>> [2] http://www.w3.org/TR/prov-dm/
>>>>>> [3] http://www.w3.org/TR/prov-o/
>>>>>>
>>>>>> Good work!
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Jun
>>>>>>
>>>>>>
>>>>>> On 08/08/2012 11:32, Denny Vrandečić wrote:
>>>>>>>
>>>>>>> Ivan,
>>>>>>>
>>>>>>> thank you! It is reassuring to hear that we are not without
>>>>>>> precedent :)
>>>>>>>
>>>>>>> We are investigating how we could use the provenance ontology, as we
>>>>>>> would certainly like to reuse existing vocabularies instead of
>>>>>>> inventing new ones.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Denny
>>>>>>>
>>>>>>> 2012/8/7 Ivan Herman <ivan@w3.org>:
>>>>>>>>
>>>>>>>> Denny,
>>>>>>>>
>>>>>>>> fwiw, the approach you take is very similar to the one the Provenance
>>>>>>>> Working Group took in the upcoming PROV vocabulary. Look, for
>>>>>>>> example, at
>>>>>>>>
>>>>>>>> http://www.w3.org/TR/prov-primer/
>>>>>>>>
>>>>>>>> and search for 'qualifiedXXXX'. Essentially, if there is a property
>>>>>>>> 'p', then 'qualifiedP' is another property whose range is an object
>>>>>>>> of a specific type that has the other information. Same as our
>>>>>>>> s:Population.
>>>>>>>>
>>>>>>>> As an aside, the good thing is that it may make it easier to use the
>>>>>>>> provenance vocabulary in your setup if you want to :-)
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> Ivan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Aug 6, 2012, at 12:03 , Denny Vrandečić wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> we have created the first draft of the Wikidata export in RDF.
>>>>>>>>>
>>>>>>>>> <http://meta.wikimedia.org/wiki/Wikidata/Development/RDF>
>>>>>>>>>
>>>>>>>>> I am inviting the Semantic Web and Linked Data community to a
>>>>>>>>> discussion about it.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Denny
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Project director Wikidata
>>>>>>>>> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>>>>>>>> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>>>>>>>>
>>>>>>>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens
>>>>>>>>> e.V.
>>>>>>>>> Eingetragen im Vereinsregister des Amtsgerichts
>>>>>>>>> Berlin-Charlottenburg
>>>>>>>>> unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
>>>>>>>>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ----
>>>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>>> mobile: +31-641044153
>>>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jun Zhao, PhD
>>>>>> EPSRC Postdoctoral Fellow
>>>>>> Department of Zoology
>>>>>> University of Oxford
>>>>>> Tinbergen Building, South Parks Road
>>>>>> Oxford, OX1 3PS, UK
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Project director Wikidata
>>>>> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>>>> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>>>>
>>>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
>>>>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
>>>>> unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
>>>>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Project director Wikidata
>>> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>>
>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
>>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
>>> unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
>>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>>>
>>
>
>
>


-- 
Dipl.-Inform. Günter Ladwig

Karlsruhe Institute of Technology (KIT)
Institute AIFB

Englerstraße 11 (Building 11.40, Room 238)
76131 Karlsruhe, Germany
Phone: +49 721 608-44754
Email: guenter.ladwig@kit.edu
Web: www.aifb.kit.edu

KIT – University of the State of Baden-Württemberg and National 
Large-scale Research Center of the Helmholtz Association

Received on Friday, 10 August 2012 12:47:38 UTC