Re: Wikidata export in RDF from Markus Krötzsch on 2012-08-15 (semantic-web@w3.org from August 2012)

From: Markus Krötzsch <markus.kroetzsch@cs.ox.ac.uk>
Date: Wed, 15 Aug 2012 11:00:30 +0100
To: Günter Ladwig <guenter.ladwig@kit.edu>
CC: Denny Vrandečić <denny.vrandecic@wikimedia.de>, "semantic-web@w3.org Web" <semantic-web@w3.org>
Message-ID: <502B733E.3040609@cs.ox.ac.uk>

Hi Günter,

if I understand you correctly, you say that blank nodes are useful as a 
syntactic hint for grouping triples. I see the utility of this idea, but 
it strikes me as a major hack. Blank nodes have been introduced in order 
to have elements with a special semantics -- I am not saying that we 
want or need this semantics, but that's how it is. Syntactic grouping is 
a completely unrelated task. If a grouping scheme for triples is needed 
(IMHO it is; this is also in the spirit of Wikidata statements), then a 
more robust and flexible mechanism should be found. Named graphs maybe.

When using blank nodes, it becomes impossible to address statements. So 
one cannot browse through individual statements LOD style. Rather, one 
would have to rely on CBD to be implemented in such a way that the right 
triples are served for particular IRIs. So using blank nodes is not 
making CBD possible, it is making it mandatory. Going from IRIs to 
bnodes is easier than the other way around.
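
To make this concrete, here is a minimal sketch (in Python, not taken from
any Wikidata or Virtuoso code) of the two CBD steps quoted below, run on
the two example graphs from your mail. Triples are modelled as plain
tuples, and the names is_blank, cbd, IRI_GRAPH and BNODE_GRAPH are
assumptions made purely for illustration. The full CBD submission also
covers reifications, which this sketch leaves out.

```python
from collections import deque


def is_blank(term):
    """Assumed convention for this sketch: blank node labels start with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def cbd(graph, start):
    """Return the triples reachable from `start` by the two quoted CBD steps.

    Step 1: take every triple whose subject is the starting node.
    Step 2: for every blank node object found so far, also take the
            triples whose subject is that blank node, recursively.
    """
    result = []
    seen_triples = set()
    visited_nodes = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for triple in graph:
            subject, _predicate, obj = triple
            if subject != node or triple in seen_triples:
                continue
            seen_triples.add(triple)
            result.append(triple)
            # Only blank node objects are followed; IRIs end the description.
            if is_blank(obj) and obj not in visited_nodes:
                visited_nodes.add(obj)
                queue.append(obj)
    return result


# The example with an IRI for the statement.
IRI_GRAPH = [
    ("w:Berlin", "s:Population", "Berlin:Statement1"),
    ("Berlin:Statement1", "rdf:type", "o:Statement"),
    ("Berlin:Statement1", "v:Population", 3499879),
    ("w:Berlin", "s:CapitalOf", "w:Germany"),
    ("w:Germany", "s:Population", "Germany:Statement1"),
]

# The same example with a blank node for the statement.
BNODE_GRAPH = [
    ("w:Berlin", "s:Population", "_:x"),
    ("_:x", "rdf:type", "o:Statement"),
    ("_:x", "v:Population", 3499879),
    ("w:Berlin", "s:CapitalOf", "w:Germany"),
    ("w:Germany", "s:Population", "Germany:Statement1"),
]

if __name__ == "__main__":
    print(len(cbd(BNODE_GRAPH, "w:Berlin")))          # 4 triples
    print(len(cbd(IRI_GRAPH, "w:Berlin")))            # 2 triples
    print(len(cbd(IRI_GRAPH, "Berlin:Statement1")))   # 2 triples
```

With blank nodes, the lookup on w:Berlin returns the statement's triples
only because the server implements CBD. With IRIs, the same lookup stops
at Berlin:Statement1, but that IRI can then be looked up on its own, which
is exactly what a blank node does not allow.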

Best,

Markus


On 10/08/12 13:47, Günter Ladwig wrote:
> Hi Denny,
>
> I haven't followed the thread completely, but I see a problem when using
> IRIs instead of blank nodes for n-ary relations, namely deciding which
> triples to include when publishing Linked Data.
>
> Consider an example I (partially) made up:
>
>      w:Berlin s:Population Berlin:Statement1 .
>      Berlin:Statement1 rdf:type o:Statement .
>      Berlin:Statement1 v:Population "3499879"^^xsd:integer .
>      w:Berlin s:CapitalOf w:Germany .
>      w:Germany s:Population Germany:Statement1 .
>
> A Linked Data lookup on w:Berlin should return the first four triples,
> including the triples about Berlin:Statement1, but should exclude the
> last triple. However, it is not straightforward to differentiate between
> these triples as they are both connected via an IRI.
>
> Using blank nodes, this distinction would become clear:
>
>      w:Berlin s:Population _:x .
>      _:x rdf:type o:Statement .
>      _:x v:Population "3499879"^^xsd:integer .
>      w:Berlin s:CapitalOf w:Germany .
>      w:Germany s:Population Germany:Statement1 .
>
> This idea is used in the concept of concise bounded description (CBD)
> [1] of an RDF resource, which defines the CBD of a resource as follows:
>
> 1. Include in the subgraph all statements in the source graph where the
> subject of the statement is the starting node;
> 2. Recursively, for all statements identified in the subgraph thus far
> having a blank node object, include in the subgraph all statements in
> the source graph where the subject of the statement is the blank node in
> question and which are not already included in the subgraph.
>
> I think that at least Virtuoso directly supports returning CBDs for
> Linked Data lookups.
>
> I don't know how much of a problem this is for the Wikidata software
> itself as it might know which statements to include, but it is certainly
> different for third parties that only use the RDF export.
>
> Cheers,
> Günter
>
> [1] http://www.w3.org/Submission/CBD/
>
> On 09.08.12 10:21, Denny Vrandečić wrote:
>> Hi Daniel,
>>
>> thank you for the comments. This further validates the approach we
>> have selected. I am also happy to see the relevant Provenance ontology
>> properties listed for easier reference.
>>
>> I dislike blank nodes due to several reasons, and I do not see any
>> advantage for consumers or reusers of data when blank nodes are used.
>> I see a minor advantage for authors, as they can omit the work of
>> creating an IRI. If someone from the outside wanted to address a
>> statement from Wikidata, e.g. to state that they like it, or that they
>> consider it not true, etc., a blank node would not allow them to do
>> so. IRIs seem a more natural choice for a web that wants to further
>> interconnection and reuse.
>>
>> Cheers,
>> Denny
>>
>>
>>
>> 2012/8/8 Daniel Garijo <dgarijo@fi.upm.es>:
>>> Hi Denny,
>>> sorry for jumping in a bit late in the thread.
>>> In the Ontology Engineering Group we published last year a whole
>>> provenance dataset [1] relying on the Open Provenance Model [2], which
>>> also uses the n-ary pattern to qualify some properties (in a very
>>> similar way to PROV). Although we are moving towards PROV, it may show
>>> you how to publish and exploit your data with a lot of examples :)
>>>
>>> If you are planning to add any provenance information (looking at the
>>> wiki, the properties that may be useful for you are prov:wasDerivedFrom,
>>> prov:wasRevisionOf, prov:wasInfluencedBy or prov:hadPrimarySource, as
>>> Jun suggested), I would like to encourage you to align your approach
>>> with PROV's; it will make your records more interoperable.
>>>
>>> Finally (and just for the record) you don't need to create ids for the
>>> qualified statements when you want to add extra information. Sometimes
>>> creating a blank node is enough. For example, the qualified ground
>>> triple could be represented as:
>>> w:Berlin s:Population [
>>>          rdf:type o:Statement ;
>>>          v:Population "3499879"^^xsd:integer ;
>>>          q:As_of "2011-11-30"^^xsd:date ;
>>>          q:Method w:Extrapolation ;
>>>          rdfs:label "3,499,879 (As of Nov 30, 2011, Method Extrapolation)"@en
>>>      ] .
>>> The approach you have followed is also valid.
>>> I hope this helps.
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1].- http://webenemasuno.linkeddata.es/index_en.html (SPARQL with
>>> examples at http://webenemasuno.linkeddata.es/sparql_en.html)
>>> [2].- http://openprovenance.org/
>>>
>>> 2012/8/8 Denny Vrandečić <denny.vrandecic@wikimedia.de>
>>>>
>>>> Hi Hugh,
>>>>
>>>> thank you for the pointer. I had heard about CIDOC CRM, but I had not
>>>> realized how close it is to what we are doing. My trouble is that
>>>> there are now at least 200 pages of specification for CIDOC CRM, and I
>>>> tried to take a look at it, but I do not have the time to become an
>>>> expert in CIDOC CRM myself.
>>>>
>>>> I would invite someone either to create a draft of how our data model
>>>> interplays with CIDOC CRM (E17 specifically, it seems) and what effect
>>>> this has on our export (my assumption is that some of the URIs we use
>>>> can actually be replaced by CIDOC URIs), or to have a discussion with
>>>> me to see how they fit together.
>>>>
>>>> But thank you very much, this seems to be indeed much closer to
>>>> Wikidata than I expected, by far.
>>>>
>>>> Cheers,
>>>> Denny
>>>>
>>>>
>>>> 2012/8/8 Hugh Glaser <hg@ecs.soton.ac.uk>:
>>>>> Hi Denny,
>>>>> Great stuff.
>>>>> I've been watching the discussion, and am puzzled a bit about what you
>>>>> are modelling.
>>>>> This message gets me to ask :-)
>>>>> What you are doing looks very like what cultural heritage (museums,
>>>>> libraries, archeologists, etc.) do.
>>>>> A model that seems to work very well for all this is CIDOC/CRM, which
>>>>> is in active use or consideration by a wide range of organisations
>>>>> from cultural heritage.
>>>>> http://www.cidoc-crm.org/official_release_cidoc.html
>>>>>
>>>>> It is what we used for some work at the British Museum, and OCLC,
>>>>> Europeana, and various archeological places, for example.
>>>>>
>>>>> CIDOC/CRM is an event-based model.
>>>>> This means it looks slightly strange to people who are used to making
>>>>> statements about things, and think they are true, rather than an
>>>>> opinion, but it comes very naturally once the power is understood; and
>>>>> it is not hard to query.
>>>>>
>>>>> But it copes with statements made at different times, and even
>>>>> conflicting ones.
>>>>>
>>>>> Just wondering if it is a closely related application area, and if you
>>>>> had considered it.
>>>>> Best
>>>>> Hugh
>>>>>
>>>>>
>>>>> On 8 Aug 2012, at 13:05, Denny Vrandečić
>>>>> <denny.vrandecic@wikimedia.de> wrote:
>>>>>
>>>>>> Hi Jun,
>>>>>>
>>>>>> thank you for taking the time to look into the document and
>>>>>> comment on it.
>>>>>>
>>>>>> I expect that Wikipedia will almost never be the source for a
>>>>>> statement expressed in Wikidata, as Wikipedia will probably not be
>>>>>> regarded as a reliable source.
>>>>>>
>>>>>> In the example you link to are two statements:
>>>>>>
>>>>>> 1) Berlin has a population of 3,499,879 as of Nov 30, 2011, the
>>>>>> method for deriving this was an extrapolation
>>>>>>
>>>>>> 2) Berlin has a population of 8,000 as of the 15th century
>>>>>>
>>>>>> Statement 1 has in the example no sources, but a good source would be
>>>>>> the statistical yearbook of Germany, 2012 edition.
>>>>>>
>>>>>> Statement 2 has one source in the example, and this could be, e.g. a
>>>>>> scientific paper about the development of the population of European
>>>>>> cities in medieval times.
>>>>>>
>>>>>> The method and the time are both not provenance information, but
>>>>>> qualifiers of the statements and thus part of the statement.
>>>>>>
>>>>>> Every statement has an IRI. And the source will also have an IRI
>>>>>> describing it (i.e. an IRI for the statistical yearbook, an IRI for
>>>>>> the mentioned paper).
>>>>>>
>>>>>> What I did not figure out is: which property from the provenance
>>>>>> ontology can I use to connect the statement IRI to the source IRI?
>>>>>>
>>>>>> Thank you for your help!
>>>>>>
>>>>>> Cheers,
>>>>>> Denny
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2012/8/8 Jun Zhao <jun.zhao@zoo.ox.ac.uk>:
>>>>>>> Hi Denny,
>>>>>>>
>>>>>>> I have been looking for the motivation of this work on your page
>>>>>>> [1]. I guessed that your main goal was trying to express facts about
>>>>>>> the same entity but coming from different perspectives and sources?
>>>>>>> Did you get all of these diverse facts from wikipedia? It would be
>>>>>>> nice to have provenance statements that go one step further than
>>>>>>> just saying "Method Extrapolation" or "as of 15th century".
>>>>>>>
>>>>>>> The patterns you used here are highly related to PROV [2,3],
>>>>>>> particularly the bundle and qualification structure of the latest
>>>>>>> PROV data model. Please do not hesitate to ping us if you find any
>>>>>>> impracticality or even problems in the current model. We will really
>>>>>>> appreciate your feedback!
>>>>>>>
>>>>>>> [1] http://meta.wikimedia.org/wiki/Wikidata/Development/RDF
>>>>>>> [2] http://www.w3.org/TR/prov-dm/
>>>>>>> [3] http://www.w3.org/TR/prov-o/
>>>>>>>
>>>>>>> Good work!
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Jun
>>>>>>>
>>>>>>>
>>>>>>> On 08/08/2012 11:32, Denny Vrandečić wrote:
>>>>>>>>
>>>>>>>> Ivan,
>>>>>>>>
>>>>>>>> thank you! It is reassuring to hear that we are not without
>>>>>>>> precedent :)
>>>>>>>>
>>>>>>>> We are investigating how we could use the provenance ontology, as
>>>>>>>> we sure would like to reuse existing stuff instead of inventing
>>>>>>>> something new.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Denny
>>>>>>>>
>>>>>>>> 2012/8/7 Ivan Herman <ivan@w3.org>:
>>>>>>>>>
>>>>>>>>> Denny,
>>>>>>>>>
>>>>>>>>> fwiw, the approach you take is very similar to what the Provenance
>>>>>>>>> Working group took in the upcoming Prov vocabulary. Look, for
>>>>>>>>> example, in
>>>>>>>>>
>>>>>>>>> http://www.w3.org/TR/prov-primer/
>>>>>>>>>
>>>>>>>>> and for 'qualifiedXXXX'. Essentially, if there is a property 'p'
>>>>>>>>> then the 'qualifiedP' is another property whose range is an object
>>>>>>>>> of a specific type that has the other information. Same as your
>>>>>>>>> s:Population.
>>>>>>>>>
>>>>>>>>> As an aside, the good thing is that it may make it easier to use
>>>>>>>>> the provenance vocabulary in your setup if you want to :-)
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> Ivan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Aug 6, 2012, at 12:03 , Denny Vrandečić wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> we have created the first draft of the Wikidata export in RDF.
>>>>>>>>>>
>>>>>>>>>> <http://meta.wikimedia.org/wiki/Wikidata/Development/RDF>
>>>>>>>>>>
>>>>>>>>>> I am inviting the Semantic Web and Linked Data community to a
>>>>>>>>>> discussion about it.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Denny
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Project director Wikidata
>>>>>>>>>> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>>>>>>>>> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>>>>>>>>>
>>>>>>>>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens
>>>>>>>>>> e.V.
>>>>>>>>>> Eingetragen im Vereinsregister des Amtsgerichts
>>>>>>>>>> Berlin-Charlottenburg
>>>>>>>>>> unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
>>>>>>>>>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ----
>>>>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>>>> mobile: +31-641044153
>>>>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jun Zhao, PhD
>>>>>>> EPSRC Postdoctoral Fellow
>>>>>>> Department of Zoology
>>>>>>> University of Oxford
>>>>>>> Tinbergen Building, South Parks Road
>>>>>>> Oxford, OX1 3PS, UK
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>
>


-- 
Dr. Markus Kroetzsch
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529               http://korrekt.org/
Received on Wednesday, 15 August 2012 10:00:55 UTC