Re: [ANN] LOD from Italian National Research Council from Kingsley Idehen on 2011-02-09 (public-lod@w3.org from February 2011)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 09 Feb 2011 17:31:44 -0500
To: bvilla@delicias.dia.fi.upm.es
CC: Aldo Gangemi <aldo.gangemi@cnr.it>, enrico.daga@cnr.it, Hugh Glaser <hg@ecs.soton.ac.uk>, Linked Data community <public-lod@w3.org>, Alberto Salvati <alberto.salvati@cnr.it>
Message-ID: <4D5315D0.5080604@openlinksw.com>
On 2/9/11 5:01 PM, Boris Villazón Terrazas wrote:
> Hi Aldo et al.
>
> Nice stuff! ;-)
>
> Regarding your question, I can tell you what we did within the context 
> of GeoLinkedData [1]
> We have separated the model/vocabulary from the data. So we have the 
> model/vocabulary in one Named Graph and the data in another Named Graph.
> According to Kingsley, it seems we are going in the right 
> direction .... Thanks Kingsley
>
> We have similar cases as you have, for example
> geoes:Provincia rdfs:subClassOf fao:territory [2]
> and a particular resource of type geoes:Provincia can be geores:Madrid 
> [3]
>
> We only materialize the instances of geoes:Provincia; we do not do so 
> for fao:territory.
> If I understand correctly what Kingsley says, we can conditionally 
> apply inference rules via SPARQL, so our alignment triples become 
> conditional and materialize for query evaluation purposes only, using 
> Virtuoso, right? This is something that we have to check.

Yes!
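
To make the idea concrete, here is a minimal Python sketch of "alignment triples in their own Named Graph, applied only at query time". It is a toy in-memory quad store, not Virtuoso's actual mechanism or API, and the two graph names are hypothetical. The triples are the simplified ones from Aldo's original message. The stored data graph is never modified; the rdfs:subPropertyOf and rdfs:subClassOf rules are consulted only while a query is being evaluated, and only if the caller opts in.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"

# Hypothetical graph names, chosen only for this illustration.
DATA_GRAPH = "urn:example:graph:data"
ALIGN_GRAPH = "urn:example:graph:alignments"

# Quad store: graph name -> set of (subject, predicate, object) triples.
quads = {
    DATA_GRAPH: {
        ("cnrdata:AldoGangemi", "cnr:coauthor", "cnrdata:EnricoDaga"),
        ("cnrdata:AldoGangemi", RDF_TYPE, "cnr:Researcher"),
    },
    ALIGN_GRAPH: {
        ("cnr:coauthor", SUBPROP, "foaf:knows"),
        ("cnr:Researcher", SUBCLASS, "foaf:Person"),
    },
}


def closure(term, relation, rules):
    """Return `term` plus every term reachable from it via `relation`."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for s, p, o in rules:
            if p == relation and s == current and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def match(pattern, data_graphs, rule_graphs=()):
    """Match an (s, p, o) pattern (None is a wildcard) over `data_graphs`.

    Rules found in `rule_graphs` are applied during evaluation only;
    nothing is written back to the store.
    """
    rules = set().union(*(quads[g] for g in rule_graphs))
    want_s, want_p, want_o = pattern
    results = set()
    for graph in data_graphs:
        for s, p, o in quads[graph]:
            # Each stored predicate also counts as its super-properties.
            for p2 in closure(p, SUBPROP, rules):
                # Each stored class also counts as its super-classes.
                objects = closure(o, SUBCLASS, rules) if p2 == RDF_TYPE else {o}
                for o2 in objects:
                    if (want_s in (None, s) and want_p in (None, p2)
                            and want_o in (None, o2)):
                        results.add((s, p2, o2))
    return results


# Without the alignment graph, foaf:knows is simply not there.
plain = match(("cnrdata:AldoGangemi", "foaf:knows", None), [DATA_GRAPH])

# With the alignment graph supplied as rules, it is inferred on the fly.
inferred = match(("cnrdata:AldoGangemi", "foaf:knows", None),
                 [DATA_GRAPH], rule_graphs=[ALIGN_GRAPH])
```

A consumer who disagrees with the alignments just leaves `rule_graphs` empty, or supplies their own rule graph, which is the "context lenses" point in a nutshell.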

Kingsley
>
> Best
>
> Boris
>
> P.S. BTW you can generate your sitemap files from your sparql endpoint 
> using sitemap4rdf [4] and then submit them to Google and Sindice ;-)
>
>
> [1] http://geo.linkeddata.es/
> [2]  http://geo.linkeddata.es/ontology/Provincia
> [3] http://geo.linkeddata.es/resource/Provincia/Madrid
> [4] http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
>
> On 09/02/2011 19:17, Kingsley Idehen wrote:
>> On 2/9/11 12:57 PM, Enrico Daga wrote:
>>> Dear Hugh, Kingsley, all,
>>>
>>> thank you both for your hints.
>>>
>>>>> Where the dataset owner agrees that, for example, dct:creator 
>>>>> aligns with
>>>>> pubblicazioni:autore, then perhaps you can.
>>>>> Of course, there is a question about why dct:creator was not used 
>>>>> in the
>>>>> first place, but it can be neat to simply use all your own 
>>>>> properties, so
>>>>> that's OK.
>>>>> But if the alignment is to go in the dataset, it should be part of 
>>>>> the
>>>>> knowledge capture process, not added by a third party.
>>> In our case we are the maintainer of the dataset and ontology so we
>>> know that a triple like
>>>
>>> pubblicazioni:autore rdfs:subClassOf dct:creator
>>>
>>> is correct.
>>
>> Yes, but "correct" is one of those subjective things when working at 
>> InterWeb scale. Thus, it's important that you partition your data 
>> using Named Graphs rather than work with a single graph.
>>
>> The approach above allows you to see things as you seek, while 
>> letting others do the same via their specific "context lenses".
>>
>>> Our point is mainly related to make the dataset easily reusable by the
>>> means of shared vocabularies even if those commonly known names have
>>> not been used in the process of dataset generation.
>>
>> Yes, but if you have the vocabulary triples (TBox) in a separate Named 
>> Graph you're fine. Or you can leave everything in your main graph, 
>> but place any inter-vocabulary mapping triples in a separate Named 
>> Graph.
>>
>>> We want our data to be self-explanatory, providing suggested alignments
>>> between our internal vocabulary and public ones, at least for very
>>> common cases, such as "abstract:Titolo" and "dc:title".
>>
>> Yes, this is all clear. The key is to partition your data, there's no 
>> downside bar deflection of barbs from those who see things 
>> differently due to their specific "context lenses" when dealing with 
>> your data.
>>
>>>>> So if a consumer of the data wanted to assert
>>>>> cnr:coauthor rdfs:subPropertyOf foaf:knows
>>>>> that is up to them and would be fine, but to enforce it seems not 
>>>>> good to
>>>>> me.
>>> Yes, in this case the alignment implies additional assumptions, but in
>>> principle we need this (maybe not exactly that...) to describe the
>>> dataset to non-cnr people.
>>
>> Again, that's fine, put the triples in a separate Named Graph. It 
>> won't adversely affect anything.
>>
>>>> In a nutshell, put the controversial stuff in its own Named Graph 
>>>> within
>>>> your Virtuoso instance. When making Linked Data Resources (e.g. 
>>>> HTML browser
>>>> pages) you can scope your SPARQL DESCRIBES or CONSTRUCTs to the 
>>>> main Graph
>>>> (the one without any alignment triples, etc.). The SPARQL endpoint 
>>>> stays as
>>>> the open ended access point to all data.
>>> So you suggest to use a separate graph, not involved in content
>>> negotiation but accessible through the sparql endpoint.
>>
>> I mean:
>>
>> 1. Your HTML pages (which use content negotiation and SPARQL DESCRIBE 
>> or CONSTRUCT queries to make Description Pages) can be scoped to the 
>> entire quad store or to specific Named Graphs.
>>
>> 2. The SPARQL endpoint is always open for people to query the entire 
>> collection of graphs or specific Named Graph combos.
>>
>> You have to decide how you want to project your world view to the 
>> public. Bottom line, the public always has a SPARQL endpoint so they 
>> can apply their specific "context lenses", assuming you choose to have 
>> your world view (including cross-vocabulary mappings) exposed in your 
>> Linked Data pages.
>>
>>> This solution could be good, but brings more/new questions.
>>> Let's say we create a new dataset <http://data.cnr.it/alignments>,
>>> what should it return?
>>> 1) alignments at the schema level between the CNR ontology and public
>>> well-known vocabularies, triples like "pubblicazioni:autore
>>> rdfs:subClassOf dct:creator"
>>> 2) the above plus materialized triples, for example:
>>>
>>>>>> cnrdata:AldoGangemi foaf:knows cnrdata:EnricoDaga
>>>>>> cnrdata:AldoGangemi rdf:type foaf:Person
>>> The first would leave the interpretation of the alignment to the
>>> client application, the second would duplicate knowledge, leading to
>>> maintainability issues (at least in the long term).
>>
>> Remember, when using Virtuoso you can conditionally apply inference 
>> rules via SPARQL, so your alignment triples become conditional and 
>> materialize for query evaluation purposes only, when you leverage this 
>> aspect of Virtuoso. You don't have to fully materialize these triples.
>>
>>> Another point is, if we choose solution (1) (only vocabulary
>>> alignments and no data)
>>> - how we formally (where and with which vocabulary) connect the
>>> dataset to its alignment?
>>> - how machines would learn that my vocabulary, in some part, could be
>>> interpreted as a variation of a more common set of terms?
>>
>> You can make all kinds of Linked Data description pages re. Virtuoso, 
>> maybe take a look at this Linked Data Deployment in 3 steps guide [1] 
>> to get a feel for how simple this has become.
>>
>> Links:
>>
>> 1. 
>> http://www.openlinksw.com/dataspace/kidehen@openlinksw.com/weblog/kidehen@openlinksw.com%27s%20BLOG%20%5B127%5D/1642 
>> -- how to simply load data into Virtuoso and start using Linked Data 
>> pages without hassles. Makes my comments clearer once you play around 
>> a bit.
>>
>> Kingsley
>>> Bests
>>>
>>> Enrico
>>>
>>> On 9 February 2011 16:12, Kingsley Idehen <kidehen@openlinksw.com> wrote:
>>>> On 2/9/11 8:33 AM, Hugh Glaser wrote:
>>>>> Hi Aldo,
>>>>> Nice stuff.
>>>>> Regarding vocabulary alignment.
>>>>> I would suggest you might want to keep it out of your dataset.
>>>>> Vocabulary alignment is a matter of opinion; of course your 
>>>>> dataset is
>>>>> opinion as well, but it is the opinion of the organisation, 
>>>>> whereas the
>>>>> vocabulary alignment you talk about might be somebody else's opinion.
>>>>> Where the dataset owner agrees that, for example, dct:creator 
>>>>> aligns with
>>>>> pubblicazioni:autore, then perhaps you can.
>>>>> Of course, there is a question about why dct:creator was not used 
>>>>> in the
>>>>> first place, but it can be neat to simply use all your own 
>>>>> properties, so
>>>>> that's OK.
>>>>> But if the alignment is to go in the dataset, it should be part of 
>>>>> the
>>>>> knowledge capture process, not added by a third party.
>>>>>
>>>>> In fact, the example you choose is great.
>>>>> It is not at all clear to me that
>>>>> cnr:coauthor rdfs:subPropertyOf foaf:knows
>>>>> is actually what an organisation would want to say.
>>>>> Even with the loosest meaning of foaf:knows, there will be 
>>>>> co-authors who
>>>>> do not foaf:knows each other (certainly in some fields).
>>>>> And some people would be upset that their organisation was 
>>>>> publishing data
>>>>> stating that they did.
>>>>> (I just checked the latest edition of Nature, and the two articles 
>>>>> each
>>>>> have upwards of 50 authors from all over the world; I'm sure many 
>>>>> of them
>>>>> have never communicated with each other, apart from this article.)
>>>>> One of the advantages of using your own ontology is that you are 
>>>>> never
>>>>> saying anything other than what you meant (whatever that might be 
>>>>> :-) )
>>>>>
>>>>> So if a consumer of the data wanted to assert
>>>>> cnr:coauthor rdfs:subPropertyOf foaf:knows
>>>>> that is up to them and would be fine, but to enforce it seems not 
>>>>> good to
>>>>> me.
>>>>>
>>>>> And to help them you might provide a separate document with the 
>>>>> alignments
>>>>> in them, so that they can pick them up if they want.
>>>>> And our policy is to do exactly the same with the identity management
>>>>> thing as well, which is actually a similar problem (and I would be 
>>>>> happy to
>>>>> discuss how to do that with you, but I think we would need to go 
>>>>> off-list
>>>>> for that, as we have had many discussions on the list about it ;-) )
>>>>>
>>>>> I know I haven't tackled the technical issues much, which is what 
>>>>> you are
>>>>> asking, but I always start at the socio :-)
>>>> Aldo and colleagues,
>>>>
>>>> Congrats re. your project!
>>>>
>>>> In a nutshell, put the controversial stuff in its own Named Graph 
>>>> within
>>>> your Virtuoso instance. When making Linked Data Resources (e.g. 
>>>> HTML browser
>>>> pages) you can scope your SPARQL DESCRIBES or CONSTRUCTs to the 
>>>> main Graph
>>>> (the one without any alignment triples, etc.). The SPARQL endpoint 
>>>> stays as
>>>> the open ended access point to all data.
>>>>
>>>> This area can get artificially confusing since DBMS architectures 
>>>> differ re.
>>>> SPARQL databases that support RDF resource import and query access. I
>>>> embarked on a somewhat similar exercise with @danbri last week re. 
>>>> DBpedia
>>>> and Open Archives Movies. In this case it wasn't about alignments 
>>>> per se,
>>>> but the fundamental principles re. partitioning and scope control are
>>>> ultimately the same.
>>>>
>>>> Links:
>>>>
>>>> 1. http://danbri.org/words/2011/02/01/658 -- post by Danbri about the
>>>> exercise
>>>> 2. http://kingsley.idehen.net/c/GOK2B -- actual PivotViewer page 
>>>> (click on
>>>> "edit" to see the SPARQL behind and note how DBpedia and Danbri's 
>>>> Graphs are
>>>> joined)
>>>>
>>>> Kingsley
>>>>> Best
>>>>> Hugh
>>>>>
>>>>> On 9 Feb 2011, at 09:58, Aldo Gangemi wrote:
>>>>>
>>>>>> Dear all, we are happy to announce the release of the beta 
>>>>>> version of
>>>>>> data.cnr.it and the Semantic Scout exploratory browser.
>>>>>>
>>>>>> data.cnr.it [1] is the linked open data version of the scientific 
>>>>>> data
>>>>>> from the Italian National Research Council, and it includes 
>>>>>> researchers,
>>>>>> institutes, research programmes, publications, topics, etc.
>>>>>> A Virtuoso-powered SPARQL endpoint is available at [4]; a top-down
>>>>>> browser is available at [5]; a voiD description is at [6].
>>>>>>
>>>>>> The Semantic Scout [2] is an experimental exploratory browser 
>>>>>> applied to
>>>>>> the data.cnr.it datasets, cf. a paper published at EKAW2010 [3] 
>>>>>> for details.
>>>>>>
>>>>>> data.cnr.it and the Semantic Scout have been designed by the 
>>>>>> Semantic
>>>>>> Technology Lab ([7], see [8] for credits) that comprises semantic 
>>>>>> web
>>>>>> researchers and engineers from ISTC-CNR (the Institute of 
>>>>>> Cognitive Sciences
>>>>>> and Technologies of the Italian National Research Council), and 
>>>>>> from the
>>>>>> Information Systems Unit of the Italian National Research Council.
>>>>>>
>>>>>> We have used linked data principles, and the datasets are based on
>>>>>> modular, pattern-based designed OWL ontologies [9]. Data have been
>>>>>> triplified from multiple CNR databases, and enriched by means of OWL
>>>>>> reasoning (ABox materialization and classification), as well as 
>>>>>> by NLP and
>>>>>> graph mining techniques, e.g. the topics for the researchers have 
>>>>>> been
>>>>>> learnt by an automatic categorization system that uses 
>>>>>> researchers' textual
>>>>>> signatures (textual records) against the textual signature 
>>>>>> (pages) of
>>>>>> DBpedia categories.
>>>>>>
>>>>>> Current work is on integrating a more robust identity management 
>>>>>> and its
>>>>>> possible integration with Okkam, a deeper voiD description of the 
>>>>>> datasets,
>>>>>> entity linking to other LOD datasets (e.g. DBLP), more vocabulary 
>>>>>> alignment
>>>>>> (currently limited to FOAF, SKOS, and DC), etc.
>>>>>>
>>>>>> Regarding the last point, we are discussing whether 
>>>>>> vocabulary
>>>>>> alignment should be reflected in the datasets by means of
>>>>>> materialization. This problem has pervasive consequences on the 
>>>>>> size of the
>>>>>> services vs. datasets that enable linked data consumption: any 
>>>>>> help from the
>>>>>> community about pros and cons of either approach? For example, 
>>>>>> if we
>>>>>> declare (schema level):
>>>>>>
>>>>>> cnr:coauthor rdfs:subPropertyOf foaf:knows
>>>>>> cnr:Researcher rdfs:subClassOf foaf:Person
>>>>>>
>>>>>> and we have e.g. in the data (*simplified names*):
>>>>>>
>>>>>> cnrdata:AldoGangemi cnr:coauthor cnrdata:EnricoDaga
>>>>>> cnrdata:AldoGangemi rdf:type cnr:Researcher
>>>>>>
>>>>>> should we materialize an additional dataset containing e.g.:
>>>>>>
>>>>>> cnrdata:AldoGangemi foaf:knows cnrdata:EnricoDaga
>>>>>> cnrdata:AldoGangemi rdf:type foaf:Person
>>>>>>
>>>>>> or should that be provided by a SPARQL endpoint under some 
>>>>>> entailment
>>>>>> regime?
>>>>>>
>>>>>> Consider that this is not only a matter of SPARQL efficiency vs. 
>>>>>> amount
>>>>>> of data, but also of data entanglement: e.g. when materialized, 
>>>>>> the topology
>>>>>> of linked datasets would be severely complicated by the 
>>>>>> multityping of
>>>>>> individuals.
>>>>>>
>>>>>> Thanks for any advice (there does not seem to be any best practice yet)
>>>>>> Ciao
>>>>>> Aldo, Enrico, Alberto
>>>>>>
>>>>>> [1] http://data.cnr.it
>>>>>> [2] http://bit.ly/semanticscout
>>>>>> [3] http://data.cnr.it/site/resources
>>>>>> [4] http://data.cnr.it/sparql/
>>>>>> [5] http://data.cnr.it/data/cnr/individuo/CNR
>>>>>> [6] http://data.cnr.it/data/http://data.cnr.it/dataset/
>>>>>> [7] http://stlab.istc.cnr.it
>>>>>> [8] http://data.cnr.it/site/contacts
>>>>>> [9] http://data.cnr.it/site/ontology
>>>>>>
>>>>>>
>>>>>> _____________________________________
>>>>>>
>>>>>> Aldo Gangemi
>>>>>> Senior Researcher
>>>>>> Semantic Technology Lab (STLab)
>>>>>> Institute for Cognitive Science and Technology,
>>>>>> National Research Council (ISTC-CNR)
>>>>>> Via Nomentana 56, 00161, Roma, Italy
>>>>>> Tel: +390644161535
>>>>>> Fax: +390644161513
>>>>>> aldo.gangemi@cnr.it
>>>>>> http://www.stlab.istc.cnr.it
>>>>>> http://www.istc.cnr.it/createhtml.php?nbr=71
>>>>>> skype aldogangemi
>>>>>> okkam ID: http://www.okkam.org/entity/ok200707031186131660596
>>>>>>
>>>>
>>>> -- 
>>>>
>>>> Regards,
>>>>
>>>> Kingsley Idehen
>>>> President & CEO
>>>> OpenLink Software
>>>> Web: http://www.openlinksw.com
>>>> Weblog: http://www.openlinksw.com/blog/~kidehen
>>>> Twitter/Identi.ca: kidehen
>>>>


-- 

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Wednesday, 9 February 2011 22:34:21 UTC