Dataset vocabularies vs. interchange vocabularies (was: Re: DBpedia 3.2 release, including DBpedia Ontology and RDF links to Freebase)

On 17 Nov 2008, at 22:33, Hugh Glaser wrote:
> I am a bit uncomfortable with the idea of "you should use a:b from c  
> and d:e from f and g:h from i..."
> It makes for a fragmented view of my data, and might encourage me to  
> use things that do not capture exactly what I mean, as well as  
> introducing dependencies with things that might change, but over  
> which I have no control.
> So far better to use ontologies of type (b) where appropriate, and  
> define my own of type (a), which will (hopefully) be nicely  
> constructed, and easier to understand as smallish artefacts that can  
> be looked at as a whole.
> Of course, this means we need to crack the infrastructure that does  
> dynamic ontology mapping, etc.
> Mind you, unless we have the need, we are less likely to do so.
> I also think that the comments about the restrictions being a  
> characteristic of the dataset for type (a), but more like comments  
> on the world for type (b) are pretty good.

+1 on everything above.

I acknowledge that this is a minority POV at the moment.

The more common POV is: “Re-use classes and properties from
well-established vocabularies wherever you can. Don't invent your own
terms unless you absolutely have to.”

Interestingly, this somewhat echoes an old argument often heard in the
days of the “URI crisis” a few years ago: “We must avoid a
proliferation of URIs. We must avoid having lots of URIs for the same
thing. Re-use other people's identifiers wherever you can. Don't
invent your own unless you absolutely have to.”

I think that the emergence of linked data has shattered that argument.
One of the key practices of linked data is: “Mint your own URIs
when you publish new data. *Then* interlink it with other data by
setting sameAs links to existing identifiers.”
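
To make this concrete, here is a minimal sketch in Turtle (the
example.org URI is made up for illustration; the DBpedia URI is real):

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Step 1: mint your own URI for the thing you describe.
# Step 2: interlink it with an existing identifier via owl:sameAs.
<http://example.org/resource/Berlin>
    owl:sameAs <http://dbpedia.org/resource/Berlin> .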

The key insight is that linking yields many of the benefits of  
identifier re-use, while being much easier to manage due to the looser  
coupling.

That's the case for instance data. But a similar argument can be made
for vocabularies: “Create your own terms when you publish a new
dataset. *Then* interlink it with existing vocabularies by setting
subclass and subproperty links.”
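
Again as a sketch in Turtle (the ex: terms are hypothetical; FOAF and
Dublin Core stand in for well-known interchange vocabularies):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/vocab#> .

# Terms tailored to one dataset ...
ex:Author a rdfs:Class ;
    rdfs:subClassOf foaf:Person .      # ... linked upward to FOAF

ex:publishedBy a rdf:Property ;
    rdfs:subPropertyOf dc:publisher .  # ... and to Dublin Core

A consumer that only understands FOAF or Dublin Core can still make
sense of the data, with a little subclass/subproperty inference.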

I'm not sure if this is *always* appropriate. But I do believe that
there is nothing wrong with creating a vocabulary that is tailored to
your dataset, and *not* intended or designed for re-use by anyone
else, as long as you publish an RDFS/OWL description of your terms
and make an effort to include subclass/subproperty links to common
vocabularies in it.

Coming back to the point I was trying to make below in the thread:  
Tailored, dataset-specific or site-specific vocabularies are one kind  
of beast; designed-for-reuse interchange vocabularies are another. The  
purpose of the second kind is to serve as common superclasses/ 
superproperties for the first kind, as “linking hubs” so to speak, to  
enable queries or UIs that work across datasets and sites.

I don't see a problem with including tight restrictions such as  
restrictive domain/range statements or cardinality constraints in  
dataset vocabularies, if one finds them helpful for consistency  
checking or dynamic UIs. But in interchange vocabularies, tight  
restrictions hurt reusability, so it's usually better to go very easy  
on the harder RDFS and OWL features.
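
For example, a dataset vocabulary might reasonably say something like
this (hypothetical ex: terms again, in Turtle):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/vocab#> .

# "In *this* dataset, every book has exactly one publisher,
# and that publisher is always a person."
ex:publisher a owl:ObjectProperty ;
    rdfs:domain ex:Book ;
    rdfs:range  ex:Person .

ex:Book a owl:Class ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:publisher ;
        owl:cardinality "1"^^xsd:nonNegativeInteger
    ] .

An interchange vocabulary would drop or soften the domain, range and
cardinality statements, precisely so that the terms remain safe to
re-use in contexts their designer never anticipated.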

Best,
Richard


>
> Hugh
>
> On 17/11/2008 20:09, "Richard Cyganiak" <richard@cyganiak.de> wrote:
>
> John,
>
> Here's an observation from a bystander ...
>
> On 17 Nov 2008, at 17:17, John Goodwin wrote:
> <snip>
>> This is also a good example of where (IMHO) the domain was perhaps
>> over specified. For example all sorts of things could have
>> publishers, and not the ones listed here. I worry that if you reuse
>> DBpedia "publisher" elsewhere you could get some undesired  
>> inferences.
>
> But are the DBpedia classes *intended* for re-use elsewhere? Or do
> they simply express restrictions that apply *within DBpedia*?
>
> I think that in general it is useful to distinguish between two
> different kinds of ontologies:
>
> a) Ontologies that express restrictions that are present in a certain
> dataset. They simply express what's there in the data. In this sense,
> they are like database schemas: If "Publisher" has a range of
> "Person", then it means that the publisher *in this particular
> dataset* is always a person. That's not an assertion about the world,
> it's an assertion about the dataset. These ontologies are usually not
> very re-usable.
>
> b) Ontologies that are intended as a "lingua franca" for data exchange
> between different applications. They are designed for broad re-use,
> and thus usually do not add many restrictions. In this sense, they are
> more like controlled vocabularies of terms. Dublin Core is probably
> the prototypical example, and FOAF is another good one. They usually
> don't allow as many interesting inferences.
>
> I think that these two kinds of ontologies have very different
> requirements. Ontologies that are designed for one of these roles are
> quite useless if used for the other job. Ontologies that have not been
> designed for either of these two roles usually fail at both.
>
> Returning to DBpedia, my impression is that the DBpedia ontology is
> intended mostly for the first role. Maybe it should be understood more
> as a schema for the DBpedia dataset, and not so much as a re-usable
> set of terms for use outside of the Wikipedia context. (I might be
> wrong, I was not involved in its creation.)
>
> Richard
>

Received on Thursday, 20 November 2008 01:34:50 UTC