Review of LLD vocabularies and datasets from Bernard Vatant on 2011-06-21 (public-lld@w3.org from June 2011)

From: Bernard Vatant <bernard.vatant@mondeca.com>
Date: Tue, 21 Jun 2011 15:52:06 +0200
To: public-lld@w3.org
Cc: "emmanuelle.bermes" <emmanuelle.bermes@bnf.fr>
Message-ID: <BANLkTi=di9ZuVgeNgGCw33QqShWzxj3-7Q@mail.gmail.com>

Hello all

Emmanuelle has asked me to review the draft currently at
http://www.w3.org/2005/Incubator/lld/wiki/Vocabulary_and_Dataset with a
"fresh eye".
Here are a few comments.

Preliminary question: what is the main target of this document? To give
the linked data community the opportunity to understand the specific
viewpoint, resources and terminology used by the Library community? Or to
help Library people enter the linked data universe? Or both? Other?

General structure of the document : The introduction defines element sets
first, then value vocabularies and finally datasets. But the rest of the
document presents examples the other way round, first datasets, then value
vocabularies, then element sets. Why such an inversion, apart from the
stylistic beauty of chiasmus?

I keep being puzzled by the "element sets" and "value vocabularies"
terminology. I must say that the first one in particular, for someone with
a background in maths, sounds like a very strange tautology (a set is made
of elements by definition). Since this terminology has been discussed ad
nauseam, I suppose it does make sense for the Library community. I always
had the same feeling about Dublin Core "elements" anyway.

As for "datasets" : in the general linked data world, a dataset is simply a
consistent set of triples that you can query or download from a specific
point. It's a technical, applicative definition, so it's orthogonal to the
distinction between T-Box and A-Box (aka element sets and value
vocabularies). Actually the distinction between metadata and data does not
make much sense in the linked data universe. It's a continuum of
information, and "it's triples all the way down".
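To make the point concrete, here is a minimal Python sketch. It is only an
illustration: every "ex:" identifier is invented, triples are plain tuples
rather than objects from a real RDF library, and the rule used to separate
T-Box from A-Box (subjects typed as a class or property) is a simplifying
assumption. What it shows is that schema statements and instance statements
have exactly the same shape and live in the same dataset; the split is
something you compute on top, not something the dataset imposes.

```python
# Hypothetical data: all "ex:" names are made up for illustration.
RDF_TYPE = "rdf:type"
# Assumption: a subject typed with one of these is treated as a schema term.
SCHEMA_TYPES = {"rdfs:Class", "owl:Class", "rdf:Property"}

# One dataset: a flat list of (subject, predicate, object) triples.
dataset = [
    ("ex:Book", "rdf:type", "rdfs:Class"),
    ("ex:author", "rdf:type", "rdf:Property"),
    ("ex:author", "rdfs:domain", "ex:Book"),
    ("ex:book1", "rdf:type", "ex:Book"),
    ("ex:book1", "ex:author", "ex:person1"),
]


def schema_terms(triples):
    """Return the subjects that are declared as a class or a property."""
    return {s for s, p, o in triples if p == RDF_TYPE and o in SCHEMA_TYPES}


def split_tbox_abox(triples):
    """Partition triples by whether their subject is a schema term."""
    terms = schema_terms(triples)
    tbox = [t for t in triples if t[0] in terms]
    abox = [t for t in triples if t[0] not in terms]
    return tbox, abox


tbox, abox = split_tbox_abox(dataset)
print(len(tbox), "schema triples,", len(abox), "instance triples")
```

Both halves come out of the same list and can be queried or downloaded
together, which is why "dataset" cuts across the other two categories.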

In particular, as soon as CKAN is introduced, the distinction between "value
vocabularies" and "datasets" is blurred, since in CKAN packages there is no
such distinction. Moreover, in the illustrative diagram, bubbles are either
proper datasets (in the sense defined in the introduction) or value
vocabularies. This does not help to clarify the distinction made in the
introduction.

To go down to an example, many people will find it strange to see Geonames,
DBpedia or Freebase defined as "value vocabularies". In fact, in Geonames
for example, there is a "value vocabulary" of feature classes and codes,
which is technically included along with the geonames "element set" in the
so-called "geonames ontology" at http://www.geonames.org/ontology.
The dataset of individual geonames "features" (geographical entities) is
more of an authority list like VIAF.

So I would suggest sorting the list of "value vocabularies" into
thesauri/classifications/subject headings on one side, and authority files
on the other. And maybe distinguishing resources developed within the
library community framework, using the state-of-the-art methods of this
community, from crowd-sourced resources such as DBpedia or Freebase.

Best

Bernard


-- 
Bernard Vatant
Senior Consultant
Vocabulary & Data Integration
Tel:       +33 (0) 971 488 459
Mail:     bernard.vatant@mondeca.com
----------------------------------------------------
Mondeca
3, cité Nollez 75018 Paris France
Web:    http://www.mondeca.com
Blog:    http://mondeca.wordpress.com
----------------------------------------------------

Received on Tuesday, 21 June 2011 13:52:34 UTC