Re: Review of LLD vocabularies and datasets from Antoine Isaac on 2011-07-25 (public-lld@w3.org from July 2011)

From: Antoine Isaac <aisaac@few.vu.nl>
Date: Mon, 25 Jul 2011 21:59:28 +0200
To: Bernard Vatant <bernard.vatant@mondeca.com>
CC: public-lld@w3.org
Message-ID: <4E2DCB20.1060100@few.vu.nl>
Dear Bernard,

Thank you very much for your comments on the "Available vocabularies and datasets" deliverable, at
http://lists.w3.org/Archives/Public/public-lld/2011Jun/0048.html
This "fresh eye" look you've had is really useful for us :-)

We have already sent a first reaction (http://lists.w3.org/Archives/Public/public-lld/2011Jun/0072.html). Since then we have made some changes, listed below, to address your comments to the best of our abilities.

  
> Preliminary question : what is the main target of this document? to give linked data community the opportunity of understanding the specific viewpoint, resources and terminology used by the Library community? or to help Library people to enter the linked data universe? or both? Other?


Good point. We've tried to make our position more explicit, adding:
"
This document also tries to provide the linked data community with an opportunity to understand the specific viewpoint, resources and terminology used by the Library community for their data, while helping Library people to get a grasp of the linked data notions corresponding to their own traditions.
"
in the intro.


> General structure of the document : The introduction defines element sets first, then value vocabularies and finally datasets. But the rest of the document presents examples the other way round, first datasets, then value vocabularies, then element sets. Why such an inversion, apart from the stylistic beauty of chiasmus?


Following your remark and Monica's in her comments (http://lists.w3.org/Archives/Public/public-lld/2011Jun/0049.html), the order is now consistent.


> I keep being puzzled by the use of "element sets" and "value vocabularies" terminology. I must say that the first one in particular, for someone with background in maths, sounds like a very strange tautology (a set is made of elements by definition). Since this terminology has been discussed ad nauseam, I suppose it does make sense for the Library community. I always had the same feeling with Dublin Core "elements" anyway.


Yes, this is legacy terminology that we can't really hide, if we want to bridge with initiatives such as Dublin Core. Note that the re-ordering of definitions, starting with datasets, puts the explanation for "element" more at the forefront: "where each statement consists of an element ("attribute" or "relationship") of the entity, and a "value" for that element".

Note also that we can't avoid the confusion with a mathematical reading, in the RDF world: "elements" (library-meaning) correspond to both classes and properties, which happen to be "elements" (maths term) of RDF vocabularies (when seeing them as sets of classes and properties).
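
To make the two readings of "element" concrete, here is a minimal sketch in plain Python (not part of the deliverable). It assumes a statement can be modelled as an (entity, element, value) triple; the "ex:" identifiers and the helper names are hypothetical, chosen only for illustration, while the dcterms: properties are borrowed from Dublin Core.

```python
from typing import NamedTuple, Set


class Statement(NamedTuple):
    entity: str   # the thing being described
    element: str  # an attribute or relationship ("element" in the library sense)
    value: str    # the value given for that element


# A tiny dataset: a set of statements about one hypothetical book.
dataset: Set[Statement] = {
    Statement("ex:book1", "dcterms:title", "Moby-Dick"),
    Statement("ex:book1", "dcterms:creator", "ex:melville"),
    Statement("ex:book1", "dcterms:subject", "ex:whaling"),
}


def elements_used(statements: Set[Statement]) -> Set[str]:
    """Collect the elements used; each one is also an 'element' (maths sense) of this set."""
    return {s.element for s in statements}


def values_for(statements: Set[Statement], entity: str, element: str) -> Set[str]:
    """Return every value stated for one element of one entity."""
    return {s.value for s in statements
            if s.entity == entity and s.element == element}
```

Here elements_used(dataset) gives the "element set" actually used by the data, while a value such as "ex:whaling" would typically come from a value vocabulary.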



> As for "datasets" : in the general linked data world, a dataset is simply a consistent set of triples that you can query or download from a specific point. It's a technical, applicative definition, so it's orthogonal to the distinction between T-Box and A-Box (aka element sets and value vocabularies). Actually the distinction between metadata and data does not make much sense in the linked data universe. It's a continuum of information, and "it's triples all the way down".


We have tried to acknowledge this bias of our document: our datasets are in fact "library-related metadata datasets".
Trying to keep it as simple as possible, we've re-phrased the first sentence of the definition as
"In this report we focus on datasets as collections of structured metadata".
We've also expanded the definition, noting that Linked Data takes a more general view of what a "dataset" is, and hoping for convergence over time:
"
Note that in the Linked Data context, Datasets do not necessarily consist of clearly identifiable "records". They are merely consistent sets of triples that you can query or download from a specific point, without making a strict distinction between metadata and data. We expect this view to impact the way the library community conceives its own data, as (i) it creates or re-uses RDF vocabularies with domain and range settings and documentation that conform to best practices, and (ii) more application cases emerge, where "traditional" descriptive metadata is being used together with other types of data.
"


> In particular as soon as CKAN is introduced the distinction between "value vocabularies" and "datasets" is blurred, since in CKAN packages there is no such distinction. Moreover, in the illustrative diagram, bubbles are either proper datasets (in the sense defined in the introduction) or value vocabularies. This does not help to clarify the distinction made in the introduction.


You are right. To try to alleviate the issue we've re-worded the paragraph, trying to use CKAN's notion of "package" more prominently.
http://www.w3.org/2005/Incubator/lld/wiki/index.php?title=Vocabulary_and_Dataset&diff=5338&oldid=5336


> To go down to an example, many people will find strange to find Geonames, DBpedia or Freebase defined as "value vocabularies". In fact in Geonames for example there is a "value vocabulary" of feature classes and codes, actually included technically along with the geonames "element set" in the so-called "geonames ontology" at http://www.geonames.org/ontology.
> The dataset of individual geonames "features" (geographical entities) is more an authority list like VIAF.


We've added "features" as a clarification for what we consider to be a "value vocabulary" in Geonames. Following Monica's recommendation, we also added examples in that part of the definition.
As for DBpedia and Freebase, the specific paragraph for Freebase already contained this note: "Note that Freebase is essentially a dataset, but its including many reference resource can lead to using some parts of it as value vocabularies for certain cases." We've added a similar explanation for DBpedia:
http://www.w3.org/2005/Incubator/lld/wiki/index.php?title=Vocabulary_and_Dataset&diff=5339&oldid=5338


> So I would suggest to sort the list of "value vocabularies" into thesauri/classifications/subject headings on one side, and authority files on the other. And maybe make a distinction between resources developed in the library community framework, using state-of-the art methods of this community, from the crowd-sourced resources such as DBpedia, Freebase, or DBpedia.


We have tried to implement this, at least partly--the effort is not an easy one. We hope this will answer your concern.

Please tell us if you can live with these changes.

Thanks again for the very useful feedback,

Marcia, Antoine, Jeff, William
Received on Monday, 25 July 2011 19:58:33 UTC