Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices from Denny Vrandecic on 2010-10-22 (semantic-web@w3.org from October 2010)

From: Denny Vrandecic <denny.vrandecic@kit.edu>
Date: Thu, 21 Oct 2010 23:43:53 -0700
To: Martin Hepp <martin.hepp@ebusiness-unibw.org>
Cc: Kingsley Idehen <kidehen@openlinksw.com>, public-lod <public-lod@w3.org>, Enrico Motta <e.motta@open.ac.uk>, Chris Bizer <chris@bizer.de>, Thomas Steiner <tsteiner@google.com>, Semantic Web <semantic-web@w3.org>, Anja Jentzsch <anja@anjeve.de>, semanticweb <semanticweb@yahoogroups.com>, Giovanni Tummarello <giovanni.tummarello@deri.org>, Mathieu d'Aquin <m.daquin@open.ac.uk>
Message-Id: <3E87ED8F-900D-4FC7-B369-68921E105479@kit.edu>
I usually dislike commenting on such discussions, as I don't find them particularly productive, but since 1) the number of people pointing me to this thread is growing, 2) it contains some wrong statements, and 3) I feel that this thread has been hijacked from a topic that I consider productive and important, I hope you won't mind me giving a comment. I wanted to keep it brief, but I failed.

Let's start with the wrong statements:

First, although I take responsibility as a co-creator for Linked Open Numbers, I surely cannot take full credit for it. The dataset was a shared effort by a number of people in Karlsruhe over a few days, and thus calling the whole thing "Denny's numbers dataset" is simply wrong due to the effort spent by my colleagues on it. It is fine to call it "Karlsruhe's numbers dataset" or simply Linked Open Numbers, but providing me with the sole attribution is too much of an honor.

Second, although it is claimed that Linked Open Numbers are "by design and known to everybody in the core community, not data but noise", being one of the co-designers of the system I have to disagree. It is not "noise by design". One of my motivations for LON was to raise a few points for discussion, and at the same time to provide a dataset fully adhering to Linked Open Data principles. We were obviously able to get the first goal right, and we didn't do too badly on the second, even though we got an interesting list of bugs from Richard Cyganiak, which, regrettably, we still have not fixed. I am very sorry for that. But, to make the point very clear again, this dataset was designed to follow LOD principles as well as possible, to be correct, and to have an implementation so simple that the service is usually up, so anyone can use LON as a testing ground. From a number of mails and personal communications I know that LON has been used in that sense, and some developers even found it useful for other features, like our provision of number names in several languages. So what is called "noise by design" here is actually an actively used dataset that managed to raise, as we had hoped, discussions about the point of counting triples, was a factor in the discussion about literals as subjects, made us rethink the notion of "semantics" and the computational properties of RDF entities, and is involved in the discussion about the quality of LOD. With respect to that, in my opinion, LON has met and exceeded its expectations, but I understand anyone who disagrees. Besides that, it was, and is, huge fun.

Now to some topics of the discussion:

On the issue of the LOD cloud diagram: I want to express my gratitude to all the people involved for the effort they voluntarily put into its development and maintenance. I find it especially great that it is becoming increasingly transparent how the diagram is created and how the datasets are selected. Chris has referred to a set of conditions that are expected for inclusion, and before the creation of the newest iteration there was an explicit call on this mailing list to gather more information. I can only echo the sentiment that if someone is unhappy with that diagram, they are free to create their own and put it online. The data is available, the SVG is available and editable, and both use licenses that allow modification and republishing.

Enrico is right that a system like Watson (or Sindice), which automatically gathers datasets from the Web instead of using a manually submitted and managed catalog, will probably turn out to be the better approach. Watson used to have an overview with statistics on its current content, and I really loved that overview, but this feature has been disabled for a few months. If it were available, especially in a graphical format that can easily be reused in slides -- for example, graphs on the growth of the number of triples, datasets, etc., or graphs on the change of cohesion, vocabulary reuse, etc. over time within the Watson corpus -- I have no doubt that such graphs and data would be widely reused, and would in many instances replace the current usage of the cloud diagram. (I am furthermore curious about Enrico's statement that the Semantic Web =/= Linked Open Data and wonder what he means here, but that is a completely different thread.)

Finally, to what I consider most important in this thread:

I also find it a shame that this thread has been hijacked, especially since the original topic was so interesting. The original email by Anja was not about the LOD cloud, but rather about -- as the title of the thread still suggests -- the compliance of LOD with some best practices. Instead of the question "is X in the diagram", I would much rather see a discussion on "are the selected quality criteria good criteria? why are some of them so rarely followed? how can we improve the situation?" Anja has pointed to a wealth of openly available numbers (no pun intended) that have not been discussed at all. For example, only 7.5% of the data sources provide a mapping of "proprietary vocabulary terms" to "other vocabulary terms". For anyone building applications to work with LOD, this is a real problem.

Whenever I was working on actual applications using LOD, I got disillusioned. The current state of LOD is simply insufficient to sustain serious application development on top of it. Current best practices (like follow-your-nose) are theoretically sufficient, but not fully practical. To just give a few examples:
* imagine you get an RDF file with some 100 triples, including some 120 vocabulary terms. In order to actually display those, you need the label for every single one of these terms, preferably in the user's language. But most RDF files do not provide such labels for terms they merely reference. In order to actually display them, we need to resolve all these 120 terms, i.e. we need to make more than a hundred calls to the Web -- and we are only talking about the display of a single file! In Semantic MediaWiki we had, from the beginning, made sure that all referenced terms are accompanied by some minimum definition, providing labels, types, etc., which enables tools to at least create a display quickly and then gather further data, but that practice was not adopted. Never mind the fact that language labels are basically not used for multi-linguality (check out Chapter 4 of my thesis for the data, it's devastating).
* URIs. Perfectly valid URIs, e.g. those used in Geonames such as http://sws.geonames.org/3202326/, suddenly cause trouble, because their serialization as a QName is, well, problematic.
* missing definitions. E.g. DBpedia has the properties http://dbpedia.org/ontology/capital and http://dbpedia.org/property/capital -- used in the very same file about the same country. Resolving them will not help you at all to figure out how they relate to each other. As a human I may make an educated guess, but a machine agent? And in this case we are talking about the *same* data provider, never mind cross-data-provider mapping.
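
To illustrate the QName point from the list above, here is a minimal sketch in Python. It assumes a simplified, ASCII-only approximation of the XML NCName rule (the real rule admits many more characters), and the names NCNAME and split_qname are made up for this illustration; they do not come from any RDF library. The idea is that a URI can only be abbreviated as prefix:local if some suffix of it is a valid local name.

```python
import re

# Simplified, ASCII-only stand-in for an XML NCName (assumption: the real
# production also allows a large set of non-ASCII characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def split_qname(uri):
    """Return (namespace, local_name) for the longest valid local name,
    or None if no suffix of the URI can serve as a local name."""
    # Scan split points from left to right, so the first match is the
    # longest suffix that is a valid (simplified) NCName.
    for i in range(1, len(uri)):
        local = uri[i:]
        if NCNAME.fullmatch(local):
            return uri[:i], local
    return None


# A DBpedia property splits cleanly after the last slash.
print(split_qname("http://dbpedia.org/ontology/capital"))

# The Geonames URI ends in "/", so no suffix is a valid local name.
print(split_qname("http://sws.geonames.org/3202326/"))
```

Under these assumptions the DBpedia property yields a namespace and the local name "capital", while the Geonames URI yields no split at all, which is why a serializer that insists on QNames has nothing to write for it.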

I could go on for a while -- and these are just examples *on top* of the problems that Anja raises in her original post, and I am sure that everyone who has actually used LOD from the wild has stumbled upon even more such problems. She is raising a very important point here for the practical application of the data. But instead of discussing these issues that actually matter, we talk about bubble graphs that are created and maintained voluntarily, and about why a dataset is included or not, even though the criteria have been made transparent and explicit. All these issues seriously hamper the uptake of LOD and lead to the result that in many cases it is so much easier to use dedicated, proprietary APIs.

At one point it was stated that Chris' criteria were random and hard to fulfill in certain cases. If you asked me, I would suggest much more draconian criteria, in order to make data reuse as simple as we all envision. I really enjoy the work of the pedantic web group in this respect, providing validators and guidelines, but in order to figure out what really needs to be done, and what the criteria for good data on the Semantic Web need to look like, we need to get back to Anja's original questions. I think that is a question we may try to tackle in Shanghai in some form; I at least would find that an interesting topic.

Sorry again for the length of this rant, and I hope I have offended everyone equally, I really tried not to single anyone out,
Denny

P.S.: Finally, a major reason why I think I shouldn't have commented on this thread is that it involves something I co-created, and thus I am afraid it is impossible to stay unbiased. I consider constant advertising of your own ideas tiring, impolite, and bound to lead to unproductive discussions due to emotional investment. If the work you do is good enough, you will find champions for it. If not, improve it or do something else.



On Oct 21, 2010, at 20:56, Martin Hepp wrote:

> Hi all:
> 
> I think that Enrico really made two very important points:
> 
> 1. The LOD bubbles diagram has very high visibility inside and outside of the community (up to the point that broad audiences believe the diagram would define relevance or quality).
> 
> 2. Its creators have a special responsibility (in particular as scientists) to maintain the diagram in a way that enhances insight and understanding, rather than conveying false facts and confusing people.
> 
> So Kingsley's argument that anybody could provide a better diagram does not really hold. It will harm the community as a whole, sooner or later, if the diagram misses the point, simply based on the popularity of this diagram.
> 
> And to be frank, despite other design decisions, it is really ridiculous that Chris justifies the inclusion of Denny's numbers dataset as valid Linked Data, because that dataset is, by design and known to everybody in the core community, not data but noise.
> 
> This is the "linked data landfill" mindset that I have kept on complaining about. You make it very easy for others to discard the idea of linked data as a whole.
> 
> Best
> 
> Martin
> 
>
Received on Friday, 22 October 2010 06:44:31 UTC