AW: AW: ANN: LOD Cloud - Statistics and compliance with best practices from Chris Bizer on 2010-10-22 (semantic-web@w3.org from October 2010)

From: Chris Bizer <chris@bizer.de>
Date: Fri, 22 Oct 2010 10:35:35 +0200
To: "'Denny Vrandecic'" <denny.vrandecic@kit.edu>, "'Martin Hepp'" <martin.hepp@ebusiness-unibw.org>
Cc: "'Kingsley Idehen'" <kidehen@openlinksw.com>, "'public-lod'" <public-lod@w3.org>, "'Enrico Motta'" <e.motta@open.ac.uk>, "'Thomas Steiner'" <tsteiner@google.com>, "'Semantic Web'" <semantic-web@w3.org>, "'Anja Jentzsch'" <anja@anjeve.de>, "'semanticweb'" <semanticweb@yahoogroups.com>, "'Giovanni Tummarello'" <giovanni.tummarello@deri.org>, "'Mathieu d'Aquin'" <m.daquin@open.ac.uk>
Message-ID: <009901cb71c4$18dd0de0$4a9729a0$@bizer.de>

Hi Denny,

thank you for your smart and insightful comments.

> I also find it a shame, that this thread has been hijacked, especially since
> the original topic was so interesting. The original email by Anja was not
> about the LOD cloud, but rather about -- as the title of the thread still
> suggests -- the compliance of LOD with some best practices. Instead of the
> question "is X in the diagram", I would much rather see a discussion on "are
> the selected quality criteria good criteria? why are some of them so little
> followed? how can we improve the situation?"

Absolutely. Opening up the discussion on these topics is exactly the reason
why we compiled the statistics.

In order to guide the discussion back to this topic, maybe it is useful to
repost the original link:

http://www4.wiwiss.fu-berlin.de/lodcloud/state/

A quick initial comment concerning the term "quality criteria". I think it
is essential to distinguish between:

1. The quality of the way data is published, meaning to what extent the
publishers comply with best practices (a possible set of best practices is
listed in the document)
2. The quality of the data itself. I think Enrico's comment was going into
this direction.

The Web of documents is an open system built on people agreeing on standards
and best practices.
Open system means in this context that everybody can publish content and
that there are no restrictions on the quality of the content.
This is in my opinion one of the central facts that made the Web successful.

The same is true for the Web of Data. There obviously cannot be any
restrictions on what people can/should publish (including different
opinions on a topic, but also including pure SPAM). As on the classic Web,
it is the job of the information/data consumer to figure out which data it
wants to believe and use (definition of information quality = usefulness of
information, which is a subjective thing).

Thus it also does not make sense to discuss the "objective quality" of the
data that should be included in the LOD cloud (objective quality just does
not exist) and it makes much more sense to discuss the major issues that we
are still having in regard to the compliance with publishing best practices.

> Anja has pointed to a wealth of openly available numbers (no pun intended),
> that have not been discussed at all. For example, only 7.5% of the data
> source provide a mapping of "proprietary vocabulary terms" to "other
> vocabulary terms". For anyone building applications to work with LOD, this
> is a real problem.

Yes, this is also the figure that scared me most.

> but in order to figure out what really needs to be done, and how the
> criteria for good data on the Semantic Web need to look like, we need to get
> back to Anja's original questions. I think that is a question we may try to
> tackle in Shanghai in some form, I at least would find that an interesting
> topic.

Same with me. 
Shanghai was also the reason for the timing of the post.

Cheers,

Chris

> -----Original Message-----
> From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> On Behalf Of Denny Vrandecic
> Sent: Friday, 22 October 2010 08:44
> To: Martin Hepp
> Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas Steiner;
> Semantic Web; Anja Jentzsch; semanticweb; Giovanni Tummarello; Mathieu
> d'Aquin
> Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
> practices
> 
> I usually dislike to comment on such discussions, as I don't find them
> particularly productive, but 1) since the number of people pointing me to
> this thread is growing, 2) it contains some wrong statements, and 3) I feel
> that this thread has been hijacked from a topic that I consider productive
> and important, I hope you won't mind me giving a comment. I wanted to keep
> it brief, but I failed.
> 
> Let's start with the wrong statements:
> 
> First, although I take responsibility as a co-creator for Linked Open
> Numbers, I surely cannot take full credit for it. The dataset was a shared
> effort by a number of people in Karlsruhe over a few days, and thus calling
> the whole thing "Denny's numbers dataset" is simply wrong due to the effort
> spent by my colleagues on it. It is fine to call it "Karlsruhe's numbers
> dataset" or simply Linked Open Numbers, but providing me with the sole
> attribution is too much of an honor.
> 
> Second, although it is claimed that Linked Open Numbers are "by design and
> known to everybody in the core community, not data but noise", being one of
> the co-designers of the system I have to disagree. It is "noise by design".
> One of my motivations for LON was to raise a few points for discussion, and
> at the same time provide a dataset fully adhering to Linked Open Data
> principles. We were obviously able to get the first goal right, and we
> didn't do too badly on the second, even though we got an interesting list of
> bugs by Richard Cyganiak, which, regrettably, we still did not fix. I am
> very sorry for that. But, to make the point very clear again, this dataset
> was designed to follow LOD principles as well as possible, to be correct,
> and to have an implementation that is so simple that we are usually up, so
> anyone can use LON as a testing ground. Due to a number of mails and
> personal communications I know that LON has been used in that sense, and
> some developers even found it useful for other features, like our provision
> of number names in several languages. So, what is called "noise by design"
> here, is actually an actively used dataset, that managed to raise, as we
> have hoped, discussions about the point of counting triples, was a factor in
> the discussion about literals as subjects, made us rethink the notion of
> "semantics" and computational properties of RDF entities in a different way,
> and is involved in the discussion about quality of LOD. With respect to
> that, in my opinion, LON has achieved and exceeded its expectations, but I
> understand anyone who disagrees. Besides that, it was, and is, huge fun.
> 
> Now to some topics of the discussion:
> 
> On the issue of the LOD cloud diagram. I want to express my gratitude to all
> the people involved, for the effort they voluntarily put in its development
> and maintenance. I find it especially great, that it is becoming
> increasingly transparent how the diagram is created and how the datasets are
> selected. Chris has referred to a set of conditions that are expected for
> inclusion, and before the creation of the newest iteration there was an
> explicit call on this mailing list to gather more information. I can only
> echo the sentiment that if someone is unhappy with that diagram, they are
> free to create their own and put it online. The data is available, the SVG
> is available and editable, and they use licenses that allow the modification
> and republishing.
> 
> Enrico is right that a system like Watson (or Sindice), that automatically
> gathers datasets from the Web instead of using a manually submitted and
> managed catalog, will probably turn out to be the better approach. Watson
> used to have an overview with statistics on its current content, and I
> really loved that overview, but this feature has been disabled for a few
> months. If it was available, especially in any graphical format that can be
> easily reused in slides -- for example, graphs on the growth of number of
> triples, datasets, etc., graphs on the change of cohesion, vocabulary reuse,
> etc. over time, within the Watson corpus -- I have no doubts that such
> graphs and data would be widely reused, and would in many instances replace
> the current usage of the cloud diagram. (I am furthermore curious about
> Enrico's statement that the Semantic Web =/= Linked Open Data and wonder
> about what he means here, but that is a completely different thread).
> 
> Finally, to what I consider most important in this thread:
> 
> I also find it a shame, that this thread has been hijacked, especially since
> the original topic was so interesting. The original email by Anja was not
> about the LOD cloud, but rather about -- as the title of the thread still
> suggests -- the compliance of LOD with some best practices. Instead of the
> question "is X in the diagram", I would much rather see a discussion on "are
> the selected quality criteria good criteria? why are some of them so little
> followed? how can we improve the situation?" Anja has pointed to a wealth of
> openly available numbers (no pun intended), that have not been discussed at
> all. For example, only 7.5% of the data source provide a mapping of
> "proprietary vocabulary terms" to "other vocabulary terms". For anyone
> building applications to work with LOD, this is a real problem.
> 
> Whenever I was working on actual applications using LOD, I got
> disillusioned. The current state of LOD is simply insufficient to sustain
> serious application development on top of it. Current best practices (like
> follow-your-nose) are theoretically sufficient, but not fully practical. To
> just give a few examples:
> * imagine you get an RDF file with some 100 triples, including some 120
> vocabulary terms. In order to actually display those, you need the label for
> every single one of these terms, preferably in the user's language. But most
> RDF files do not provide such labels for terms they merely reference. In
> order to actually display them, we need to resolve all these 120 terms, i.e.
> we need to make more than a hundred calls to the Web -- and we are only
> talking about the display of a single file! In Semantic MediaWiki we had,
> from the beginning, made sure that all referenced terms are accompanied with
> some minimum definition, providing labels, types, etc. which enables tools
> to at least create a display quickly and then gather further data, but that
> practice was not adopted. Never mind the fact that language labels are
> basically not used for multi-linguality (check out Chapter 4 of my thesis
> for the data, it's devastating).
> * URIs. Perfectly valid URIs like, e.g. used in Geonames, like
> http://sws.geonames.org/3202326/ suddenly cause trouble, because their
> serialization as a QName is, well, problematic.
> * missing definitions. E.g. DBpedia has the properties
> http://dbpedia.org/ontology/capital and
> http://dbpedia.org/property/capital -- used in the very same file about the
> same country. Resolving them will not help you at all to figure out how they
> relate to each other. As a human I may make an educated guess, but for a
> machine agent? And in this case we are talking about the *same* data
> provider, never mind cross-data-provider mapping.
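
The QName problem mentioned for the Geonames URI can be illustrated with a minimal Python sketch. It assumes a simplified, ASCII-only approximation of the XML NCName rule (a local name must start with a letter or underscore and cannot contain a slash); the real grammar allows many more characters. The function name split_qname is hypothetical and is not taken from any RDF library.

```python
import re

# Simplified, ASCII-only approximation of an XML NCName anchored at the end
# of the string (an assumption: the real grammar also permits many
# non-ASCII characters).
NCNAME_SUFFIX = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*$")


def split_qname(uri):
    """Split a URI into (namespace, local_name) for QName serialization.

    Returns None when no suffix of the URI is a valid local name, which is
    the situation in which a serializer has no usable QName.
    """
    match = NCNAME_SUFFIX.search(uri)
    if match is None or match.start() == 0:
        return None
    return uri[:match.start()], match.group(0)


if __name__ == "__main__":
    for uri in (
        "http://dbpedia.org/ontology/capital",
        "http://sws.geonames.org/3202326/",
    ):
        print(uri, "->", split_qname(uri))
```

Under these assumptions the DBpedia property splits into the namespace http://dbpedia.org/ontology/ and the local name capital, while the Geonames URI yields no split at all: it ends in a slash, and the digits before it could not start a local name anyway.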
> 
> I could go on for a while -- and these are just examples *on top* of the
> problems that Anja raises in her original post, and I am sure that everyone
> who has actually used LOD from the wild has stumbled upon even more such
> problems. She is raising here a very important point, for the practical
> application of the data. But instead of discussing these issues that
> actually matter, we talk about bubble graphs, that are created and
> maintained voluntarily, and why a dataset is included or not, even though
> the criteria have been made transparent and explicit. All these issues
> seriously hamper the uptake of usage of LOD and lead to the result that it
> is so much easier to use dedicated, proprietary APIs in many cases.
> 
> At one point it was stated that Chris' criteria were random and hard to
> fulfill in certain cases. If you'd ask me, I would suggest much more
> draconian criteria, in order to make data reuse as simple as we all
> envision. I really enjoy the work of the pedantic web group with respect to
> this, providing validators and guidelines, but in order to figure out what
> really needs to be done, and how the criteria for good data on the Semantic
> Web need to look like, we need to get back to Anja's original questions. I
> think that is a question we may try to tackle in Shanghai in some form, I at
> least would find that an interesting topic.
> 
> Sorry again for the length of this rant, and I hope I have offended everyone
> equally, I really tried not to single anyone out,
> Denny
> 
> P.S.: Finally, a major reason why I think I shouldn't have commented on this
> thread is because it involves something I co-created, and thus I am afraid
> it is impossible to stay unbiased. I consider constant advertising of your
> own ideas tiring, impolite, and bound to lead to unproductive discussions
> due to emotional investment. If the work you do is good enough, you will
> find champions for it. If not, improve it or do something else.
> 
> 
> 
> On Oct 21, 2010, at 20:56, Martin Hepp wrote:
> 
> > Hi all:
> >
> > I think that Enrico really made two very important points:
> >
> > 1. The LOD bubbles diagram has very high visibility inside and outside of
> > the community (up to the point that broad audiences believe the diagram
> > would define relevance or quality).
> >
> > 2. Its creators have a special responsibility (in particular as
> > scientists) to maintain the diagram in a way that enhances insight and
> > understanding, rather than conveying false facts and confusing people.
> >
> > So Kingsley's argument that anybody could provide a better diagram does
> > not really hold. It will harm the community as a whole, sooner or later,
> > if the diagram misses the point, simply based on the popularity of this
> > diagram.
> >
> > And to be frank, despite other design decisions, it is really ridiculous
> > that Chris justifies the inclusion of Denny's numbers dataset as valid
> > Linked Data, because that dataset is, by design and known to everybody in
> > the core community, not data but noise.
> >
> > This is the "linked data landfill" mindset that I have kept on complaining
> > about. You make it very easy for others to discard the idea of linked data
> > as a whole.
> >
> > Best
> >
> > Martin
> >
> >
Received on Friday, 22 October 2010 08:33:31 UTC