Re: AW: ANN: LOD Cloud - Statistics and compliance with best practices from Juan Sequeda on 2010-10-22 (semantic-web@w3.org from October 2010)

From: Juan Sequeda <juanfederico@gmail.com>
Date: Fri, 22 Oct 2010 08:37:23 -0500
To: Chris Bizer <chris@bizer.de>
Cc: Denny Vrandecic <denny.vrandecic@kit.edu>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, Kingsley Idehen <kidehen@openlinksw.com>, public-lod <public-lod@w3.org>, Enrico Motta <e.motta@open.ac.uk>, Thomas Steiner <tsteiner@google.com>, Semantic Web <semantic-web@w3.org>, Anja Jentzsch <anja@anjeve.de>, semanticweb <semanticweb@yahoogroups.com>, Giovanni Tummarello <giovanni.tummarello@deri.org>, "Mathieu d'Aquin" <m.daquin@open.ac.uk>
Message-ID: <AANLkTinOoRYJcuxhS2NRdLg7fJBFvaWxUiXkXXv1YO76@mail.gmail.com>
Denny,

I enjoyed reading your post!

On Fri, Oct 22, 2010 at 3:35 AM, Chris Bizer <chris@bizer.de> wrote:

> Hi Denny,
>
> thank you for your smart and insightful comments.
>
> > I also find it a shame, that this thread has been hijacked, especially since the
> > original topic was so interesting. The original email by Anja was not about the
> > LOD cloud, but rather about -- as the title of the thread still suggests -- the
> > compliance of LOD with some best practices. Instead of the question "is X in
> > the diagram", I would much rather see a discussion on "are the selected
> > quality criteria good criteria? why are some of them so little followed? how
> > can we improve the situation?"
>
> Absolutely. Opening up the discussion on these topics is exactly the reason
> why we compiled the statistics.
>
> In order to guide the discussion back to this topic, maybe it is useful to
> repost the original link:
>
> http://www4.wiwiss.fu-berlin.de/lodcloud/state/
>
> A quick initial comment concerning the term "quality criteria". I think it
> is essential to distinguish between:
>
> 1. The quality of the way data is published, meaning to which extent the
> publishers comply with best practices (a possible set of best practices is
> listed in the document)
> 2. The quality of the data itself. I think Enrico's comment was going into
> this direction.
>
> The Web of documents is an open system built on people agreeing on standards
> and best practices.
> Open system means in this context that everybody can publish content and
> that there are no restrictions on the quality of the content.
> This is in my opinion one of the central facts that made the Web successful.
>

+10000000000


> The same is true for the Web of Data. There obviously cannot be any
> restrictions on what people can/should publish (including, different
> opinions on a topic, but also including pure SPAM). As on the classic Web,
> it is a job of the information/data consumer to figure out which data it
> wants to believe and use (definition of information quality = usefulness of
> information, which is a subjective thing).
>

+10000000000

When people ask me, "but how trustworthy is the data?", I ask back, "and do you
trust everything you read on the web?" Search engines offer users a clean view
of the web.

Anybody can say anything about anything (right Kingsley :)) so publishing
"bad" data is acceptable. The day that we start finding spam in LOD will be
a great day!!!!!




> Thus it also does not make sense to discuss the "objective quality" of the
> data that should be included into the LOD cloud (objective quality just does
> not exist) and it makes much more sense to discuss the major issues that we
> are still having in regard to the compliance with publishing best practices.
>
> > Anja has pointed to a wealth of openly
> > available numbers (no pun intended), that have not been discussed at all. For
> > example, only 7.5% of the data sources provide a mapping of "proprietary
> > vocabulary terms" to "other vocabulary terms". For anyone building
> > applications to work with LOD, this is a real problem.
>
> Yes, this is also the figure that scared me most.
>
> > but in order to figure out what really needs to be done, and
> > what the criteria for good data on the Semantic Web need to look like, we
> > need to get back to Anja's original questions. I think that is a question we
> > may try to tackle in Shanghai in some form, I at least would find that an
> > interesting topic.
>
> Same with me.
> Shanghai was also the reason for the timing of the post.
>
> Cheers,
>
> Chris
>
> > -----Original Message-----
> > From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> > On Behalf Of Denny Vrandecic
> > Sent: Friday, 22 October 2010 08:44
> > To: Martin Hepp
> > Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas Steiner;
> > Semantic Web; Anja Jentzsch; semanticweb; Giovanni Tummarello; Mathieu d'Aquin
> > Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
> > practices
> >
> > I usually dislike to comment on such discussions, as I don't find them
> > particularly productive, but 1) since the number of people pointing me to
> > this thread is growing, 2) it contains some wrong statements, and 3) I feel
> > that this thread has been hijacked from a topic that I consider productive and
> > important, I hope you won't mind me giving a comment. I wanted to keep it
> > brief, but I failed.
> >
> > Let's start with the wrong statements:
> >
> > First, although I take responsibility as a co-creator for Linked Open Numbers,
> > I surely cannot take full credit for it. The dataset was a shared effort by a
> > number of people in Karlsruhe over a few days, and thus calling the whole
> > thing "Denny's numbers dataset" is simply wrong due to the effort spent by
> > my colleagues on it. It is fine to call it "Karlsruhe's numbers dataset" or simply
> > Linked Open Numbers, but providing me with the sole attribution is too
> > much of an honor.
> >
> > Second, although it is claimed that Linked Open Numbers are "by design and
> > known to everybody in the core community, not data but noise", being one
> > of the co-designers of the system I have to disagree. It is "noise by design".
> > One of my motivations for LON was to raise a few points for discussion, and
> > at the same time provide a dataset fully adhering to Linked Open Data
> > principles. We were obviously able to get the first goal right, and we didn't do
> > too badly on the second, even though we got an interesting list of bugs from
> > Richard Cyganiak, which, regrettably, we still have not fixed. I am very sorry
> > for that.
> > But, to make the point very clear again, this dataset was designed to follow
> > LOD principles as well as possible, to be correct, and to have an
> > implementation that is so simple that we are usually up, so anyone can use
> > LON as a testing ground. Due to a number of mails and personal
> > communications I know that LON has been used in that sense, and some
> > developers even found it useful for other features, like our provision of
> > number names in several languages. So, what is called "noise by design"
> > here, is actually an actively used dataset, that managed to raise, as we had
> > hoped, discussions about the point of counting triples, was a factor in the
> > discussion about literals as subjects, made us rethink the notion of
> > "semantics" and computational properties of RDF entities in a different way,
> > and is involved in the discussion about quality of LOD. With respect to that, in
> > my opinion, LON has achieved and exceeded its expectations, but I
> > understand anyone who disagrees. Besides that, it was, and is, huge fun.
> >
> > Now to some topics of the discussion:
> >
> > On the issue of the LOD cloud diagram. I want to express my gratitude to all
> > the people involved, for the effort they voluntarily put in its development
> > and maintenance. I find it especially great, that it is becoming increasingly
> > transparent how the diagram is created and how the datasets are selected.
> > Chris has referred to a set of conditions that are expected for inclusion, and
> > before the creation of the newest iteration there was an explicit call on this
> > mailing list to gather more information. I can only echo the sentiment that if
> > someone is unhappy with that diagram, they are free to create their own and
> > put it online. The data is available, the SVG is available and editable, and they
> > use licenses that allow the modification and republishing.
> >
> > Enrico is right that a system like Watson (or Sindice), that automatically
> > gathers datasets from the Web instead of using a manually submitted and
> > managed catalog, will probably turn out to be the better approach. Watson
> > used to have an overview with statistics on its current content, and I really
> > loved that overview, but this feature has been disabled for a few months.
> > If it were available, especially in any graphical format that can be easily reused
> > in slides -- for example, graphs on the growth of number of triples, datasets,
> > etc., graphs on the change of cohesion, vocabulary reuse, etc. over time,
> > within the Watson corpus -- I have no doubts that such graphs and data
> > would be widely reused, and would in many instances replace the current
> > usage of the cloud diagram. (I am furthermore curious about Enrico's
> > statement that the Semantic Web =/= Linked Open Data and wonder about
> > what he means here, but that is a completely different thread).
> >
> > Finally, to what I consider most important in this thread:
> >
> > I also find it a shame, that this thread has been hijacked, especially since the
> > original topic was so interesting. The original email by Anja was not about the
> > LOD cloud, but rather about -- as the title of the thread still suggests -- the
> > compliance of LOD with some best practices. Instead of the question "is X in
> > the diagram", I would much rather see a discussion on "are the selected
> > quality criteria good criteria? why are some of them so little followed? how
> > can we improve the situation?" Anja has pointed to a wealth of openly
> > available numbers (no pun intended), that have not been discussed at all. For
> > example, only 7.5% of the data sources provide a mapping of "proprietary
> > vocabulary terms" to "other vocabulary terms". For anyone building
> > applications to work with LOD, this is a real problem.
> >
> > Whenever I was working on actual applications using LOD, I got disillusioned.
> > The current state of LOD is simply insufficient to sustain serious application
> > development on top of it. Current best practices (like follow-your-nose) are
> > theoretically sufficient, but not fully practical. To just give a few examples:
> > * imagine you get an RDF file with some 100 triples, including some 120
> > vocabulary terms. In order to actually display those, you need the label for
> > every single one of these terms, preferably in the user's language. But most RDF
> > files do not provide such labels for terms they merely reference. In order to
> > actually display them, we need to resolve all these 120 terms, i.e. we need to
> > make more than a hundred calls to the Web -- and we are only talking about
> > the display of a single file! In Semantic MediaWiki we had, from the
> > beginning, made sure that all referenced terms are accompanied by some
> > minimum definition, providing labels, types, etc. which enables tools to at
> > least create a display quickly and then gather further data, but that practice
> > was not adopted. Never mind the fact that language labels are basically not
> > used for multi-linguality (check out Chapter 4 of my thesis for the data, it's
> > devastating).
> > * URIs. Perfectly valid URIs, e.g. those used in Geonames, like
> > http://sws.geonames.org/3202326/ suddenly cause trouble, because their
> > serialization as a QName is, well, problematic.
> > * missing definitions. E.g. DBpedia has the properties
> > http://dbpedia.org/ontology/capital and
> > http://dbpedia.org/property/capital -- used in the very same file about the
> > same country. Resolving them will not help you at all to figure out how they
> > relate to each other. As a human I may make an educated guess, but for a
> > machine agent? And in this case we are talking about the *same* data
> > provider, never mind cross-data-provider mapping.
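
To make the QName point in the list above concrete, here is a minimal sketch. It rests on one fact about XML: the local part of a QName must be an NCName, which cannot start with a digit or contain a slash. Everything else is an assumption for illustration: split_for_qname is a hypothetical helper, not part of any RDF library, and the NCName check is a simplified ASCII-only approximation.

```python
import re

# Simplified, ASCII-only approximation of the XML NCName production
# (the real production also admits many non-ASCII characters).
NCNAME_TO_END = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def split_for_qname(uri):
    """Return (namespace, local_name), or None if no valid split exists.

    Hypothetical helper: split points are tried from left to right, so the
    longest suffix that is a valid NCName becomes the local name.
    """
    for i in range(1, len(uri)):
        # match(uri, i) anchors at position i; \Z forces a match to the end.
        if NCNAME_TO_END.match(uri, i):
            return uri[:i], uri[i:]
    return None


# The DBpedia property splits cleanly into namespace + local name.
print(split_for_qname("http://dbpedia.org/ontology/capital"))

# The Geonames URI ends in "/", and the segment before it is all digits,
# so no suffix is a valid NCName and no QName can be formed.
print(split_for_qname("http://sws.geonames.org/3202326/"))
```

Under these assumptions the first call yields a usable namespace/local-name pair and the second yields None, which is the situation a serializer has to work around when it meets such a URI.
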
> >
> > I could go on for a while -- and these are just examples *on top* of the
> > problems that Anja raises in her original post, and I am sure that everyone
> > who has actually used LOD from the wild has stumbled upon even more such
> > problems. She is raising here a very important point, for the practical
> > application of the data. But instead of discussing these issues that actually
> > matter, we talk about bubble graphs, that are created and maintained
> > voluntarily, and why a dataset is included or not, even though the criteria
> > have been made transparent and explicit. All these issues seriously hamper
> > the uptake of usage of LOD and lead to the result that it is so much easier to
> > use dedicated, proprietary APIs in many cases.
> >
> > At one point it was stated that Chris' criteria were random and hard to fulfill
> > in certain cases. If you'd ask me, I would suggest much more draconian
> > criteria, in order to make data reuse as simple as we all envision. I really enjoy
> > the work of the pedantic web group with respect to this, providing validators
> > and guidelines, but in order to figure out what really needs to be done, and
> > what the criteria for good data on the Semantic Web need to look like, we
> > need to get back to Anja's original questions. I think that is a question we
> > may try to tackle in Shanghai in some form, I at least would find that an
> > interesting topic.
> >
> > Sorry again for the length of this rant, and I hope I have offended everyone
> > equally, I really tried not to single anyone out,
> > Denny
> >
> > P.S.: Finally, a major reason why I think I shouldn't have commented on this
> > thread is because it involves something I co-created, and thus I am afraid it
> > is impossible to stay unbiased. I consider constant advertising of your own
> > ideas tiring, impolite, and bound to lead to unproductive discussions due to
> > emotional investment. If the work you do is good enough, you will find
> > champions for it. If not, improve it or do something else.
> >
> >
> >
> > On Oct 21, 2010, at 20:56, Martin Hepp wrote:
> >
> > > Hi all:
> > >
> > > I think that Enrico really made two very important points:
> > >
> > > 1. The LOD bubbles diagram has very high visibility inside and outside of the
> > > community (up to the point that broad audiences believe the diagram would
> > > define relevance or quality).
> > >
> > > 2. Its creators have a special responsibility (in particular as scientists) to
> > > maintain the diagram in a way that enhances insight and understanding,
> > > rather than conveying false facts and confusing people.
> > >
> > > So Kingsley's argument that anybody could provide a better diagram does
> > > not really hold. It will harm the community as a whole, sooner or later, if the
> > > diagram misses the point, simply based on the popularity of this diagram.
> > >
> > > And to be frank, despite other design decisions, it is really ridiculous that
> > > Chris justifies the inclusion of Denny's numbers dataset as valid Linked Data,
> > > because that dataset is, by design and known to everybody in the core
> > > community, not data but noise.
> > >
> > > This is the "linked data landfill" mindset that I have kept on complaining
> > > about. You make it very easy for others to discard the idea of linked data as a
> > > whole.
> > >
> > > Best
> > >
> > > Martin
> > >
> > >
>
>
>
>
Received on Friday, 22 October 2010 13:43:19 UTC