- From: Juan Sequeda <juanfederico@gmail.com>
- Date: Fri, 22 Oct 2010 08:37:23 -0500
- To: Chris Bizer <chris@bizer.de>
- Cc: Denny Vrandecic <denny.vrandecic@kit.edu>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, Kingsley Idehen <kidehen@openlinksw.com>, public-lod <public-lod@w3.org>, Enrico Motta <e.motta@open.ac.uk>, Thomas Steiner <tsteiner@google.com>, Semantic Web <semantic-web@w3.org>, Anja Jentzsch <anja@anjeve.de>, semanticweb <semanticweb@yahoogroups.com>, Giovanni Tummarello <giovanni.tummarello@deri.org>, "Mathieu d'Aquin" <m.daquin@open.ac.uk>
- Message-ID: <AANLkTinOoRYJcuxhS2NRdLg7fJBFvaWxUiXkXXv1YO76@mail.gmail.com>
Denny, I enjoyed reading your post!

On Fri, Oct 22, 2010 at 3:35 AM, Chris Bizer <chris@bizer.de> wrote:

> Hi Denny,
>
> thank you for your smart and insightful comments.
>
> > I also find it a shame, that this thread has been hijacked, especially
> > since the original topic was so interesting. The original email by Anja
> > was not about the LOD cloud, but rather about -- as the title of the
> > thread still suggests -- the compliance of LOD with some best practices.
> > Instead of the question "is X in the diagram", I would much rather see a
> > discussion on "are the selected quality criteria good criteria? why are
> > some of them so little followed? how can we improve the situation?"
>
> Absolutely. Opening up the discussion on these topics is exactly the reason
> why we compiled the statistics.
>
> In order to guide the discussion back to this topic, maybe it is useful to
> repost the original link:
>
> http://www4.wiwiss.fu-berlin.de/lodcloud/state/
>
> A quick initial comment concerning the term "quality criteria". I think it
> is essential to distinguish between:
>
> 1. The quality of the way data is published, meaning to what extent the
>    publishers comply with best practices (a possible set of best practices
>    is listed in the document)
> 2. The quality of the data itself. I think Enrico's comment was going in
>    this direction.
>
> The Web of documents is an open system built on people agreeing on
> standards and best practices.
> Open system means in this context that everybody can publish content and
> that there are no restrictions on the quality of the content.
> This is in my opinion one of the central facts that made the Web
> successful.

+10000000000

> The same is true for the Web of Data. There obviously cannot be any
> restrictions on what people can/should publish (including different
> opinions on a topic, but also including pure SPAM).
> As on the classic Web, it is the job of the information/data consumer to
> figure out which data it wants to believe and use (definition of
> information quality = usefulness of information, which is a subjective
> thing).

+10000000000

When people ask me, "but how trustworthy is the data?" I ask back, "and you
trust everything you read on the web?". Search engines offer a clean view of
the web to a user. Anybody can say anything about anything (right Kingsley :))
so publishing "bad" data is acceptable. The day that we start finding spam in
LOD will be a great day!!!!!

> Thus it also does not make sense to discuss the "objective quality" of the
> data that should be included into the LOD cloud (objective quality just
> does not exist) and it makes much more sense to discuss the major issues
> that we are still having in regard to the compliance with publishing best
> practices.
>
> > Anja has pointed to a wealth of openly available numbers (no pun
> > intended), that have not been discussed at all. For example, only 7.5%
> > of the data sources provide a mapping of "proprietary vocabulary terms"
> > to "other vocabulary terms". For anyone building applications to work
> > with LOD, this is a real problem.
>
> Yes, this is also the figure that scared me most.
>
> > but in order to figure out what really needs to be done, and what the
> > criteria for good data on the Semantic Web need to look like, we need to
> > get back to Anja's original questions. I think that is a question we may
> > try to tackle in Shanghai in some form, I at least would find that an
> > interesting topic.
>
> Same with me.
> Shanghai was also the reason for the timing of the post.
>
> Cheers,
>
> Chris
>
> > -----Original Message-----
> > From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> > On Behalf Of Denny Vrandecic
> > Sent: Friday, 22 October 2010 08:44
> > To: Martin Hepp
> > Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas
> > Steiner; Semantic Web; Anja Jentzsch; semanticweb; Giovanni Tummarello;
> > Mathieu d'Aquin
> > Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
> > practices
> >
> > I usually dislike to comment on such discussions, as I don't find them
> > particularly productive, but 1) since the number of people pointing me
> > to this thread is growing, 2) it contains some wrong statements, and 3)
> > I feel that this thread has been hijacked from a topic that I consider
> > productive and important, I hope you won't mind me giving a comment. I
> > wanted to keep it brief, but I failed.
> >
> > Let's start with the wrong statements:
> >
> > First, although I take responsibility as a co-creator for Linked Open
> > Numbers, I surely cannot take full credit for it. The dataset was a
> > shared effort by a number of people in Karlsruhe over a few days, and
> > thus calling the whole thing "Denny's numbers dataset" is simply wrong
> > due to the effort spent by my colleagues on it. It is fine to call it
> > "Karlsruhe's numbers dataset" or simply Linked Open Numbers, but
> > providing me with the sole attribution is too much of an honor.
> >
> > Second, although it is claimed that Linked Open Numbers are "by design
> > and known to everybody in the core community, not data but noise",
> > being one of the co-designers of the system I have to disagree. It is
> > "noise by design". One of my motivations for LON was to raise a few
> > points for discussion, and at the same time provide a dataset fully
> > adhering to Linked Open Data principles. We were obviously able to get
> > the first goal right, and we didn't do too badly on the second, even
> > though we got an interesting list of bugs from Richard Cyganiak, which,
> > regrettably, we still have not fixed. I am very sorry for that.
> > But, to make the point very clear again, this dataset was designed to
> > follow LOD principles as well as possible, to be correct, and to have
> > an implementation that is so simple that we are usually up, so anyone
> > can use LON as a testing ground. Due to a number of mails and personal
> > communications I know that LON has been used in that sense, and some
> > developers even found it useful for other features, like our provision
> > of number names in several languages. So, what is called "noise by
> > design" here, is actually an actively used dataset, that managed to
> > raise, as we had hoped, discussions about the point of counting
> > triples, was a factor in the discussion about literals as subjects,
> > made us rethink the notion of "semantics" and computational properties
> > of RDF entities in a different way, and is involved in the discussion
> > about quality of LOD. With respect to that, in my opinion, LON has
> > achieved and exceeded its expectations, but I understand anyone who
> > disagrees. Besides that, it was, and is, huge fun.
> >
> > Now to some topics of the discussion:
> >
> > On the issue of the LOD cloud diagram. I want to express my gratitude
> > to all the people involved, for the effort they voluntarily put into
> > its development and maintenance. I find it especially great, that it is
> > becoming increasingly transparent how the diagram is created and how
> > the datasets are selected. Chris has referred to a set of conditions
> > that are expected for inclusion, and before the creation of the newest
> > iteration there was an explicit call on this mailing list to gather
> > more information. I can only echo the sentiment that if someone is
> > unhappy with that diagram, they are free to create their own and put it
> > online. The data is available, the SVG is available and editable, and
> > they use licenses that allow the modification and republishing.
> >
> > Enrico is right that a system like Watson (or Sindice), that
> > automatically gathers datasets from the Web instead of using a manually
> > submitted and managed catalog, will probably turn out to be the better
> > approach. Watson used to have an overview with statistics on its
> > current content, and I really loved that overview, but this feature has
> > been disabled for a few months. If it was available, especially in any
> > graphical format that can be easily reused in slides -- for example,
> > graphs on the growth of number of triples, datasets, etc., graphs on
> > the change of cohesion, vocabulary reuse, etc. over time, within the
> > Watson corpus -- I have no doubts that such graphs and data would be
> > widely reused, and would in many instances replace the current usage of
> > the cloud diagram. (I am furthermore curious about Enrico's statement
> > that the Semantic Web =/= Linked Open Data and wonder about what he
> > means here, but that is a completely different thread).
> >
> > Finally, to what I consider most important in this thread:
> >
> > I also find it a shame, that this thread has been hijacked, especially
> > since the original topic was so interesting. The original email by Anja
> > was not about the LOD cloud, but rather about -- as the title of the
> > thread still suggests -- the compliance of LOD with some best
> > practices. Instead of the question "is X in the diagram", I would much
> > rather see a discussion on "are the selected quality criteria good
> > criteria? why are some of them so little followed? how can we improve
> > the situation?" Anja has pointed to a wealth of openly available
> > numbers (no pun intended), that have not been discussed at all. For
> > example, only 7.5% of the data sources provide a mapping of
> > "proprietary vocabulary terms" to "other vocabulary terms". For anyone
> > building applications to work with LOD, this is a real problem.
> >
> > Whenever I was working on actual applications using LOD, I got
> > disillusioned. The current state of LOD is simply insufficient to
> > sustain serious application development on top of it. Current best
> > practices (like follow-your-nose) are theoretically sufficient, but not
> > fully practical. To just give a few examples:
> >
> > * imagine you get an RDF file with some 100 triples, including some 120
> >   vocabulary terms. In order to actually display those, you need the
> >   label for every single one of these terms, preferably in the user's
> >   language. But most RDF files do not provide such labels for terms
> >   they merely reference. In order to actually display them, we need to
> >   resolve all these 120 terms, i.e. we need to make more than a hundred
> >   calls to the Web -- and we are only talking about the display of a
> >   single file! In Semantic MediaWiki we had, from the beginning, made
> >   sure that all referenced terms are accompanied with some minimum
> >   definition, providing labels, types, etc. which enables tools to at
> >   least create a display quickly and then gather further data, but that
> >   practice was not adopted. Never mind the fact that language labels
> >   are basically not used for multi-linguality (check out Chapter 4 of
> >   my thesis for the data, it's devastating).
> > * URIs. Perfectly valid URIs like, e.g. used in Geonames, like
> >   http://sws.geonames.org/3202326/ suddenly cause trouble, because
> >   their serialization as a QName is, well, problematic.
> > * missing definitions. E.g. DBpedia has the properties
> >   http://dbpedia.org/ontology/capital and
> >   http://dbpedia.org/property/capital -- used in the very same file
> >   about the same country. Resolving them will not help you at all to
> >   figure out how they relate to each other. As a human I may make an
> >   educated guess, but for a machine agent? And in this case we are
> >   talking about the *same* data provider, never mind
> >   cross-data-provider mapping.
> >
> > I could go on for a while -- and these are just examples *on top* of
> > the problems that Anja raises in her original post, and I am sure that
> > everyone who has actually used LOD from the wild has stumbled upon even
> > more such problems. She is raising here a very important point, for the
> > practical application of the data. But instead of discussing these
> > issues that actually matter, we talk about bubble graphs, that are
> > created and maintained voluntarily, and why a dataset is included or
> > not, even though the criteria have been made transparent and explicit.
> > All these issues seriously hamper the uptake of usage of LOD and lead
> > to the result that it is so much easier to use dedicated, proprietary
> > APIs in many cases.
> >
> > At one point it was stated that Chris' criteria were random and hard to
> > fulfill in certain cases. If you'd ask me, I would suggest much more
> > draconian criteria, in order to make data reuse as simple as we all
> > envision. I really enjoy the work of the pedantic web group with
> > respect to this, providing validators and guidelines, but in order to
> > figure out what really needs to be done, and what the criteria for good
> > data on the Semantic Web need to look like, we need to get back to
> > Anja's original questions. I think that is a question we may try to
> > tackle in Shanghai in some form, I at least would find that an
> > interesting topic.
> >
> > Sorry again for the length of this rant, and I hope I have offended
> > everyone equally, I really tried not to single anyone out,
> > Denny
> >
> > P.S.: Finally, a major reason why I think I shouldn't have commented on
> > this thread is because it involves something I co-created, and thus I
> > am afraid it is impossible to stay unbiased.
> > I consider constant advertising of your own ideas tiring, impolite, and
> > bound to lead to unproductive discussions due to emotional investment.
> > If the work you do is good enough, you will find champions for it. If
> > not, improve it or do something else.
> >
> > On Oct 21, 2010, at 20:56, Martin Hepp wrote:
> >
> > > Hi all:
> > >
> > > I think that Enrico really made two very important points:
> > >
> > > 1. The LOD bubbles diagram has very high visibility inside and
> > > outside of the community (up to the point that broad audiences
> > > believe the diagram would define relevance or quality).
> > >
> > > 2. Its creators have a special responsibility (in particular as
> > > scientists) to maintain the diagram in a way that enhances insight
> > > and understanding, rather than conveying false facts and confusing
> > > people.
> > >
> > > So Kingsley's argument that anybody could provide a better diagram
> > > does not really hold. It will harm the community as a whole, sooner
> > > or later, if the diagram misses the point, simply based on the
> > > popularity of this diagram.
> > >
> > > And to be frank, despite other design decisions, it is really
> > > ridiculous that Chris justifies the inclusion of Denny's numbers
> > > dataset as valid Linked Data, because that dataset is, by design and
> > > known to everybody in the core community, not data but noise.
> > >
> > > This is the "linked data landfill" mindset that I have kept on
> > > complaining about. You make it very easy for others to discard the
> > > idea of linked data as a whole.
> > >
> > > Best
> > >
> > > Martin
Received on Friday, 22 October 2010 13:43:19 UTC