Re: [semanticweb] AW: AW: ANN: LOD Cloud - Statistics and compliance with best practices from Kingsley Idehen on 2010-10-22 (semantic-web@w3.org from October 2010)

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Fri, 22 Oct 2010 08:19:24 -0400
To: john.nj.davies@bt.com
CC: public-lod@w3.org, semantic-web@w3.org, semanticweb@yahoogroups.com
Message-ID: <4CC1814C.6090608@openlinksw.com>
On 10/22/10 6:48 AM, john.nj.davies@bt.com wrote:
>
> This article from the NYT may provide an amusing distraction from the 
> current discussion: I thought the powerpoint slide shown looked eerily 
> familiar ;-)
>
> http://www.nytimes.com/2010/04/27/world/27powerpoint.html?_r=1
>

LOL!

How poignant, really :-)


Kingsley
>
> John
>
> PS excellent post Denny IMHO
>
> *Dr John Davies*
> Chief Researcher
> Future Business Applications & Services
> BT Innovate & Design
> __________________________________________________
> Tel:    +44 1473 609583
> Email:    john.nj.davies@bt.com
>
>
> *From:*semanticweb@yahoogroups.com 
> [mailto:semanticweb@yahoogroups.com] *On Behalf Of *Chris Bizer
> *Sent:* 22 October 2010 09:36
> *To:* 'Denny Vrandecic'; 'Martin Hepp'
> *Cc:* 'Kingsley Idehen'; 'public-lod'; 'Enrico Motta'; 'Thomas 
> Steiner'; 'Semantic Web'; 'Anja Jentzsch'; 'semanticweb'; 'Giovanni 
> Tummarello'; 'Mathieu d'Aquin'
> *Subject:* [semanticweb] AW: AW: ANN: LOD Cloud - Statistics and 
> compliance with best practices
>
> Hi Denny,
>
> thank you for your smart and insightful comments.
>
> > I also find it a shame, that this thread has been hijacked, especially
> > since the original topic was so interesting. The original email by Anja
> > was not about the LOD cloud, but rather about -- as the title of the
> > thread still suggests -- the compliance of LOD with some best practices.
> > Instead of the question "is X in the diagram", I would much rather see a
> > discussion on "are the selected quality criteria good criteria? why are
> > some of them so little followed? how can we improve the situation?"
>
> Absolutely. Opening up the discussion on these topics is exactly the
> reason why we compiled the statistics.
>
> In order to guide the discussion back to this topic, maybe it is useful to
> repost the original link:
>
> http://www4.wiwiss.fu-berlin.de/lodcloud/state/
>
> A quick initial comment concerning the term "quality criteria". I think it
> is essential to distinguish between:
>
> 1. The quality of the way data is published, meaning to what extent the
> publishers comply with best practices (a possible set of best practices is
> listed in the document)
> 2. The quality of the data itself. I think Enrico's comment was going in
> this direction.
>
> The Web of documents is an open system built on people agreeing on
> standards and best practices.
> Open system means in this context that everybody can publish content and
> that there are no restrictions on the quality of the content.
> This is in my opinion one of the central facts that made the Web
> successful.
>
> The same is true for the Web of Data. There obviously cannot be any
> restrictions on what people can/should publish (including different
> opinions on a topic, but also including pure SPAM). As on the classic Web,
> it is the job of the information/data consumer to figure out which data it
> wants to believe and use (definition of information quality = usefulness
> of information, which is a subjective thing).
>
> Thus it also does not make sense to discuss the "objective quality" of the
> data that should be included in the LOD cloud (objective quality just does
> not exist) and it makes much more sense to discuss the major issues that
> we are still having in regard to the compliance with publishing best
> practices.
>
> > Anja has pointed to a wealth of openly
> > available numbers (no pun intended), that have not been discussed at all.
> > For example, only 7.5% of the data sources provide a mapping of
> > "proprietary vocabulary terms" to "other vocabulary terms". For anyone
> > building applications to work with LOD, this is a real problem.
>
> Yes, this is also the figure that scared me most.
>
> > but in order to figure out what really needs to be done, and
> > what the criteria for good data on the Semantic Web need to look like, we
> > need to get back to Anja's original questions. I think that is a question
> > we may try to tackle in Shanghai in some form, I at least would find that
> > an interesting topic.
>
> Same with me.
> Shanghai was also the reason for the timing of the post.
>
> Cheers,
>
> Chris
>
> > -----Original Message-----
> > From: semantic-web-request@w3.org [mailto:semantic-web-request@w3.org]
> > On Behalf Of Denny Vrandecic
> > Sent: Friday, 22 October 2010 08:44
> > To: Martin Hepp
> > Cc: Kingsley Idehen; public-lod; Enrico Motta; Chris Bizer; Thomas
> > Steiner; Semantic Web; Anja Jentzsch; semanticweb; Giovanni Tummarello;
> > Mathieu d'Aquin
> > Subject: Re: AW: ANN: LOD Cloud - Statistics and compliance with best
> > practices
> >
> > I usually dislike to comment on such discussions, as I don't find them
> > particularly productive, but 1) since the number of people pointing me to
> > this thread is growing, 2) it contains some wrong statements, and 3) I
> > feel that this thread has been hijacked from a topic that I consider
> > productive and important, I hope you won't mind me giving a comment. I
> > wanted to keep it brief, but I failed.
> >
> > Let's start with the wrong statements:
> >
> > First, although I take responsibility as a co-creator for Linked Open
> > Numbers, I surely cannot take full credit for it. The dataset was a shared
> > effort by a number of people in Karlsruhe over a few days, and thus
> > calling the whole thing "Denny's numbers dataset" is simply wrong due to
> > the effort spent by my colleagues on it. It is fine to call it
> > "Karlsruhe's numbers dataset" or simply Linked Open Numbers, but providing
> > me with the sole attribution is too much of an honor.
> >
> > Second, although it is claimed that Linked Open Numbers are "by design and
> > known to everybody in the core community, not data but noise", being one
> > of the co-designers of the system I have to disagree. It is "noise by
> > design". One of my motivations for LON was to raise a few points for
> > discussion, and at the same time provide a dataset fully adhering to
> > Linked Open Data principles. We were obviously able to get the first goal
> > right, and we didn't do too badly on the second, even though we got an
> > interesting list of bugs from Richard Cyganiak, which, regrettably, we
> > still have not fixed. I am very sorry for that.
> > But, to make the point very clear again, this dataset was designed to
> > follow LOD principles as well as possible, to be correct, and to have an
> > implementation that is so simple that we are usually up, so anyone can use
> > LON as a testing ground. Due to a number of mails and personal
> > communications I know that LON has been used in that sense, and some
> > developers even found it useful for other features, like our provision of
> > number names in several languages. So, what is called "noise by design"
> > here, is actually an actively used dataset, that managed to raise, as we
> > had hoped, discussions about the point of counting triples, was a factor
> > in the discussion about literals as subjects, made us rethink the notion
> > of "semantics" and computational properties of RDF entities in a different
> > way, and is involved in the discussion about quality of LOD. With respect
> > to that, in my opinion, LON has achieved and exceeded its expectations,
> > but I understand anyone who disagrees. Besides that, it was, and is, huge
> > fun.
> >
> > Now to some topics of the discussion:
> >
> > On the issue of the LOD cloud diagram. I want to express my gratitude to
> > all the people involved, for the effort they voluntarily put in its
> > development and maintenance. I find it especially great, that it is
> > becoming increasingly transparent how the diagram is created and how the
> > datasets are selected. Chris has referred to a set of conditions that are
> > expected for inclusion, and before the creation of the newest iteration
> > there was an explicit call on this mailing list to gather more
> > information. I can only echo the sentiment that if someone is unhappy with
> > that diagram, they are free to create their own and put it online. The
> > data is available, the SVG is available and editable, and they use
> > licenses that allow the modification and republishing.
> >
> > Enrico is right that a system like Watson (or Sindice), that automatically
> > gathers datasets from the Web instead of using a manually submitted and
> > managed catalog, will probably turn out to be the better approach. Watson
> > used to have an overview with statistics on its current content, and I
> > really loved that overview, but this feature has been disabled for a few
> > months. If it were available, especially in any graphical format that can
> > be easily reused in slides -- for example, graphs on the growth of number
> > of triples, datasets, etc., graphs on the change of cohesion, vocabulary
> > reuse, etc. over time, within the Watson corpus -- I have no doubts that
> > such graphs and data would be widely reused, and would in many instances
> > replace the current usage of the cloud diagram. (I am furthermore curious
> > about Enrico's statement that the Semantic Web =/= Linked Open Data and
> > wonder about what he means here, but that is a completely different
> > thread).
> >
> > Finally, to what I consider most important in this thread:
> >
> > I also find it a shame, that this thread has been hijacked, especially
> > since the original topic was so interesting. The original email by Anja
> > was not about the LOD cloud, but rather about -- as the title of the
> > thread still suggests -- the compliance of LOD with some best practices.
> > Instead of the question "is X in the diagram", I would much rather see a
> > discussion on "are the selected quality criteria good criteria? why are
> > some of them so little followed? how can we improve the situation?" Anja
> > has pointed to a wealth of openly available numbers (no pun intended),
> > that have not been discussed at all. For example, only 7.5% of the data
> > sources provide a mapping of "proprietary vocabulary terms" to "other
> > vocabulary terms". For anyone building applications to work with LOD, this
> > is a real problem.
> >
> > Whenever I was working on actual applications using LOD, I got
> > disillusioned. The current state of LOD is simply insufficient to sustain
> > serious application development on top of it. Current best practices (like
> > follow-your-nose) are theoretically sufficient, but not fully practical.
> > To just give a few examples:
> > * imagine you get an RDF file with some 100 triples, including some 120
> > vocabulary terms. In order to actually display those, you need the label
> > for every single one of these terms, preferably in the user's language.
> > But most RDF files do not provide such labels for terms they merely
> > reference. In order to actually display them, we need to resolve all these
> > 120 terms, i.e. we need to make more than a hundred calls to the Web --
> > and we are only talking about the display of a single file! In Semantic
> > MediaWiki we had, from the beginning, made sure that all referenced terms
> > are accompanied with some minimum definition, providing labels, types,
> > etc. which enables tools to at least create a display quickly and then
> > gather further data, but that practice was not adopted. Never mind the
> > fact that language labels are basically not used for multi-linguality
> > (check out Chapter 4 of my thesis for the data, it's devastating).
> > * URIs. Perfectly valid URIs such as those used in Geonames, e.g.
> > http://sws.geonames.org/3202326/ suddenly cause trouble, because their
> > serialization as a QName is, well, problematic.
> > * missing definitions. E.g. DBpedia has the properties
> > http://dbpedia.org/ontology/capital and
> > http://dbpedia.org/property/capital -- used in the very same file about
> > the same country. Resolving them will not help you at all to figure out
> > how they relate to each other. As a human I may make an educated guess,
> > but for a machine agent? And in this case we are talking about the *same*
> > data provider, never mind cross-data-provider mapping.
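
The QName point in the list above can be illustrated with a small sketch. It is not part of the original message: the helper name split_for_qname is hypothetical, and the ASCII-only pattern is a simplified approximation of the XML NCName rule (the real rule admits many non-ASCII characters). The assumption is that an XML-style QName needs the URI to be cut into a namespace plus a local name, where the local name may not contain a slash and may not start with a digit.

```python
import re

# Simplified, ASCII-only approximation of an XML NCName (assumption:
# the full production also allows many non-ASCII characters).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def split_for_qname(uri):
    """Return (namespace, local_name) for the longest valid local name,
    or None when no split point yields a usable local name."""
    # Start at 1 so the namespace part is never empty.
    for i in range(1, len(uri)):
        local = uri[i:]
        if NCNAME.fullmatch(local):
            return uri[:i], local
    return None


if __name__ == "__main__":
    # A DBpedia property URI from the message splits cleanly.
    print(split_for_qname("http://dbpedia.org/ontology/capital"))
    # The Geonames URI ends in "/", so every candidate local name
    # contains a slash and no split is possible.
    print(split_for_qname("http://sws.geonames.org/3202326/"))
```

Under these assumptions the Geonames URI has no valid split at all, while the DBpedia URI splits into a namespace ending in "/" and the local name "capital". A serializer facing the first case has to fall back to writing the full URI, if the syntax allows that.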
> >
> > I could go on for a while -- and these are just examples *on top* of the
> > problems that Anja raises in her original post, and I am sure that
> > everyone who has actually used LOD from the wild has stumbled upon even
> > more such problems. She is raising here a very important point, for the
> > practical application of the data. But instead of discussing these issues
> > that actually matter, we talk about bubble graphs, that are created and
> > maintained voluntarily, and why a dataset is included or not, even though
> > the criteria have been made transparent and explicit. All these issues
> > seriously hamper the uptake of usage of LOD and lead to the result that it
> > is so much easier to use dedicated, proprietary APIs in many cases.
> >
> > At one point it was stated that Chris' criteria were random and hard to
> > fulfill in certain cases. If you'd ask me, I would suggest much more
> > draconian criteria, in order to make data reuse as simple as we all
> > envision. I really enjoy the work of the pedantic web group with respect
> > to this, providing validators and guidelines, but in order to figure out
> > what really needs to be done, and what the criteria for good data on the
> > Semantic Web need to look like, we need to get back to Anja's original
> > questions. I think that is a question we may try to tackle in Shanghai in
> > some form, I at least would find that an interesting topic.
> >
> > Sorry again for the length of this rant, and I hope I have offended
> > everyone equally, I really tried not to single anyone out,
> > Denny
> >
> > P.S.: Finally, a major reason why I think I shouldn't have commented on
> > this thread is because it involves something I co-created, and thus I am
> > afraid it is impossible to stay unbiased. I consider constant advertising
> > of your own ideas tiring, impolite, and bound to lead to unproductive
> > discussions due to emotional investment. If the work you do is good
> > enough, you will find champions for it. If not, improve it or do something
> > else.
> >
> >
> >
> > On Oct 21, 2010, at 20:56, Martin Hepp wrote:
> >
> > > Hi all:
> > >
> > > I think that Enrico really made two very important points:
> > >
> > > 1. The LOD bubbles diagram has very high visibility inside and outside
> > > of the community (up to the point that broad audiences believe the
> > > diagram would define relevance or quality).
> > >
> > > 2. Its creators have a special responsibility (in particular as
> > > scientists) to maintain the diagram in a way that enhances insight and
> > > understanding, rather than conveying false facts and confusing people.
> > >
> > > So Kingsley's argument that anybody could provide a better diagram does
> > > not really hold. It will harm the community as a whole, sooner or later,
> > > if the diagram misses the point, simply based on the popularity of this
> > > diagram.
> > >
> > > And to be frank, despite other design decisions, it is really ridiculous
> > > that Chris justifies the inclusion of Denny's numbers dataset as valid
> > > Linked Data, because that dataset is, by design and known to everybody
> > > in the core community, not data but noise.
> > >
> > > This is the "linked data landfill" mindset that I have kept on
> > > complaining about. You make it very easy for others to discard the idea
> > > of linked data as a whole.
> > >
> > > Best
> > >
> > > Martin
> > >
> > >
>


-- 

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
Received on Friday, 22 October 2010 12:19:58 UTC