AW: sanity checking the LOD Cloud statistics - Please add the statistics for your dataset to the Wiki

From: Chris Bizer <chris@bizer.de>
Date: Wed, 1 Apr 2009 14:30:09 +0200
To: <public-lod@w3.org>
Cc: "'Ted Thibodeau Jr'" <tthibodeau@openlinksw.com>
Message-ID: <00af01c9b2c5$98931fd0$c9b95f70$@de>
Hi Ted,

Good that you raised this topic.

The statistics were added to the wiki by Anja and reflect her
knowledge/guesses about the size of the datasets and the numbers of links
between them. And of course, some of her guesses might be wrong.  

In an ideal world, these statistics would be provided by Semantic Web search
engines that crawl the cloud and calculate the statistics afterwards based
on what they actually got from the Web. Alternatively, all dataset providers
could publish voiD descriptions of their datasets, which could also be used
to generate the statistics.

But as the search engines have not yet reached this point and as voiD is
also not used by all data providers, we thought it would be useful to put
these statistics into the Wiki as a starting point, so that people
(especially dataset publishers) can update them and we can use them the
next time we draw the LOD cloud.

Yesterday I updated the statistics about outgoing links connecting DBpedia
with other datasets.

If everybody on this list did the same for the data sources they maintain
or use, I think we would get a much more accurate LOD diagram the next
time we draw it.

So, please: Take 5 minutes and quickly add the actual statistics about your
datasets to

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics
(size of your dataset)

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics
(number of links connecting your dataset with other datasets)

Thanks a lot in advance!

Cheers

Chris




> -----Original Message-----
> From: public-lod-request@w3.org [mailto:public-lod-request@w3.org] On
> Behalf Of Ted Thibodeau Jr
> Sent: Wednesday, 1 April 2009 08:06
> To: public-lod@w3.org
> Subject: sanity checking the LOD Cloud statistics
> 
> Hello, all --
> 
> I've had a few minutes to start working to update my version [1] of the
> LOD Cloud diagram [2], which means I got to start looking at the Data
> Set Statistics [3] and Link Statistics [4] pages.
> 
> I have found a number of apparent discrepancies.  I'm not sure where
> these came from, but I think they need attention and correction.
> 
> [3] gave some round, and some exact values.  It's not at all clear
> whether these values were originally intended to reflect triple-counts
> in the data set, URIs minted there (i.e., Entities named there), or
> something else entirely.  I think the page holds a mix of these, which
> makes them rather troublesome as a source of comparison between data
> sets.
> 
> [4] had few exact values, which appear to have been incorrectly added
> there, and apparently means to use only 3 "counts" for the inter-set
> linkages -- "> 100", "> 1000", "> 100.000".  Clearly, the last means
> more-than-one-hundred-thousand -- because the first clearly means
> more-than-one-hundred -- but this was not obvious at first glance,
> given my US-training that the period is used for the decimal, not for
> the thousands delimiter.
> 
> First thing, therefore, I suggest that all period-delimiters on [4]
> change to comma-delimiters, to match the first page.  (I've actually
> made this change, but incorrect values may well remain -- please read
> on.)
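
A minimal sketch of how the separator ambiguity described above could be
sidestepped when reading the wiki values. It assumes every entry is a
whole-number count, so that "." and "," can only be thousands separators.
The function name parse_grouped_count is hypothetical and not part of any
wiki tooling.

```python
import re


def parse_grouped_count(text: str) -> int:
    """Parse a count such as '100.000', '100,000' or '> 1000' as an integer.

    Assumption: the value is a whole number, so '.' and ',' are only ever
    thousands separators, never decimal points.
    """
    # Drop surrounding whitespace and an optional leading '>' marker.
    cleaned = text.strip().lstrip(">").strip()
    # Accept plain digits, or groups of three digits split by '.' or ','.
    if not re.fullmatch(r"\d{1,3}([.,]\d{3})+|\d+", cleaned):
        raise ValueError(f"not an unambiguous whole-number count: {text!r}")
    return int(re.sub(r"[.,]", "", cleaned))


print(parse_grouped_count("> 100.000"))  # 100000
print(parse_grouped_count("111,442"))    # 111442
```

A value such as "100.5" is rejected instead of being guessed at, which is
the case a human reader would also have to query.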
> 
> I think it also makes sense to add "> 10,000", and "> 1,000,000" to the
> values here.  Just looking at the DBpedia "actual counts" which were on
> the page, it's clear that a log-scale comparing the interlinkage levels
> presents a better picture than the three arbitrarily chosen levels.
> (Again, I've started using these as relevant.)
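
The log-scale levels proposed above could be assigned mechanically. This
is only a sketch: the function name link_bucket and the "<= 100" label
for counts below the first level are assumptions, not something the wiki
pages define.

```python
# Powers of ten from the message: > 100 up to > 1,000,000.
THRESHOLDS = [100, 1_000, 10_000, 100_000, 1_000_000]


def link_bucket(link_count: int) -> str:
    """Return the highest '> N' level that link_count strictly exceeds."""
    label = "<= 100"  # assumed label for counts below the first level
    for threshold in THRESHOLDS:
        if link_count > threshold:
            label = f"> {threshold:,}"
    return label


print(link_bucket(5053))  # > 1,000
print(link_bucket(888))   # > 100
```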
> 
> 
> Now to the discrepancies.  From [3], I got this line --
> 
>     <http://dbtune.org/bbc/playcount/>   BBC Playcount Data      10,000
> 
> At first read, I thought that meant 10,000 triples.  But [4] indicated
> these external link counts for BBC Playcount Data --
> 
>     <http://www.bbc.co.uk/programmes>    BBC Programmes     > 100.000
>     <http://dbtune.org/musicbrainz>      Musicbrainz        > 100.000
> 
> I don't see a way for 10,000 triples to include 200,000 external links.
> That means that the first count must be of Entities.  But going to the
> BBC Playcount home page [5], I found --
> 
>     Triple count                        1,954,786
>     Distinct BBC Programmes resources       6,863
>     Distinct Musicbrainz resources          7,055
> 
> An obvious missing number here is a count of minted URIs -- that is, of
> BBC Playcount resources/entities -- but I also learned that BBC
> Playcount URIs are not pointers-to-values, but values-in-themselves.
> The count is *embedded* in the URI (and thus, if a count changes, the
> URI changes!) --
> 
>     A playcount URI in this service looks like:
> 
>        http://dbtune.org/bbc/playcount/<id>_<k>
> 
>     Where <id> is the id of the episode or the brand, as in /programmes
>     BBC catalogue, and <k> is a number between 0 and the number of
>     playcounts for the episode or the brand.
> 
> If we accept this URI construction as reasonable (which I don't), it
> seems that <k> must actually be a "natural" or "counting" number (i.e.,
> an integer greater than or equal to 1).  A value of 0 is nonsensical,
> as it would result in a Cartesian data set -- where each and every
> Musicbrainz resource gets a Playcount URI for each and every Programme
> resource -- and most of these Playcount URIs would have <k> = 0, for
> most Musicbrainz resources were not played in most Programmes.
> 
> Even if Zero-Play URIs are created only for those Musicbrainz resources
> which were played in *some* Programme, for those Programmes where they
> weren't played, far more URIs are created than are needed.
> 
> I'm hoping that the folks who built this data set are reading, and will
> consider restructuring it.  I'd suggest that the URI structure should
> be more like --
> 
>     http://dbtune.org/bbc/playcount/<id>_count
> 
> -- where <id> reflects *either* Programmes *or* Musicbrainz ID (this
> may mean further thinking, as I'm not directly familiar with these IDs,
> and Programmes may conflict with Musicbrainz), and the count (the
> *value*) is returned when the constructed URI is dereferenced.
> 
> 
> More baffling, and more troubling, on [3] I found --
> 
>     <http://ieee.rkbexplorer.com/>     IEEE     111
> 
> -- which purports to be linked out as follows --
> 
>     <http://acm.rkbexplorer.com/>        ACM                    > 1000
>     <http://eprints.rkbexplorer.com/>    eprints             > 100.000
>     <http://citeseer.rkbexplorer.com/>   CiteSeer            > 100.000
>     <http://dblp.rkbexplorer.com/>       DBLP RKB Explorer      > 1000
>     <http://laas.rkbexplorer.com/>       LAAS CNRS           > 100.000
> 
> Looking to primary sources again --
> 
>     Current statistics for this repository (ieee.rkbexplorer.com) —
> 
>        Last data assertion  2009-02-06 13:28:04
>        Number of triples    111442
>        Number of symbols    31552
>        Size of RDF dataset  8.2M
> 
>     Current statistics for the CRS for this repository
>     (ieee.rkbexplorer.com) —
> 
>        Last data assertion   2009-03-25 16:52:19
>        Number of URIs        15142
>        Number of bundles     25410
>        of which active       4874
> 
> (Also according to this site, 'A CRS maintains "bundles" of URIs which
> are deemed to be equivalent', which I presume means they are tied by
> owl:sameAs.)
> 
> It seems clear that the initial statistic was reported as thousands,
> and should be changed.  However, the out-links still don't add up.
> 300,000 links (to eprints, CiteSeer, and LAAS CNRS) cannot be made with
> a third that many total triples.  A little more digging revealed [6] --
> 
>     DBLP RKB Explorer   dblp.rkbexplorer.com       5053 URIs
>     ACM                 acm.rkbexplorer.com        2511 URIs
>     CiteSeer            citeseer.rkbexplorer.com    888 URIs
>     eprints             eprints.rkbexplorer.com     602 URIs
>     LAAS CNRS           laas.rkbexplorer.com         93 URIs
> 
> -- (sorted by external URI counts, and trimmed to include only those
> external sets linked by more than 100 URIs, plus LAAS which apparently
> was incorrectly included in the existing list).
> 
> Clearly, whoever posted these values to the table read "100.000" to
> mean "one-hundred and zero-thousandths" rather than
> "one-hundred-thousand".
> 
> The correct information for IEEE appears to be --
> 
>     <http://ieee.rkbexplorer.com/>     IEEE     111,442
> 
> -- with out-links as follows --
> 
>     <http://acm.rkbexplorer.com/>        ACM                 > 1000
>     <http://eprints.rkbexplorer.com/>    eprints             >  100
>     <http://citeseer.rkbexplorer.com/>   CiteSeer            >  100
>     <http://dblp.rkbexplorer.com/>       DBLP RKB Explorer   > 1000
> 
> -- (and I've applied these corrections to [3] and [4]).
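
The plausibility argument used for IEEE above could be run over every row
of the two wiki tables. A rough sketch, under the assumption that each
outgoing link costs at least one triple; the function name
links_exceed_triples is hypothetical, and the figures are the ones quoted
in this message.

```python
from typing import Mapping


def links_exceed_triples(triple_count: int,
                         link_floors: Mapping[str, int]) -> bool:
    """True when the claimed out-links cannot fit in the data set.

    Assumption: every outgoing link costs at least one triple, so the
    sum of the '> N' lower bounds may not exceed the triple count.
    """
    return sum(link_floors.values()) > triple_count


IEEE_TRIPLES = 111_442  # triple count reported by ieee.rkbexplorer.com

# Lower bounds as originally listed on the link statistics page.
as_listed = {"ACM": 1_000, "eprints": 100_000, "CiteSeer": 100_000,
             "DBLP RKB Explorer": 1_000, "LAAS CNRS": 100_000}
# Lower bounds after the correction described above.
corrected = {"ACM": 1_000, "eprints": 100, "CiteSeer": 100,
             "DBLP RKB Explorer": 1_000}

print(links_exceed_triples(IEEE_TRIPLES, as_listed))  # True
print(links_exceed_triples(IEEE_TRIPLES, corrected))  # False
```

A check like this only flags rows that are impossible; a row that passes
may still be wrong.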
> 
> 
> How many other errors are there in this data?  And how much will those
> corrections change the diagrams based upon it?
> 
> I'll continue reviewing and correcting things, but thought you should
> all be aware that the current table and diagrams may be substantially
> incorrect.
> 
> Be seeing you,
> 
> Ted
> 
> 
> [1] <http://virtuoso.openlinksw.com/images/dbpedia-lod-cloud.html>
> [2] <http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html>
> [3] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics>
> [4] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics>
> [5] <http://dbtune.org/bbc/playcount/>
> [6] <http://ieee.rkbexplorer.com/crs/foreign.php>
> 
> 
> 
Received on Wednesday, 1 April 2009 12:30:15 UTC
