sanity checking the LOD Cloud statistics

Hello, all --

I've had a few minutes to start updating my version [1] of the
LOD Cloud diagram [2], which means I got to start looking at the Data
Set Statistics [3] and Link Statistics [4] pages.

I have found a number of apparent discrepancies.  I'm not sure where these
came from, but I think they need attention and correction.

[3] gave some round, and some exact values.  It's not at all clear whether
these values were originally intended to reflect triple counts in the data
set, URIs minted there (i.e., Entities named there), or something else
entirely.  I think the page holds a mix of these, which makes the values
rather troublesome as a basis for comparison between data sets.

[4] had a few exact values, which appear to have been added there
incorrectly; the page apparently means to use only 3 "counts" for the
inter-set linkages -- "> 100", "> 1000", and "> 100.000".  Clearly, the
last means more-than-one-hundred-thousand -- because the first clearly
means more-than-one-hundred -- but this was not obvious at first glance,
given my US training that the period is used for the decimal, not for the
thousands delimiter.

First thing, therefore, I suggest that all period-delimiters on [4] change
to comma-delimiters, to match the first page.  (I've actually made this
change, but incorrect values may well remain -- please read on.)
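
To make the delimiter change concrete, here is a minimal Python sketch
of the rewrite.  It assumes that every period or comma inside a "> N"
label on [4] is a thousands delimiter, which seems safe because link
counts are whole numbers.  The function name is my own invention, not
anything used by the wiki.

```python
import re


def normalize_threshold(label: str) -> str:
    """Rewrite a threshold label such as '> 100.000' as '> 100,000'.

    Assumption: any '.' or ',' in the label is a thousands delimiter,
    never a decimal point, because link counts are integers.
    """
    match = re.fullmatch(r"\s*>\s*([\d.,]+)\s*", label)
    if match is None:
        raise ValueError(f"not a threshold label: {label!r}")
    # Drop every delimiter, then re-insert commas in the US style.
    digits = re.sub(r"[.,]", "", match.group(1))
    return f"> {int(digits):,}"


print(normalize_threshold("> 100.000"))
```

Note that this also turns "> 1000" into "> 1,000", so every label ends
up in one consistent style.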

I think it also makes sense to add "> 10,000", and "> 1,000,000" to the
values here.  Just looking at the DBpedia "actual counts" which were on
the page, it's clear that a log-scale comparing the interlinkage levels
presents a better picture than the three arbitrarily chosen levels.
(Again, I've started using these as relevant.)
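
As a sketch of what I mean by a log-scale, the following Python maps an
exact link count onto the five proposed levels.  The function name and
the choice to return None for counts of 100 or fewer are my assumptions;
the sample counts are purely illustrative, not taken from either page.

```python
from typing import Optional

# Proposed levels, largest first, each a power of ten.
THRESHOLDS = [1_000_000, 100_000, 10_000, 1_000, 100]


def link_bucket(count: int) -> Optional[str]:
    """Return the largest '> N' level that the count strictly exceeds.

    Counts of 100 or fewer fall below the smallest level, so they get
    no label at all (None).
    """
    for threshold in THRESHOLDS:
        if count > threshold:
            return f"> {threshold:,}"
    return None


for sample in (250, 45_000, 2_000_000):  # illustrative values only
    print(sample, link_bucket(sample))
```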


Now to the discrepancies.  From [3], I got this line --

    <http://dbtune.org/bbc/playcount/>   BBC Playcount Data      10,000

At first read, I thought that meant 10,000 triples.  But [4] indicated
these external link counts for BBC Playcount Data --

    <http://www.bbc.co.uk/programmes>    BBC Programmes     > 100.000
    <http://dbtune.org/musicbrainz>      Musicbrainz        > 100.000

I don't see a way for 10,000 triples to include 200,000 external links.
That means that the first count must be of Entities.  But going to the
BBC Playcount home page [5], I found --

    Triple count                        1,954,786
    Distinct BBC Programmes resources       6,863
    Distinct Musicbrainz resources          7,055

An obvious missing number here is a count of minted URIs -- that is, of
BBC Playcount resources/entities -- but I also learned that BBC Playcount
URIs are not pointers-to-values, but values-in-themselves.  The count is
*embedded* in the URI (and thus, if a count changes, the URI changes!) --

    A playcount URI in this service looks like:

       http://dbtune.org/bbc/playcount/<id>_<k>

    Where <id> is the id of the episode or the brand, as in /programmes BBC
    catalogue, and <k> is a number between 0 and the number of playcounts
    for the episode or the brand.

If we accept this URI construction as reasonable (which I don't), it seems
that <k> must actually be a "natural" or "counting" number (i.e., an integer
greater than or equal to 1).  A value of 0 is nonsensical, as it would result
in a Cartesian data set -- where each and every Musicbrainz resource gets
a Playcount URI for each and every Programme resource -- and most of these
Playcount URIs would have <k> = 0, for most Musicbrainz resources were not
played in most Programmes.

Even if Zero-Play URIs are created only for those Musicbrainz resources
which were played in *some* Programme, for those Programmes where they
weren't played, far more URIs are created than are needed.
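
To put a rough number on the Cartesian worry, here is the arithmetic in
Python, using the distinct-resource counts from the Playcount home page
[5].  This is only my reading of what "<k> may be 0" could imply, not a
claim about how the data set is actually generated.

```python
# Figures quoted from the BBC Playcount home page [5].
distinct_programmes = 6_863
distinct_musicbrainz = 7_055
triple_count = 1_954_786

# If every (Programme, Musicbrainz) pair got its own Playcount URI,
# including the zero-play pairs, this many URIs would be needed.
cartesian_pairs = distinct_programmes * distinct_musicbrainz

print(f"{cartesian_pairs:,} possible pairs")
print(f"{triple_count:,} triples actually published")
print(f"ratio: {cartesian_pairs / triple_count:.1f}")
```

That is roughly 25 times the published triple count, so a full set of
Zero-Play URIs evidently isn't there today, but the URI pattern as
documented would permit it.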

I'm hoping that the folks who built this data set are reading, and will
consider restructuring it.  I'd suggest that the URI structure should be
more like --

    http://dbtune.org/bbc/playcount/<id>_count

-- where <id> reflects *either* Programmes *or* Musicbrainz ID (this may
mean further thinking, as I'm not directly familiar with these IDs, and
Programmes may conflict with Musicbrainz), and the count (the *value*)
is returned when the constructed URI is dereferenced.


More baffling, and more troubling, on [3] I found --

    <http://ieee.rkbexplorer.com/>     IEEE     111

-- which purports to be linked out as follows --

    <http://acm.rkbexplorer.com/>        ACM                    > 1000
    <http://eprints.rkbexplorer.com/>    eprints             > 100.000
    <http://citeseer.rkbexplorer.com/>   CiteSeer            > 100.000
    <http://dblp.rkbexplorer.com/>       DBLP RKB Explorer      > 1000
    <http://laas.rkbexplorer.com/>       LAAS CNRS           > 100.000

Looking to primary sources again --

    Current statistics for this repository (ieee.rkbexplorer.com) —

       Last data assertion  2009-02-06 13:28:04
       Number of triples    111442
       Number of symbols    31552
       Size of RDF dataset  8.2M

    Current statistics for the CRS for this repository (ieee.rkbexplorer.com) —

       Last data assertion   2009-03-25 16:52:19
       Number of URIs        15142
       Number of bundles     25410
       of which active       4874

(Also according to this site, 'A CRS maintaines "bundles" of URIs which are
deemed to be equivalent', which I presume means they are tied by owl:sameAs.)

It seems clear that the initial statistic was reported in thousands, and
should be changed.  However, the out-links still don't add up.  300,000
links (to eprints, CiteSeer, and LAAS CNRS) cannot be made with barely
more than a third as many total triples.  A little more digging
revealed [6] --

    DBLP RKB Explorer   dblp.rkbexplorer.com       5053 URIs
    ACM                 acm.rkbexplorer.com        2511 URIs
    CiteSeer            citeseer.rkbexplorer.com    888 URIs
    eprints             eprints.rkbexplorer.com     602 URIs
    LAAS CNRS           laas.rkbexplorer.com         93 URIs

-- (sorted by external URI counts, and trimmed to include only those external
sets linked by more than 100 URIs, plus LAAS which apparently was incorrectly
included in the existing list).
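
The sort-and-trim step is simple enough to sketch in Python.  The
dictionary holds only the five rows quoted above from [6], not the
whole page, and the function name and the `minimum` parameter are mine.

```python
# The five rows quoted above from [6]; the full page lists more sets.
foreign_uri_counts = {
    "laas.rkbexplorer.com": 93,
    "eprints.rkbexplorer.com": 602,
    "dblp.rkbexplorer.com": 5053,
    "citeseer.rkbexplorer.com": 888,
    "acm.rkbexplorer.com": 2511,
}


def trim_and_sort(counts, minimum=100):
    """Keep sets linked by more than `minimum` URIs, largest first."""
    kept = [(host, n) for host, n in counts.items() if n > minimum]
    return sorted(kept, key=lambda item: item[1], reverse=True)


for host, n in trim_and_sort(foreign_uri_counts):
    print(f"{host:28} {n:5} URIs")
```

With the 100-URI cutoff, LAAS (93 URIs) drops out, which is why I think
it does not belong in the link table at all.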

Clearly, whoever posted these values to the table read "100.000" to mean
"one-hundred and zero-thousandths" rather than "one-hundred-thousand".

The correct information for IEEE appears to be --

    <http://ieee.rkbexplorer.com/>     IEEE     111,442

-- which is linked out as follows --

    <http://acm.rkbexplorer.com/>        ACM                 > 1000
    <http://eprints.rkbexplorer.com/>    eprints             >  100
    <http://citeseer.rkbexplorer.com/>   CiteSeer            >  100
    <http://dblp.rkbexplorer.com/>       DBLP RKB Explorer   > 1000

-- (and I've applied these corrections to [3] and [4]).


How many other errors are there in this data?  And how much will those
corrections change the diagrams based upon it?
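
One cheap way to hunt for further errors of this kind is to compare
each data set's size on [3] against the sum of the lower bounds of its
out-links on [4].  The Python below is a minimal sketch, assuming that
the size column is a triple count and that every external link costs at
least one triple.  Since the meaning of the size column is itself in
doubt, a flagged row means "look closer", not "certainly wrong".  The
function name and the row layout are my own.

```python
def implausible_rows(rows):
    """Return names of data sets whose out-links cannot fit in their size.

    Each row is (name, reported_size, link_lower_bounds).  Assumption:
    reported_size counts triples, and each link needs one triple.
    """
    flagged = []
    for name, reported_size, link_lower_bounds in rows:
        if sum(link_lower_bounds) > reported_size:
            flagged.append(name)
    return flagged


rows = [
    # Values as originally found on [3] and [4].
    ("BBC Playcount Data", 10_000, [100_000, 100_000]),
    ("IEEE", 111, [1_000, 100_000, 100_000, 1_000, 100_000]),
    # IEEE after the corrections described above.
    ("IEEE (corrected)", 111_442, [1_000, 100, 100, 1_000]),
]

print(implausible_rows(rows))
```

Both of the original rows are flagged, and the corrected IEEE row
passes.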

I'll continue reviewing and correcting things, but thought you should
all be aware that the current table and diagrams may be substantially
incorrect.

Be seeing you,

Ted


[1] <http://virtuoso.openlinksw.com/images/dbpedia-lod-cloud.html>
[2] <http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html>
[3] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics>
[4] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics>
[5] <http://dbtune.org/bbc/playcount/>
[6] <http://ieee.rkbexplorer.com/crs/foreign.php>


-- 
Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
Evangelism & Support         //        mailto:tthibodeau@openlinksw.com
OpenLink Software, Inc.      //              http://www.openlinksw.com/
                                  http://www.openlinksw.com/weblogs/uda/
OpenLink Blogs              http://www.openlinksw.com/weblogs/virtuoso/
                                http://www.openlinksw.com/blog/~kidehen/
     Universal Data Access and Virtual Database Technology Providers

Received on Wednesday, 1 April 2009 06:09:53 UTC