- From: Michal Finkelstein <michal.finkelstein@thomsonreuters.com>
- Date: Wed, 1 Apr 2009 03:37:29 -0400
- To: Ted Thibodeau Jr <tthibodeau@openlinksw.com>, public-lod@w3.org
Hi Ted,

First, I totally agree with the need to change the current (relatively arbitrary) levels. Values like > 100 and even > 100,000 seem a bit anachronistic; I guess these ranges were valid in the very first days of the LOD Cloud, but today, for the most part, we're talking about millions of URIs, triples etc.

Two significant errors I see related to OpenCalais:

  Open Calais    DBpedia     > 100
  Open Calais    Freebase    > 100

The correct number should be > 100,000 for both OpenCalais-to-DBpedia and OpenCalais-to-Freebase link counts. To make sure we're on the same page: that's larger than one hundred thousand.

Also regarding the size of the data set:

  OpenCalais    4,500,000

The number shown actually refers to the URI count and not to the number of triples. The number of triples is at least 10 times bigger, or: 45,000,000 (that's 45 million triples).

Regards,
Michal

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
Michal Finkelstein
Director, Content Strategy
The Calais Initiative
Thomson Reuters
michal.finkelstein@thomsonreuters.com

-----Original Message-----
From: public-lod-request@w3.org [mailto:public-lod-request@w3.org] On Behalf Of Ted Thibodeau Jr
Sent: Wednesday, April 01, 2009 9:06 AM
To: public-lod@w3.org
Subject: sanity checking the LOD Cloud statistics

Hello, all --

I've had a few minutes to start working to update my version [1] of the LOD Cloud diagram [2], which means I got to start looking at the Data Set Statistics [3] and Link Statistics [4] pages.

I have found a number of apparent discrepancies. I'm not sure where these came from, but I think they need attention and correction.

[3] gave some round, and some exact values. It's not at all clear whether these values were originally intended to reflect triple-counts in the data set, URIs minted there (i.e., Entities named there), or something else entirely. I think the page holds a mix of these, which makes them rather troublesome as a source of comparison between data sets.

[4] had few exact values, which appear to have been incorrectly added there, and apparently means to use only 3 "counts" for the inter-set linkages -- "> 100", "> 1000", "> 100.000". Clearly, the last means more-than-one-hundred-thousand -- because the first clearly means more-than-one-hundred -- but this was not obvious at first glance, given my US-training that the period is used for the decimal, not for the thousands delimiter.

First thing, therefore, I suggest that all period-delimiters on [4] change to comma-delimiters, to match the first page. (I've actually made this change, but incorrect values may well remain -- please read on.)

I think it also makes sense to add "> 10,000", and "> 1,000,000" to the values here. Just looking at the DBpedia "actual counts" which were on the page, it's clear that a log-scale comparing the interlinkage levels presents a better picture than the three arbitrarily chosen levels. (Again, I've started using these as relevant.)

Now to the discrepancies.

From [3], I got this line --

  <http://dbtune.org/bbc/playcount/>    BBC Playcount Data    10,000

At first read, I thought that meant 10,000 triples. But [4] indicated these external link counts for BBC Playcount Data --

  <http://www.bbc.co.uk/programmes>    BBC Programmes    > 100.000
  <http://dbtune.org/musicbrainz>      Musicbrainz       > 100.000

I don't see a way for 10,000 triples to include 200,000 external links. That means that the first count must be of Entities. But going to the BBC Playcount home page [5], I found --

  Triple count                         1,954,786
  Distinct BBC Programmes resources        6,863
  Distinct Musicbrainz resources           7,055

An obvious missing number here is a count of minted URIs -- that is, of BBC Playcount resources/entities -- but I also learned that BBC Playcount URIs are not pointers-to-values, but values-in-themselves. The count is *embedded* in the URI (and thus, if a count changes, the URI changes!)

--

  A playcount URI in this service looks like:

    http://dbtune.org/bbc/playcount/<id>_<k>

  Where <id> is the id of the episode or the brand, as in /programmes BBC catalogue, and <k> is a number between 0 and the number of playcounts for the episode or the brand.

If we accept this URI construction as reasonable (which I don't), it seems that <k> must actually be a "natural" or "counting" number (i.e., an integer greater than or equal to 1). A value of 0 is nonsensical, as it would result in a Cartesian data set -- where each and every Musicbrainz resource gets a Playcount URI for each and every Programme resource -- and most of these Playcount URIs would have <k> = 0, for most Musicbrainz resources were not played in most Programmes.

Even if Zero-Play URIs are created only for those Musicbrainz resources which were played in *some* Programme, for those Programmes where they weren't played, far more URIs are created than are needed.

I'm hoping that the folks who built this data set are reading, and will consider restructuring it. I'd suggest that the URI structure should be more like --

    http://dbtune.org/bbc/playcount/<id>_count

-- where <id> reflects *either* Programmes *or* Musicbrainz ID (this may mean further thinking, as I'm not directly familiar with these IDs, and Programmes may conflict with Musicbrainz), and the count (the *value*) is returned when the constructed URI is dereferenced.

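To put rough numbers on the URI argument above, here is a minimal Python sketch. The function names and the placeholder ID are hypothetical, chosen only for illustration; the two resource counts are the ones quoted from the BBC Playcount home page [5]. It assumes the worst case described above, where a Playcount URI exists for every Programme/Musicbrainz pair.

```python
# Base of the Playcount URIs, as quoted from the service's home page.
BASE = "http://dbtune.org/bbc/playcount/"

# Counts quoted from the BBC Playcount home page [5].
PROGRAMMES = 6_863
MUSICBRAINZ = 7_055


def current_uri(resource_id: str, k: int) -> str:
    """URI in the existing <id>_<k> scheme: the count is part of the name."""
    return f"{BASE}{resource_id}_{k}"


def suggested_uri(resource_id: str) -> str:
    """URI in the suggested <id>_count scheme: the name never changes;
    the count would be returned when the URI is dereferenced."""
    return f"{BASE}{resource_id}_count"


def cartesian_pairs(programmes: int, musicbrainz: int) -> int:
    """Number of URIs if every Musicbrainz resource got a Playcount URI
    for every Programme (the worst case, mostly with <k> = 0)."""
    return programmes * musicbrainz


if __name__ == "__main__":
    rid = "example-id"  # hypothetical placeholder, not a real catalogue ID
    # In the existing scheme one more play yields a different URI.
    print(current_uri(rid, 3), "->", current_uri(rid, 4))
    # In the suggested scheme the URI is stable.
    print(suggested_uri(rid))
    print(f"{cartesian_pairs(PROGRAMMES, MUSICBRAINZ):,} possible pairs")
```

The full cross product of the two quoted counts is far larger than the 1,954,786 triples the service reports, which is consistent with the point that zero-play URIs for every pair would be wasteful.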
More baffling, and more troubling, on [3] I found --

  <http://ieee.rkbexplorer.com/>    IEEE    111

-- which purports to be linked out as follows --

  <http://acm.rkbexplorer.com/>         ACM                  > 1000
  <http://eprints.rkbexplorer.com/>     eprints              > 100.000
  <http://citeseer.rkbexplorer.com/>    CiteSeer             > 100.000
  <http://dblp.rkbexplorer.com/>        DBLP RKB Explorer    > 1000
  <http://laas.rkbexplorer.com/>        LAAS CNRS            > 100.000

Looking to primary sources again --

  Current statistics for this repository (ieee.rkbexplorer.com) -
    Last data assertion     2009-02-06 13:28:04
    Number of triples       111442
    Number of symbols       31552
    Size of RDF dataset     8.2M

  Current statistics for the CRS for this repository (ieee.rkbexplorer.com) -
    Last data assertion     2009-03-25 16:52:19
    Number of URIs          15142
    Number of bundles       25410
      of which active       4874

(Also according to this site, 'A CRS maintaines "bundles" of URIs which are deemed to be equivalent', which I presume means they are tied by owl:sameAs.)

It seems clear that the initial statistic was reported as thousands, and should be changed. However, the out-links still don't add up. 300,000 links (to eprints, CiteSeer, and LAAS CNRS) cannot be made with a third that many total triples.

A little more digging revealed [6] --

  DBLP RKB Explorer    dblp.rkbexplorer.com        5053 URIs
  ACM                  acm.rkbexplorer.com         2511 URIs
  CiteSeer             citeseer.rkbexplorer.com     888 URIs
  eprints              eprints.rkbexplorer.com      602 URIs
  LAAS CNRS            laas.rkbexplorer.com          93 URIs

-- (sorted by external URI counts, and trimmed to include only those external sets linked by more than 100 URIs, plus LAAS which apparently was incorrectly included in the existing list).

Clearly, whoever posted these values to the table read "100.000" to mean "one-hundred and zero-thousandths" rather than "one-hundred-thousand".

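The two readings of "100.000", and the log-scale levels proposed earlier, can be sketched in a few lines of Python. This is only an illustration: `parse_threshold` and `link_level` are hypothetical helper names, the level list is the existing three values plus the two suggested additions, and the URI counts are the ones listed from [6].

```python
from typing import Optional

# Existing levels plus the suggested "> 10,000" and "> 1,000,000".
LEVELS = [100, 1_000, 10_000, 100_000, 1_000_000]

# External URI counts for ieee.rkbexplorer.com, as listed from [6].
IEEE_OUT_LINKS = {
    "DBLP RKB Explorer": 5053,
    "ACM": 2511,
    "CiteSeer": 888,
    "eprints": 602,
    "LAAS CNRS": 93,
}


def parse_threshold(text: str, thousands_sep: str) -> float:
    """Read a number such as '100.000' given which character
    ('.' or ',') is the thousands delimiter."""
    decimal_sep = "," if thousands_sep == "." else "."
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


def link_level(count: int) -> Optional[str]:
    """Return the highest level strictly exceeded by count,
    written with comma delimiters, or None if below every level."""
    reached = [level for level in LEVELS if count > level]
    return f"> {reached[-1]:,}" if reached else None


if __name__ == "__main__":
    # Period as thousands delimiter: one hundred thousand.
    print(parse_threshold("100.000", thousands_sep="."))
    # Period as decimal point: one hundred.
    print(parse_threshold("100.000", thousands_sep=","))
    for name, count in IEEE_OUT_LINKS.items():
        print(f"{name:20} {count:>6} {link_level(count)}")
```

Under these assumptions DBLP and ACM fall in "> 1,000", CiteSeer and eprints in "> 100", and LAAS CNRS (93 URIs) reaches no level at all.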
The correct information for IEEE appears to be --

  <http://ieee.rkbexplorer.com/>    IEEE    111,442

-- which purports to be linked out as follows --

  <http://acm.rkbexplorer.com/>         ACM                  > 1000
  <http://eprints.rkbexplorer.com/>     eprints              > 100
  <http://citeseer.rkbexplorer.com/>    CiteSeer             > 100
  <http://dblp.rkbexplorer.com/>        DBLP RKB Explorer    > 1000

-- (and I've applied these corrections to [3] and [4]).

How many other errors are there in this data? And how much will those corrections change the diagrams based upon it?

I'll continue reviewing and correcting things, but thought you should all be aware that the current table and diagrams may be substantially incorrect.

Be seeing you,

Ted

[1] <http://virtuoso.openlinksw.com/images/dbpedia-lod-cloud.html>
[2] <http://www4.wiwiss.fu-berlin.de/bizer/pub/lod-datasets_2009-03-05.html>
[3] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/Statistics>
[4] <http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets/LinkStatistics>
[5] <http://dbtune.org/bbc/playcount/>
[6] <http://ieee.rkbexplorer.com/crs/foreign.php>

--
A: Yes.                      http://www.guckes.net/faq/attribution.html
| Q: Are you sure?
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.       //  voice +1-781-273-0900 x32
Evangelism & Support     //  mailto:tthibodeau@openlinksw.com
OpenLink Software, Inc.  //  http://www.openlinksw.com/
                             http://www.openlinksw.com/weblogs/uda/
OpenLink Blogs               http://www.openlinksw.com/weblogs/virtuoso/
                             http://www.openlinksw.com/blog/~kidehen/
Universal Data Access and Virtual Database Technology Providers

This email was sent to you by Thomson Reuters, the global news and information company. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Thomson Reuters.

Received on Wednesday, 1 April 2009 08:09:17 UTC