
Re: sanity checking the LOD Cloud statistics

From: Yves Raimond <yves.raimond@gmail.com>
Date: Wed, 1 Apr 2009 10:09:10 +0100
Message-ID: <82593ac00904010209y5ca0bd57r37513bf8b25e3eb9@mail.gmail.com>
To: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Cc: public-lod@w3.org
Hello!

> Now to the discrepancies.  From [3], I got this line --
>
>   <http://dbtune.org/bbc/playcount/>   BBC Playcount Data      10,000
>
> At first read, I thought that meant 10,000 triples.  But [4] indicated
> these external link counts for BBC Playcount Data --
>
>   <http://www.bbc.co.uk/programmes>    BBC Programmes     > 100.000
>   <http://dbtune.org/musicbrainz>      Musicbrainz        > 100.000
>
> I don't see a way for 10,000 triples to include 200,000 external links.
> That means that the first count must be of Entities.  But going to the
> BBC Playcount home page [5], I found --
>
>   Triple count                        1,954,786
>   Distinct BBC Programmes resources       6,863
>   Distinct Musicbrainz resources          7,055
>
> An obvious missing number here is a count of minted URIs -- that is, of
> BBC Playcount resources/entities -- but I also learned that BBC Playcount
> URIs are not pointers-to-values, but values-in-themselves.  The count is
> *embedded* in the URI (and thus, if a count changes, the URI changes!) --
>
>   A playcount URI in this service looks like:
>
>      http://dbtune.org/bbc/playcount/<id>_<k>
>
>   Where <id> is the id of the episode or the brand, as in /programmes BBC
>   catalogue, and <k> is a number between 0 and the number of playcounts
>   for the episode or the brand.
>
> If we accept this URI construction as reasonable (which I don't), it seems
> that <k> must actually be a "natural" or "counting" number (i.e., an integer
> greater than or equal to 1).  A value of 0 is nonsensical, as it would result
> in a Cartesian data set -- where each and every Musicbrainz resource gets
> a Playcount URI for each and every Programme resource -- and most of these
> Playcount URIs would have <k> = 0, for most Musicbrainz resources were not
> played in most Programmes.


Ted, the BBC playcount data available on DBTune was created at, and
reflects data captured at, *a particular point in time* (Mashed last
year, so the 6th of June, 2008). So these URIs are OK and persistent.
This is also why the playcount information is reified, so that we can
optionally express the date on which it was captured.
Getting any of these URIs will lead you to a statement that makes the
count *at a particular point in time* explicit.
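
To make the URI pattern quoted above concrete, here is a minimal Python sketch that splits a playcount URI of the form <id>_<k> into its two parts. The function name and the sample identifier are hypothetical, and the sketch assumes only what the quoted description says: <k> is the trailing number after the last underscore.

```python
import re

# Pattern taken from the quoted description:
#   http://dbtune.org/bbc/playcount/<id>_<k>
# Assumption: <k> is the run of digits after the last underscore.
PLAYCOUNT_URI = re.compile(
    r"^http://dbtune\.org/bbc/playcount/(?P<id>.+)_(?P<k>\d+)$"
)


def parse_playcount_uri(uri):
    """Return (id, k) for a playcount URI, or None if it does not match."""
    match = PLAYCOUNT_URI.match(uri)
    if match is None:
        return None
    return match.group("id"), int(match.group("k"))


# "b0000001" is a made-up identifier used only for illustration.
example = parse_playcount_uri("http://dbtune.org/bbc/playcount/b0000001_3")
```

Because the data is a snapshot, a given (id, k) pair parsed this way keeps referring to the same captured statement.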

SPARQLing for the number of distinct URIs gives 187062.
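
As a rough illustration of what such a count involves, the Python sketch below counts distinct subject URIs in the playcount namespace over a tiny in-memory list of triples. The sample triples, the predicate URI, and the function name are all made up for the example, and the query string is only an assumed shape in SPARQL 1.1 syntax, not the query that was actually run.

```python
PLAYCOUNT_NS = "http://dbtune.org/bbc/playcount/"

# Assumed query shape (SPARQL 1.1 aggregates); the query actually used
# is not given in this message.
ASSUMED_QUERY = """
SELECT (COUNT(DISTINCT ?s) AS ?n)
WHERE {
  ?s ?p ?o .
  FILTER(STRSTARTS(STR(?s), "http://dbtune.org/bbc/playcount/"))
}
"""

# Hypothetical sample data: (subject, predicate, object) tuples.
SAMPLE_TRIPLES = [
    (PLAYCOUNT_NS + "b0000001_1", "http://example.org/vocab#count", "12"),
    (PLAYCOUNT_NS + "b0000001_1", "http://example.org/vocab#date", "2008-06-06"),
    (PLAYCOUNT_NS + "b0000001_2", "http://example.org/vocab#count", "3"),
    ("http://example.org/other", "http://example.org/vocab#count", "1"),
]


def count_distinct_subjects(triples, namespace):
    """Count distinct subject URIs that start with the given namespace."""
    return len({s for s, _p, _o in triples if s.startswith(namespace)})


distinct = count_distinct_subjects(SAMPLE_TRIPLES, PLAYCOUNT_NS)
```

The same subject appearing in several triples is counted once, which is why a distinct-URI count can be far smaller than the triple count.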

Live updated playcount data will soon be available in BBC Programmes itself.
I hope that answers your questions,

y
Received on Wednesday, 1 April 2009 09:09:55 UTC
