Re: Size matters -- How big is the danged thing

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Sat, 22 Nov 2008 01:13:37 +0000
Message-ID: <210271540811211713v15b89fceua21c1b4861ecd19e@mail.gmail.com>
To: "Tom Heath" <tom.heath@talis.com>
Cc: "Jim Hendler" <hendler@cs.rpi.edu>, "Michael Hausenblas" <michael.hausenblas@deri.org>, public-lod@w3.org

Well, when a sitemap is submitted the dataset is usually counted right
away and with no crawling uncertainty.

E.g. Cycorp submitted theirs yesterday, and 130k linked data documents are
indexed today: http://sindice.com/search?q=opencyc&qt=term

I'll try to get some daily calculated stats out next week. We had this
prototype idea of a map of data built live; see the sketch we had at
http://sindice.com/map .
Ideally the goal was to have a dynamic LOD map, actually useful for
crafting queries (with stats on the side).

But the project will require more time.


On Fri, Nov 21, 2008 at 4:47 PM, Tom Heath <tom.heath@talis.com> wrote:
> Hi Jim, all,
> At WWW2008 ChrisB and I approached R Guha to ask if Google could apply
> some of their considerable resources to answering this question. The
> response went something like "sure, we can do that, email me", but
> since then we've been unable to get any further responses. Perhaps you
> have a stronger connection there and could nudge that?
> Alternatively, perhaps Yahoo or the Falcon-S guys could help out, as
> they seem to have a pretty comprehensive crawl, or maybe SWSE could.
> Surely there's some kudos to be had in being the de facto authority on
> the size of the Web of Data, at least for a few months/years yet.
> I agree, size does matter. Time for another single function web site
> at howbigisthewebofdata.com? ;)
> Tom.
> 2008/11/20 Jim Hendler <hendler@cs.rpi.edu>:
>> I guess I asked the question wrong - the linked open data project currently
>> identifies a specific set of data resources that are linked together - so
>> this "entity" is definable - I didn't mean to ask how big the whole
>> Semantic Web is - I meant how many triples are in this particular group -
>> the set that are described on
>> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
>> I've been able to download pictures of this graph every few months or so,
>> and you can see the number of datasets growing, but the last published
>> number of triples for the thing (as stated on that page) is from over a year
>> ago, and a whole bunch of stuff has been added and some of these have grown
>> a lot - so we have a publicly shared, large-scale, RDF data resource that
>> can be used for benchmarking, trying different interfaces and new
>> technologies, etc
>> So it would be really nice to get a number every now and then so we could
>> plot growth, explain to people what is in it better, etc.
>> I know, I know, I know all the technical reasons this is relatively
>> meaningless, but I gotta tell you, when I hear someone say "20 billion
>> triples," I can tell you it causes people to pay attention -- problem is
>> I would like to use a number that has some validity before I start quoting
>> it....
>> On Nov 20, 2008, at 5:12 AM, Michael Hausenblas wrote:
>>> My 2c in order to capture this for others as well:
>>> http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing
>>> Cheers,
>>>        Michael
>>> ----------------------------------------------------------
>>> Dr. Michael Hausenblas
>>> DERI - Digital Enterprise Research Institute
>>> National University of Ireland, Lower Dangan,
>>> Galway, Ireland
>>> ----------------------------------------------------------
>>> Jim Hendler wrote:
>>>> So I've been to a number of talks lately where the size of the current
>>>> (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with
>>>> numbers that vary quite widely.  The esw wiki says 2B triples as of 2007,
>>>> which isn't very useful given the growth we've seen in the past year -- I've
>>>> also seen the various blog posts and mail threads saying why we shouldn't
>>>> cite meaningless numbers and such - but frankly, I've recently been on a
>>>> bunch of panels with DB guys, and I'd love to have a reasonable number to
>>>> quote -- anyone have a good estimate of the size of the danged thing (number
>>>> of triples in the whole as an RDF graph would be nice) -- would also be nice
>>>> for general audiences where big numbers tend to impress and for research
>>>> purposes (for example, we know how far we can compress the triples for an in
>>>> memory approach we are playing with, but we want to figure out how much
>>>> memory we need for the whole cloud - we want to know if we need to shell out
>>>> for the 16G iphone)
>>>> anyway, if anyone has a decent estimate, or even a smart educated guess,
>>>> I'd love to hear it
>>>> JH
>>>> "If we knew what we were doing, it wouldn't be called research, would
>>>> it?." - Albert Einstein
>>>> Prof James Hendler                http://www.cs.rpi.edu/~hendler
>>>> Tetherless World Constellation Chair
>>>> Computer Science Dept
>>>> Rensselaer Polytechnic Institute, Troy NY 12180
> --
> Dr Tom Heath
> Researcher
> Platform Division
> Talis Information Ltd
> T: 0870 400 5000
> W: http://www.talis.com/
Received on Saturday, 22 November 2008 01:14:22 UTC