Re: Size of the Semantic Web was: Semantic Web Ontology Map

Hi Chris,

All algorithms operating on the Web are incomplete in practice, since
you can never assume complete information on the Web.  Apparently
this is not clear, so I've changed the sentence to "a map of a part
of the Semantic Web" which should be more precise and address your
concern.

Joshua already indicated that crawling large database-backed sites
one URI at a time is awkward.  It puts a high load on the servers
publishing data.  Also, complete crawling of db-backed sites takes
an unacceptable amount of time.  A polite crawler can fetch around
9k pages per day from a single site (assuming, say, a 10-second wait
between requests), which means that crawling sites such as geonames
or uniprot, which serve millions of URIs, takes years.  For these
reasons, we
currently follow rdfs:seeAlso links and thus do not yet include
complete "Linking Open Data" sites in the map.
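
To make the arithmetic above concrete, here is a rough back-of-the-envelope
sketch.  The 10-second delay is the figure from this mail; the URI counts
are only illustrative assumptions standing in for "millions of URIs", and
the function names are made up for this example.

```python
# Rough estimate of how long a polite, single-threaded crawl of one site takes.
# Assumption: one request every `delay_seconds`, fetch time itself is ignored.

SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(delay_seconds: float) -> float:
    """Number of pages one crawler can fetch per day from a single site."""
    return SECONDS_PER_DAY / delay_seconds


def crawl_days(num_uris: int, delay_seconds: float) -> float:
    """Days needed to fetch `num_uris` documents at the given delay."""
    return num_uris / pages_per_day(delay_seconds)


if __name__ == "__main__":
    delay = 10  # seconds between requests, as assumed in the mail
    print(f"pages per day: {pages_per_day(delay):.0f}")
    # Hypothetical site sizes, not measured figures.
    for n in (1_000_000, 5_000_000, 30_000_000):
        days = crawl_days(n, delay)
        print(f"{n:>12,} URIs: {days:8.0f} days ({days / 365:.1f} years)")
```

With a 10-second delay this gives 8640 pages per day, so the "around 9k"
figure is a rounding of that number; a site with several million URIs ends
up in the range of years.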

I believe that crawlers have slightly different sets of requirements
than visual RDF browsers when it comes to sites that dump huge
amounts of data to the Web.  This is true for both linked data and
rdfs:seeAlso based approaches.  The sitemap extension [1] is one
potential way of helping crawlers operate more efficiently on
the Semantic Web, but I'm sure there are other solutions to the
problem as well.

Regards,
Andreas.

[1] http://sw.deri.org/2007/07/sitemapextension/

Chris Bizer wrote:
>
> Hi Sheila and all,
>
> it is a great idea to try to draw a map of the Semantic Web and to
> provide people with a place to refer to in order to see the Semantic
> Web grow.
> So great idea!
>
> But what confuses me a bit is your claim that this is a map of THE
> Semantic Web as of June 2007.
>
> You have got 200 000 RDF files.
>
> If you look at Swoogle's statistics page
> http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8,
> you see that they have 1.2 million files amounting to 436 million
> triples.
>
> If you look at the Linking Open Data project page
> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
> you will see that there are over one billion triples and I would guess
> that the different servers within the project serve around 30 million
> RDF documents to the Web.
>
> So my guess would be that:
>
> - your dataset covers less than 1% of the Semantic Web
> - Swoogle covers about 4% of the Semantic Web
>
> as of June 2007.
>
> So I think it is important that people who claim to cover the
> whole Semantic Web give some details about the crawling
> algorithms they use to get their datasets, so that it is possible to
> judge the accuracy of their results.
>
> The datasources in the Linking Open Data project are all interlinked
> with RDF links. So it is possible to crawl all 30 million documents by
> following these links. Good starting points for a crawl are URIs
> identifying concepts from different domains within DBpedia, as they
> are interlinked with many other data sets.
>
> Some background information about the idea of RDF links and how
> crawlers can follow these links can be found at
> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/
>
> Cheers
>
> Chris
>
>
> -- 
> Chris Bizer
> Freie Universität Berlin
> +49 30 838 54057
> chris@bizer.de
> www.bizer.de
> ----- Original Message -----
> From: "Golda Velez" <w3@webglimpse.org>
> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
> Cc: <semantic-web@w3.org>
> Sent: Saturday, July 28, 2007 5:31 AM
> Subject: Re: Semantic Web Ontology Map
>
>
>
> Very cool!  Is there also a text representation of this graph available?
>
> Thanks!
>
> --Golda
>
>>
>> Dear all,
>>
>> For those of you who are interested in seeing what type of RDF data is
>> available on the web as of now, we provide a current overview on the
>> state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/
>>
>> Here you can see a graphical representation which Andreas Harth and I
>> have created showing the most commonly occurring classes and the
>> frequency of links between them.
>>
>> Any feedback or questions are welcome.
>>
>> Sheila
>>
>>
>
>
>

Received on Saturday, 28 July 2007 23:16:45 UTC