Re: Size of the Semantic Web was: Semantic Web Ontology Map from Chris Bizer on 2007-07-29 (semantic-web@w3.org from July 2007)

From: Chris Bizer <chris@bizer.de>
Date: Sun, 29 Jul 2007 14:11:01 +0200
To: "Andreas Harth" <andreas.harth@deri.org>
Cc: <gv@btucson.com>, "Kinsella, Sheila" <sheila.kinsella@deri.org>, "tim finin" <finin@cs.umbc.edu>, <lushan1@umbc.edu>, <semantic-web@w3.org>, <juergen@umbrich.net>
Message-ID: <288d01c7d1d9$881d79d0$c4e84d57@named4gc1asnuj>

Hi Andreas,

>
> Hi Chris,
>
> all algorithms operating on the Web are incomplete in practice, since
> you can never assume complete information on the Web.  Apparently
> this is not clear, so I've changed the sentence to "a map of a part
> of the Semantic Web" which should be more precise and address your
> concern.
>
> Joshua already indicated that crawling large database-backed sites
> one URI at a time is awkward.

I have to violently disagree with you on this point.

Look at Google or any other search engine on this planet. They are doing
exactly this, with the only difference that they do not request the RDF
representation of a database record, but its HTML representation. They do
this for millions of database-backed sites and it works fine.

The Semantic Web is just another development step in the overall development 
of the Web. Michael K. Bergman has put this nicely in the "Web in 
Transition" picture in one of his recent blog posts 
(http://www.mkbergman.com/?p=391) where he distinguishes between the 
Document Web, the Structured Web, Linked Data and the Semantic Web.

Therefore I think that the Semantic Web should mirror successful techniques
from the classic document Web. Having hyperlinks between Web documents and
having crawlers follow these hyperlinks is clearly one of the more successful
techniques of the classic Web, and I therefore do not see any reason why it
should not work for the Semantic Web.
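
The link-following idea can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular crawler: the function `crawl`, the callback `fetch_links`, the `max_documents` limit and the example.org URIs are all hypothetical, and an in-memory dictionary stands in for dereferencing URIs over HTTP and parsing the RDF that comes back.

```python
from collections import deque

def crawl(seed_uris, fetch_links, max_documents=1000):
    """Breadth-first crawl: visit each URI once and follow the links it yields."""
    seen = set()
    queue = deque()
    for uri in seed_uris:
        if uri not in seen:
            seen.add(uri)
            queue.append(uri)

    visited = []
    while queue and len(visited) < max_documents:
        uri = queue.popleft()
        visited.append(uri)
        # fetch_links is assumed to return the URIs linked from this document
        for target in fetch_links(uri):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Toy stand-in for the Web: document URI -> URIs it links to (made-up data)
TOY_WEB = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/c"],
    "http://example.org/c": ["http://example.org/a", "http://example.org/d"],
    "http://example.org/d": [],
}

def toy_fetch_links(uri):
    return TOY_WEB.get(uri, [])

order = crawl(["http://example.org/a"], toy_fetch_links)
print(order)
```

Starting from a single seed, the crawl reaches every document that is connected to it by links, which is the property the argument relies on.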

>  It puts a high load on the servers
> publishing data.  Also, complete crawling of db-backed sites takes
> an unacceptable amount of time.  A polite crawler can fetch around
> 9k pages per day (let's say with 10 seconds wait time between
> requests), which means crawling sites such as geonames or uniprot
> serving millions of URIs requires years.

I'm quite happy that the times when the complete Semantic Web fitted on a
memory stick are over, even if this means that people who publish larger
datasets and people who crawl these datasets have to buy proper hardware.

Scalability really should not be the issue when we discuss best practices 
for the Semantic Web. You claimed that your YARS store can handle 8 billion 
triples. Orri from OpenLink is currently working on cluster features for
Virtuoso which will also enable queries over billions of triples. At
WWW2007, the Freebase guys were completely relaxed when I asked them
whether they can store billions of triples.

The average size of a Google database is 3 petabytes today. And they are
currently working on bringing together about 100 of these databases, see
http://video.google.com/videoplay?docid=-2727172597104463277

I think that for the Semantic Web to be relevant for the average user, we
clearly have to aim at such dimensions and should not be scared by 30 million
documents and the time it would take to crawl them. Maybe we should rather
have a look at how Google's robots manage to crawl billions of documents,
including a fair portion that is served by slow servers.
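
To put numbers on the crawl-time estimate quoted above, here is a small back-of-the-envelope sketch in Python. It assumes a crawler that issues one request every 10 seconds around the clock and uses the 30 million documents figure from this thread. The function name `crawl_days` and the `parallel_hosts` parameter are illustrative assumptions; the latter only models the case where the delay is enforced per server and several servers are crawled at the same time.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def crawl_days(num_documents, delay_seconds=10.0, parallel_hosts=1):
    """Days needed to fetch num_documents with a fixed delay between requests.

    parallel_hosts is a hypothetical knob: the number of servers crawled
    concurrently, each with its own politeness delay.
    """
    pages_per_day = SECONDS_PER_DAY / delay_seconds * parallel_hosts
    return num_documents / pages_per_day

# One request every 10 seconds: 8640 pages per day ("around 9k")
print(SECONDS_PER_DAY / 10)

# 30 million documents, strictly one after another: about 3472 days (roughly 9.5 years)
print(round(crawl_days(30_000_000), 1))

# Same documents spread over 100 servers crawled in parallel: about 35 days
print(round(crawl_days(30_000_000, parallel_hosts=100), 1))
```

Under these assumptions the "years" figure holds for a single sequential crawl, while the total time shrinks in proportion to the number of servers that can be crawled side by side.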

> For these reasons, we
> currently follow rdfs:seeAlso links and thus do not yet include
> complete "Linking Open Data" sites in the map.
>
> I believe that crawlers have slightly different sets of requirements
> than visual RDF browsers when it comes to sites that dump huge
> amounts of data to the Web.  This is true for both linked data and
> rdfs:seeAlso based approaches.  The sitemap extension [1] is one
> potential way of helping crawlers operate more efficiently on
> the Semantic Web, but I'm sure there are other solutions to the
> problem as well.

Yes, I think Giovanni's work on the sitemap extension is very important and
can provide a valuable shortcut for crawlers, but I also think that the
classic document Web would not be where it is today if the crawlers were
scared of hitting some data and getting it piece by piece.

Andreas, don't take all of this as personal criticism. I really appreciate
your work on YARS and SWSE and think that with these components you are in
the position to be one of the first guys that could build a proper Semantic
Web search engine.

Cheers

Chris

> Regards,
> Andreas.
>
> [1] http://sw.deri.org/2007/07/sitemapextension/
>
> Chris Bizer wrote:
>>
>> Hi Sheila and all,
>>
>> it is a great idea to try to draw a map of the Semantic Web and to
>> provide people with a place to refer to in order to see the Semantic
>> Web grow.
>> So great idea!
>>
>> But what confuses me a bit is your claim that this is a map of THE
>> Semantic Web as of June 2007.
>>
>> You have got 200 000 RDF files.
>>
>> If you look at Swoogle's statistic page
>> http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8,
>> you see that they have 1.2 million files amounting to 436 million
>> triples.
>>
>> If you look at the Linking Open Data project page
>> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
>> you will see that there are over one billion triples and I would guess
>> that the different servers within the project serve around 30 million
>> RDF documents to the Web.
>>
>> So my guess would be that:
>>
>> - your dataset covers less than 1% of the Semantic Web
>> - Swoogle covers about 4 % of the Semantic Web
>>
>> as of June 2007.
>>
>> So I think it would be important that people who claim to cover the
>> whole Semantic Web would give some details about the crawling
>> algorithms they use to get their datasets so that it is possible to
>> judge the accuracy of their results.
>>
>> The datasources in the Linking Open Data project are all interlinked
>> with RDF links. So it is possible to crawl all 30 million documents by
>> following these links. Good starting points for a crawl are URIs
>> identifying concepts from different domains within DBpedia, as they
>> are interlinked with many other data sets.
>>
>> Some background information about the idea of RDF links and how
>> crawlers can follow these links can be found in
>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/
>>
>> Cheers
>>
>> Chris
>>
>>
>> -- 
>> Chris Bizer
>> Freie Universität Berlin
>> +49 30 838 54057
>> chris@bizer.de
>> www.bizer.de
>> ----- Original Message -----
>> From: "Golda Velez" <w3@webglimpse.org>
>> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
>> Cc: <semantic-web@w3.org>
>> Sent: Saturday, July 28, 2007 5:31 AM
>> Subject: Re: Semantic Web Ontology Map
>>
>>
>>
>> Very cool!  Is there also a text representation of this graph available?
>>
>> Thanks!
>>
>> --Golda
>>
>>>
>>> Dear all,
>>>
>>> For those of you who are interested in seeing what type of RDF data is
>>> available on the web as of now, we provide a current overview on the
>>> state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/
>>>
>>> Here you can see a graphical representation which Andreas Harth and I
>>> have created showing the most commonly occurring classes and the
>>> frequency of links between them.
>>>
>>> Any feedback or questions are welcome.
>>>
>>> Sheila
>>>
>>>
>>
>>
>>
>
Received on Sunday, 29 July 2007 12:12:17 UTC