Re: Size of the Semantic Web was: Semantic Web Ontology Map from Marc Wick on 2007-07-29 (semantic-web@w3.org from July 2007)

From: Marc Wick <marc@geonames.org>
Date: Sun, 29 Jul 2007 17:14:32 +0200
To: Chris Bizer <chris@bizer.de>
CC: Andreas Harth <andreas.harth@deri.org>, gv@btucson.com, "Kinsella, Sheila" <sheila.kinsella@deri.org>, tim finin <finin@cs.umbc.edu>, lushan1@umbc.edu, semantic-web@w3.org, juergen@umbrich.net
Message-ID: <46ACAED8.8060909@geonames.org>
Chris

The problem is not writing and running a SW crawler; the problem is 
being crawled. A provider of a large dataset with millions of URIs 
simply does not want to be crawled by a SW crawler, for lack of computing 
resources. With an HTML crawler like Google's you can live with it, since 
it will bring traffic to your site and your manager will understand 
that you have to buy additional servers to feed the Google crawlers. Not 
so for SW crawlers. As a data provider you don't want to waste resources 
on a SW crawler.

I hope the Semantic Web is not just mirroring an obsolete legacy 
approach like 'crawling'. The crawling approach is dead, even in the 
good old HTML world. Think about all the Ajax-driven websites. How does 
a crawler fetch data from an Ajax-driven site? Well, it does not.
The answer is the sitemap protocol, which allows webmasters to 
communicate with the crawler. The crawler no longer has to blindly 
follow links; instead, the webmaster can tell the crawler which 
resources are available, how often each resource is expected to change, 
the timestamp of the last modification of each resource and so on.
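
To make the sitemap idea concrete, here is a minimal sketch of how a 
crawler might use the last-modification timestamps to skip resources 
that have not changed since its previous visit. The example.org URLs, 
the dates and the function name changed_since are made up for 
illustration, and the sketch assumes date-only <lastmod> values; it is 
not how any particular crawler works.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; the URLs and dates are invented for this example.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/resource/1</loc>
    <lastmod>2007-07-20</lastmod>
    <changefreq>monthly</changefreq>
  </url>
  <url>
    <loc>http://example.org/resource/2</loc>
    <lastmod>2007-05-02</lastmod>
    <changefreq>yearly</changefreq>
  </url>
</urlset>
"""


def changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose <lastmod> date is later than last_crawl."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        if loc is None:
            continue  # malformed entry, nothing to fetch
        # Assumption: lastmod starts with YYYY-MM-DD; any time part is ignored.
        # An entry without lastmod cannot be skipped safely, so refetch it.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) > last_crawl:
            changed.append(loc.strip())
    return changed


if __name__ == "__main__":
    # Only resources modified after the previous crawl need to be fetched.
    print(changed_since(EXAMPLE_SITEMAP, date(2007, 7, 1)))
```

With a previous crawl on 2007-07-01, only the first resource would be 
fetched again; the data provider serves one request instead of two.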

I believe the Semantic Web should at least be based on the sitemap 
protocol, and I hope the Semantic Web will find even better solutions. 
Let's not forget that improvements are really difficult in the HTML 
world because of all the legacy applications that need to be supported. 
Not so in the Semantic Web: there aren't any legacy applications and 
there is room for real innovation. Let's not waste this opportunity.

Cheers

Marc

Chris Bizer wrote:
>
> Hi Andreas,
>
>>
>> Hi Chris,
>>
>> all algorithms operating on the Web are incomplete in practice, since
>> you can never assume complete information on the Web.  Apparently
>> this is not clear, so I've changed the sentence to "a map of a part
>> of the Semantic Web" which should be more precise and address your
>> concern.
>>
>> Joshua already indicated that crawling large database-backed sites
>> one URI at a time is awkward.
>
> I have to violently disagree with you on this point.
>
> Look at Google or any other search engine on this planet. They are 
> doing exactly this, with the only difference that they do not request 
> the RDF representation of a database record, but its HTML 
> representation. They do this for millions of database-backed sites and 
> it works fine.
>
> The Semantic Web is just another development step in the overall 
> development of the Web. Michael K. Bergman has put this nicely in the 
> "Web in Transition" picture in one of his recent blog posts 
> (http://www.mkbergman.com/?p=391) where he distinguishes between the 
> Document Web, the Structured Web, Linked Data and the Semantic Web.
>
> Therefore I think that the Semantic Web should mirror successful 
> techniques from the classic document Web. Having hyperlinks between 
> Web documents and having crawlers follow these hyperlinks is clearly 
> one of the more successful techniques of the classic Web and I 
> therefore do not see any reason why it should not work for the 
> Semantic Web.
>
>>  It puts a high load on the servers
>> publishing data.  Also, complete crawling of db-backed sites takes
>> an unacceptable amount of time.  A polite crawler can fetch around
>> 9k pages per day (let's say with 10 seconds wait time between
>> requests), which means crawling sites such as geonames or uniprot
>> serving millions of URIs requires years.
>
> I'm quite happy that the times when the complete Semantic Web fitted 
> on a memory stick are over,
> even if this means that people who publish larger datasets and people 
> who crawl these datasets have to buy proper hardware.
>
> Scalability really should not be the issue when we discuss best 
> practices for the Semantic Web. You claimed that your YARS store can 
> handle 8 billion triples. Orri from Openlink is currently working on 
> cluster features for Virtuoso which will also enable queries over 
> billions of triples. At WWW2007, the Freebase guys were completely 
> relaxed when I asked them whether they can store billions of triples.
>
> The average size of a Google database is 3 petabytes today. And they are 
> currently working on bringing together about 100 of these databases, see
> http://video.google.com/videoplay?docid=-2727172597104463277
>
> I think that for the Semantic Web to be relevant for the average user, 
> we clearly have to aim at such dimensions and should not be scared by 30 
> million documents and the time it would take to crawl them. Maybe we 
> should have a closer look at how Google's robots manage to crawl 
> billions of documents, including a fair portion that is served by 
> slow servers.
>
>> For these reasons, we
>> currently follow rdfs:seeAlso links and thus do not yet include
>> complete "Linking Open Data" sites in the map.
>>
>> I believe that crawlers have slightly different sets of requirements
>> than visual RDF browsers when it comes to sites that dump huge
>> amounts of data to the Web.  This is true for both linked data and
>> rdfs:seeAlso based approaches.  The sitemap extension [1] is one
>> potential way of helping crawlers operate more efficiently on
>> the Semantic Web, but I'm sure there are other solutions to the
>> problem as well.
>
> Yes, I think Giovani's work on the sitemap extension is very 
> important and can provide a valuable shortcut for crawlers, but I also 
> think that the classic document Web would not be where it is today if 
> the crawlers were scared of hitting some data and getting it piece by 
> piece.
>
> Andreas, don't take all of this as personal criticism. I really 
> appreciate your work on YARS and SWSE and think that with these 
> components you are in a position to be one of the first guys who could 
> build a proper Semantic Web search engine.
>
> Cheers
>
> Chris
>
>> Regards,
>> Andreas.
>>
>> [1] http://sw.deri.org/2007/07/sitemapextension/
>>
>> Chris Bizer wrote:
>>>
>>> Hi Sheila and all,
>>>
>>> it is a great idea to try to draw a map of the Semantic Web and to
>>> provide people with a place to refer to in order to see the Semantic
>>> Web grow.
>>> So great idea!
>>>
>>> But what confuses me a bit is your claim that this is a map of THE
>>> Semantic Web as of June 2007.
>>>
>>> You have got 200 000 RDF files.
>>>
>>> If you look at Swoogle's statistics page
>>> http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8,
>>> you see that they have 1.2 million files amounting to 436 million
>>> triples.
>>>
>>> If you look at the Linking Open Data project page
>>> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData 
>>>
>>> you will see that there are over one billion triples and I would guess
>>> that the different servers within the project serve around 30 million
>>> RDF documents to the Web.
>>>
>>> So my guess would be that:
>>>
>>> - your dataset covers less than 1% of the Semantic Web
>>> - Swoogle covers about 4% of the Semantic Web
>>>
>>> as of June 2007.
>>>
>>> So I think it would be important that people who claim to cover the
>>> whole Semantic Web would give some details about the crawling
>>> algorithms they use to get their datasets so that it is possible to
>>> judge the accuracy of their results.
>>>
>>> The datasources in the Linking Open Data project are all interlinked
>>> with RDF links. So it is possible to crawl all 30 million documents by
>>> following these links. Good starting points for a crawl are URIs
>>> identifying concepts from different domains within DBpedia, as they
>>> are interlinked with many other data sets.
>>>
>>> Some background information about the idea of RDF links and how
>>> crawlers can follow these links can be found at
>>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/
>>>
>>> Cheers
>>>
>>> Chris
>>>
>>>
>>> -- 
>>> Chris Bizer
>>> Freie Universität Berlin
>>> +49 30 838 54057
>>> chris@bizer.de
>>> www.bizer.de
>>> ----- Original Message ----- From: "Golda Velez" <w3@webglimpse.org>
>>> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
>>> Cc: <semantic-web@w3.org>
>>> Sent: Saturday, July 28, 2007 5:31 AM
>>> Subject: Re: Semantic Web Ontology Map
>>>
>>>
>>>
>>> Very cool!  Is there also a text representation of this graph 
>>> available?
>>>
>>> Thanks!
>>>
>>> --Golda
>>>
>>>>
>>>> Dear all,
>>>>
>>>> For those of you who are interested in seeing what type of RDF data is
>>>> available on the web as of now, we provide a current overview on the
>>>> state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/
>>>>
>>>> Here you can see a graphical representation which Andreas Harth and I
>>>> have created showing the most commonly occurring classes and the
>>>> frequency of links between them.
>>>>
>>>> Any feedback or questions are welcome.
>>>>
>>>> Sheila
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
Received on Sunday, 29 July 2007 15:16:10 UTC