Re: Size of the Semantic Web was: Semantic Web Ontology Map

Hi Marc,


> Chris
>
> The problem is not writing and running a SW crawler, the problem is being 
> crawled. A provider of a large dataset with millions of URIs simply does 
> not want to be crawled by a SW crawler for lack of computing resources. 
> For an HTML crawler like Google's you can live with it, since it will bring 
> traffic to your site and your manager will understand that you have to buy 
> additional servers to feed the Google crawlers. Not so for SW crawlers. As 
> a data provider you don't want to waste resources on a SW crawler.

I think we should be a bit more optimistic about the Semantic Web.

Once there are proper search engines for Semantic Web content with user 
interfaces that the average surfer can understand, a wider audience will 
start to use Semantic Web content, and your manager will immediately 
understand why he should invest some money in additional servers.

Look at Google Base as a motivating example. Google provides a set of proper 
user interfaces, and because the data is actually being used, companies invest 
significant amounts of effort (and money) in representing their data in the 
format that Google likes and in uploading it into Google Base. The same could 
happen for Semantic Web content once our search engines have better user 
interfaces.

Currently I see the following candidates for becoming the first Semantic Web 
search engine that "my grandma could understand", as Frederick Giasson would 
put it:

- Zitgist (http://www.zitgist.com/)
- Swoogle (http://swoogle.umbc.edu/)
- SWSE (http://swse.org/)
- Watson (http://kmi-web05.open.ac.uk/WatsonWUI/)

An additional argument for your manager: if you upload your data into Google 
Base, only a single search engine can use your data. If you publish it on 
the Semantic Web, all search engines can use it. A further benefit: once 
there is enough interesting interlinked data on the Semantic Web, Google 
will also start to crawl this data. They usually do, if there is enough 
content and they think that the content could be of value to them or their 
users. Look at RSS/Atom for an example of this.

> I hope the Semantic Web is not just mirroring an obsolete legacy approach 
> like 'crawling'. The crawling approach is dead, even in the good old HTML 
> world. Think about all the AJAX-driven websites. How does a crawler fetch 
> data from an AJAX-driven site? Well, it does not.
> The answer is the sitemap protocol, which allows web masters to communicate 
> with the crawler. The crawler no longer has to blindly follow links; the 
> web master can tell the crawler instead which resources are available, how 
> often each resource is expected to change, the timestamp of the last 
> modification of each resource, and so on.
>
> I believe the Semantic Web should at least be based on the sitemap protocol, 
> and I hope the Semantic Web will even find better solutions. Let's not 
> forget that improvements are really difficult in the HTML world because of 
> all the legacy applications that need to be supported. Not so on the 
> Semantic Web: there aren't any legacy applications and there is room for 
> real innovation. Let's not waste this opportunity.

I completely agree with you on that, with the minor difference that I see 
things like the sitemap protocol only as additional hints for the crawler, 
not as the exclusive discovery mechanism. And I don't think I'm alone with 
this opinion. http://www.sitemaps.org/ itself says: "Web crawlers usually 
discover pages from links within the site and from other sites. Sitemaps 
supplement this data to allow crawlers that support Sitemaps to pick up all 
URLs in the Sitemap and learn about those URLs using the associated 
metadata."
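
To make the "additional hints, not exclusive discovery" point a bit more 
concrete, here is a minimal sketch of how a crawler could combine both 
mechanisms. This is only my own illustration, not code from any existing 
crawler and not the sitemap extension format itself: it assumes a plain 
sitemaps.org sitemap for seeding and uses rdflib to dereference URIs and 
follow rdfs:seeAlso links (the 10-second delay matches the politeness figure 
Andreas mentions further down).

import time
import urllib.request
import xml.etree.ElementTree as ET

import rdflib
from rdflib.namespace import RDFS

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def seeds_from_sitemap(sitemap_url):
    # Read the <loc> entries from a standard sitemaps.org sitemap.
    with urllib.request.urlopen(sitemap_url) as response:
        tree = ET.parse(response)
    return [loc.text.strip() for loc in tree.iter(SITEMAP_NS + "loc")]

def crawl(sitemap_url, max_docs=100, delay=10):
    # The sitemap only seeds the frontier; link following still happens below.
    frontier = seeds_from_sitemap(sitemap_url)
    seen = set()
    graph = rdflib.Graph()
    while frontier and len(seen) < max_docs:
        uri = frontier.pop(0)
        if uri in seen:
            continue
        seen.add(uri)
        try:
            graph.parse(uri)          # dereference the URI and parse the RDF
        except Exception:
            continue                  # skip documents that fail to fetch/parse
        # Classic link-based discovery: follow rdfs:seeAlso links as well.
        for obj in graph.objects(None, RDFS.seeAlso):
            if str(obj) not in seen:
                frontier.append(str(obj))
        time.sleep(delay)             # be polite to the data provider
    return graph

Dropping the sitemap seeding and starting from a single well-linked URI, e.g. 
a DBpedia resource, turns this into a pure link-following crawler; keeping 
both gives the crawler the hints without giving up link discovery.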

I think more thought on how Linked Data crawling, RDF dumps and SPARQL 
endpoints should play together is urgently needed. Therefore I think 
Giovanni's work on this 
(http://docs.google.com/View?docid=ajch7tkjqjwz_23g8dfzc) is important, and I 
have also repeatedly proposed making this an additional topic at Eric's RDF 
Access to Relational Databases W3C workshop 
(http://www.w3.org/2007/03/RdfRDB/).
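
Just to illustrate the kind of interplay I mean, a data consumer could pick 
the access mechanism roughly like this. This is purely my own sketch, not 
Giovanni's proposal; the dictionary keys describing a site are hypothetical.

def choose_access_method(site):
    # site: dict with optional 'dump_url', 'sparql_endpoint', 'example_uri' keys.
    if site.get("dump_url"):
        # A published RDF dump is the cheapest option for both sides when
        # the consumer needs the whole dataset.
        return ("dump", site["dump_url"])
    if site.get("sparql_endpoint"):
        # A SPARQL endpoint avoids fetching millions of documents when only
        # a slice of the data is needed.
        return ("sparql", site["sparql_endpoint"])
    # Otherwise fall back to polite Linked Data crawling, as sketched above.
    return ("crawl", site.get("example_uri"))

# Example with a hypothetical site description:
print(choose_access_method({
    "sparql_endpoint": "http://dbpedia.org/sparql",
    "example_uri": "http://dbpedia.org/resource/Berlin",
}))

The point is simply that a dump, an endpoint and dereferenceable URIs answer 
different needs, and a publisher who offers all three lets the consumer pick 
the one that puts the least load on the server.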

Cheers

Chris


> Cheers
>
> Marc
>
> Chris Bizer wrote:
>>
>> Hi Andreas,
>>
>>>
>>> Hi Chris,
>>>
>>> all algorithms operating on the Web are incomplete in practice, since
>>> you can never assume complete information on the Web.  Apparently
>>> this is not clear, so I've changed the sentence to "a map of a part
>>> of the Semantic Web" which should be more precise and address your
>>> concern.
>>>
>>> Joshua already indicated that crawling large database-backed sites
>>> one URI at a time is awkward.
>>
>> I have to violently disagree with you on this point.
>>
>> Look at Google or any other search engine on this planet. They are 
>> doing exactly this, with the only difference that they do not request 
>> the RDF representation of a database record, but its HTML 
>> representation. They do this for millions of database-backed sites and it 
>> works fine.
>>
>> The Semantic Web is just another step in the overall development of the 
>> Web. Michael K. Bergman has put this nicely in the "Web in Transition" 
>> picture in one of his recent blog posts 
>> (http://www.mkbergman.com/?p=391), where he distinguishes between the 
>> Document Web, the Structured Web, Linked Data and the Semantic Web.
>>
>> Therefore I think that the Semantic Web should mirror successful 
>> techniques from the classic document Web. Having hyperlinks between Web 
>> documents and having crawlers follow these hyperlinks is clearly one of 
>> the more successful techniques of the classic Web, and I therefore do not 
>> see any reason why it should not work for the Semantic Web.
>>
>>>  It puts a high load on the servers
>>> publishing data.  Also, complete crawling of db-backed sites takes
>>> an unacceptable amount of time.  A polite crawler can fetch around
>>> 9k pages per day (let's say with 10 seconds wait time between
>>> requests), which means crawling sites such as geonames or uniprot
>>> serving millions of URIs requires years.
>>
>> I'm quite happy that the times when the complete Semantic Web fitted on 
>> a memory stick are over, even if this means that people who publish 
>> larger datasets and people who crawl these datasets have to buy proper 
>> hardware.
>>
>> Scalability really should not be the issue when we discuss best practices 
>> for the Semantic Web. You claimed that your YARS store can handle 8 
>> billion triples. Orri from Openlink is currently working on cluster 
>> features for Virtuoso, which will also enable queries over billions of 
>> triples. At WWW2007, the Freebase guys were completely relaxed when I 
>> asked them whether they could store billions of triples.
>>
>> The average size of a Google database is 3 petabytes today, and they are 
>> currently working on bringing together about 100 of these databases, see
>> http://video.google.com/videoplay?docid=-2727172597104463277
>>
>> I think that for the Semantic Web to be relevant for the average user, we 
>> clearly have to aim at such dimensions and should not be scared by 30 
>> million documents and the time it would take to crawl them. Maybe we 
>> should rather have a look at how Google's robots manage to crawl billions 
>> of documents, including a fair portion that is served by slow servers.
>>
>>> For these reasons, we
>>> currently follow rdfs:seeAlso links and thus do not yet include
>>> complete "Linking Open Data" sites in the map.
>>>
>>> I believe that crawlers have slightly different sets of requirements
>>> than visual RDF browsers when it comes to sites that dump huge
>>> amounts of data to the Web.  This is true for both linked data and
>>> rdfs:seeAlso based approaches.  The sitemap extension [1] is one
>>> potential way of helping crawlers operate more efficiently on
>>> the Semantic Web, but I'm sure there are other solutions to the
>>> problem as well.
>>
>> Yes, I think Giovanni's work on the sitemap extension is very important 
>> and can provide a valuable shortcut for crawlers, but I also think that 
>> the classic document Web would not be where it is today if the crawlers 
>> were scared of hitting some data and getting it piece by piece.
>>
>> Andreas, don't take all of this as personal criticism. I really 
>> appreciate your work on YARS and SWSE and think that with these 
>> components you are in a position to be one of the first who could 
>> build a proper Semantic Web search engine.
>>
>> Cheers
>>
>> Chris
>>
>>> Regards,
>>> Andreas.
>>>
>>> [1] http://sw.deri.org/2007/07/sitemapextension/
>>>
>>> Chris Bizer wrote:
>>>>
>>>> Hi Sheila and all,
>>>>
>>>> it is a great idea to try to draw a map of the Semantic Web and to
>>>> provide people with a place to refer to in order to see the Semantic
>>>> Web grow.
>>>> So, great idea!
>>>>
>>>> But what confuses me a bit is your claim that this is a map of THE
>>>> Semantic Web as of June 2007.
>>>>
>>>> You have got 200 000 RDF files.
>>>>
>>>> If you look at Swoogle's statistic page
>>>> http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8,
>>>> you see that they have 1.2 million files amounting to 436 million
>>>> triples.
>>>>
>>>> If you look at the Linking Open Data project page
>>>> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
>>>> you will see that there are over one billion triples and I would guess
>>>> that the different servers within the project serve around 30 million
>>>> RDF documents to the Web.
>>>>
>>>> So my guess would be that:
>>>>
>>>> - your dataset covers less than 1% of the Semantic Web
>>>> - Swoogle covers about 4% of the Semantic Web
>>>>
>>>> as of June 2007.
>>>>
>>>> So I think it is important that people who claim to cover the
>>>> whole Semantic Web give some details about the crawling algorithms
>>>> they use to get their datasets, so that it is possible to judge the
>>>> accuracy of their results.
>>>>
>>>> The data sources in the Linking Open Data project are all interlinked
>>>> with RDF links, so it is possible to crawl all 30 million documents by
>>>> following these links. Good starting points for a crawl are URIs
>>>> identifying concepts from different domains within DBpedia, as they
>>>> are interlinked with many other data sets.
>>>>
>>>> Some background information about the idea of RDF links and how
>>>> crawlers can follow these links can be found at
>>>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/
>>>>
>>>> Cheers
>>>>
>>>> Chris
>>>>
>>>>
>>>> -- 
>>>> Chris Bizer
>>>> Freie Universität Berlin
>>>> +49 30 838 54057
>>>> chris@bizer.de
>>>> www.bizer.de
>>>> ----- Original Message ----- From: "Golda Velez" <w3@webglimpse.org>
>>>> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
>>>> Cc: <semantic-web@w3.org>
>>>> Sent: Saturday, July 28, 2007 5:31 AM
>>>> Subject: Re: Semantic Web Ontology Map
>>>>
>>>>
>>>>
>>>> Very cool!  Is there also a text representation of this graph 
>>>> available?
>>>>
>>>> Thanks!
>>>>
>>>> --Golda
>>>>
>>>>>
>>>>> Dear all,
>>>>>
>>>>> For those of you who are interested in seeing what type of RDF data is
>>>>> available on the web as of now, we provide a current overview of the
>>>>> state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/
>>>>>
>>>>> Here you can see a graphical representation which Andreas Harth and I
>>>>> have created showing the most commonly occurring classes and the
>>>>> frequency of links between them.
>>>>>
>>>>> Any feedback or questions are welcome.
>>>>>
>>>>> Sheila
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>> 

Received on Sunday, 29 July 2007 16:21:13 UTC