Re: Size of the Semantic Web was: Semantic Web Ontology Map from Andreas Harth on 2007-07-30 (semantic-web@w3.org from July 2007)

From: Andreas Harth <andreas.harth@deri.org>
Date: Mon, 30 Jul 2007 16:23:07 +0100
To: Chris Bizer <chris@bizer.de>
CC: gv@btucson.com, "Kinsella, Sheila" <sheila.kinsella@deri.org>, tim finin <finin@cs.umbc.edu>, lushan1@umbc.edu, semantic-web@w3.org, juergen@umbrich.net
Message-ID: <46AE025B.8070504@deri.org>
Hi Chris,

let's assume I want to include the UniProt dataset in my index.

I have the choice between
- millions of lookups and downloading individual pages in a month causing
  high server load
- 12 lookups and downloading the dumps in a day with minimal server load

I'd go for the latter and save myself and the data providers a lot
of bandwidth, CPU time, and headaches.
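
As a rough back-of-envelope check of the lookup option, here is a minimal sketch. It assumes the 10-second politeness delay and the 30 million document estimate that appear in the quoted discussion below; the function names are only illustrative.

```python
# Back-of-envelope estimate: how long does URI-by-URI crawling of one
# site take if the crawler waits a fixed delay between requests?
# The delay and document count are taken from the quoted discussion;
# everything else here is an illustrative assumption.

SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_day(delay_seconds: int) -> int:
    """Pages a single polite crawler can fetch from one host per day."""
    return SECONDS_PER_DAY // delay_seconds


def crawl_days(num_uris: int, delay_seconds: int) -> float:
    """Days needed to fetch num_uris from one host, one request at a time."""
    return num_uris / pages_per_day(delay_seconds)


if __name__ == "__main__":
    per_day = pages_per_day(10)          # 8640, i.e. roughly 9k pages per day
    days = crawl_days(30_000_000, 10)
    print(f"{per_day} pages/day, {days:.0f} days, {days / 365:.1f} years")
```

The estimate ignores download time and parallel crawling across different hosts, so it only bounds the per-site case.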

From running a crawler for quite some time, I've learned that there
is considerable manual effort and fine-tuning involved.  In fact, Web
search engines provide quite extensive configuration parameters
for their crawlers, see [1] or [2].  I wouldn't be surprised if the big
commercial sites have a dedicated person employed to hand-hold
googlebot and co.

So, I too would like to see established crawling methods such as
robots.txt and sitemap.xml adopted for the Semantic Web.  By
adopted I mean a slight extension to be able to batch-download
data from large sites.  And otherwise doing URI by URI lookups,
possibly cutting down the number of URIs per site to have
the crawler terminate in a decent amount of time.  That's all.
I don't see any disagreement here.
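
A minimal sketch of that policy, under stated assumptions: the sitemap is already downloaded as a string, dump locations are advertised in an element whose local name is `dataDumpLocation` (my reading of the sitemap extension draft, not checked against a final schema), and `plan_fetches` and `max_uris_per_site` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on element tags."""
    return tag.rsplit("}", 1)[-1]


def texts_of(sitemap_xml: str, name: str) -> list:
    """Collect the text of every element with the given local name."""
    root = ET.fromstring(sitemap_xml)
    return [el.text.strip() for el in root.iter()
            if local_name(el.tag) == name and el.text and el.text.strip()]


def plan_fetches(sitemap_xml: str, max_uris_per_site: int = 1000):
    """Prefer batch dumps; otherwise fall back to a capped list of URIs."""
    dumps = texts_of(sitemap_xml, "dataDumpLocation")  # assumed element name
    if dumps:
        return ("dump", dumps)
    # No dump advertised: look up URI by URI, but cap the number per site
    # so the crawl terminates in a decent amount of time.
    return ("lookup", texts_of(sitemap_xml, "loc")[:max_uris_per_site])


# Hypothetical sitemaps for illustration; URLs and namespaces are made up
# except for the standard sitemaps.org namespace.
WITH_DUMP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://example.org/sitemapextension#">
  <sc:dataset>
    <sc:dataDumpLocation>http://example.org/dump.rdf.gz</sc:dataDumpLocation>
  </sc:dataset>
</urlset>"""

WITHOUT_DUMP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/a</loc></url>
  <url><loc>http://example.org/b</loc></url>
  <url><loc>http://example.org/c</loc></url>
</urlset>"""

if __name__ == "__main__":
    print(plan_fetches(WITH_DUMP))
    print(plan_fetches(WITHOUT_DUMP, max_uris_per_site=2))
```

A real crawler would still consult robots.txt first and honour any crawl delay before fetching either the dumps or the individual URIs.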

Regards,
Andreas.

[1] http://siteexplorer.search.yahoo.com/
[2] http://www.google.com/webmasters/

Chris Bizer wrote:
> Hi Andreas,
>
>>
>> Hi Chris,
>>
>> all algorithms operating on the Web are incomplete in practice, since
>> you can never assume complete information on the Web.  Apparently
>> this is not clear, so I've changed the sentence to "a map of a part
>> of the Semantic Web" which should be more precise and address your
>> concern.
>>
>> Joshua already indicated that crawling large database-backed sites
>> one URI at a time is awkward.
>
> I have to violently disagree with you on this point.
>
> Look at Google or any other search engine on this planet. They are
> doing exactly this, with the only difference that they do not request
> the RDF representation of a database record, but its HTML
> representation. They do this for millions of database-backed sites and
> it works fine.
>
> The Semantic Web is just another development step in the overall
> development of the Web. Michael K. Bergman has put this nicely in the
> "Web in Transition" picture in one of his recent blog posts
> (http://www.mkbergman.com/?p=391) where he distinguishes between the
> Document Web, the Structured Web, Linked Data and the Semantic Web.
>
> Therefore I think that the Semantic Web should mirror successful
> techniques from the classic document Web. Having hyperlinks between
> Web documents and having crawlers follow these hyperlinks is clearly
> one of the more successful techniques of the classic Web and I
> therefore do not see any reason why it should not work for the
> Semantic Web.
>
>>  It puts a high load on the servers
>> publishing data.  Also, complete crawling of db-backed sites takes
>> an unacceptable amount of time.  A polite crawler can fetch around
>> 9k pages per day (let's say with 10 seconds wait time between
>> requests), which means crawling sites such as geonames or uniprot
>> serving millions of URIs requires years.
>
> I'm quite happy that the times when the complete Semantic Web fitted
> on a memory stick are over,
> even if this means that people who publish larger datasets and people
> who crawl these datasets have to buy proper hardware.
>
> Scalability really should not be the issue when we discuss best
> practices for the Semantic Web. You claimed that your YARS store can
> handle 8 billion triples. Orri from Openlink is currently working on
> cluster features for Virtuoso which will also enable queries over
> billions of triples. At WWW2007, the Freebase guys were completely
> relaxed when I asked them whether they can store billions of triples.
>
> The average size of a Google database is 3 Petabyte today. And they are
> currently working on bringing together about 100 of these databases, see
> http://video.google.com/videoplay?docid=-2727172597104463277
>
> I think that for the Semantic Web to be relevant for the average user,
> we clearly have to aim at such dimensions and should not be scared by 30
> million documents and the time it would take to crawl them. Maybe we
> should rather have a look at how Google's robots manage to crawl
> billions of documents, including a fair portion that is also served by
> slow servers.
>
>> For these reasons, we
>> currently follow rdfs:seeAlso links and thus do not yet include
>> complete "Linking Open Data" sites in the map.
>>
>> I believe that crawlers have slightly different sets of requirements
>> than visual RDF browsers when it comes to sites that dump huge
>> amounts of data to the Web.  This is true for both linked data and
>> rdfs:seeAlso based approaches.  The sitemap extension [1] is one
>> potential way of helping crawlers operate more efficiently on
>> the Semantic Web, but I'm sure there are other solutions to the
>> problem as well.
>
> Yes, I think Giovanni's work on the sitemap extension is very
> important and can provide a valuable shortcut for crawlers, but I also
> think that the classic document Web would not be where it is today if
> the crawlers were scared of hitting some data and getting it piece by
> piece.
>
> Andreas, don't take all of this as personal criticism. I really
> appreciate your work on YARS and SWSE and think that with these
> components you are in the position to be one of the first guys that
> could build a proper Semantic Web search engine.
>
> Cheers
>
> Chris
>
>> Regards,
>> Andreas.
>>
>> [1] http://sw.deri.org/2007/07/sitemapextension/
>>
>> Chris Bizer wrote:
>>>
>>> Hi Sheila and all,
>>>
>>> it is a great idea to try to draw a map of the Semantic Web and to
>>> provide people with a place to refer to in order to see the Semantic
>>> Web grow.
>>> So great idea!
>>>
>>> But what confuses me a bit is your claim that this is a map of THE
>>> Semantic Web as of June 2007.
>>>
>>> You have got 200 000 RDF files.
>>>
>>> If you look at Swoogle's statistic page
>>> http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8,
>>> you see that they have 1.2 million files amounting to 436 million
>>> triples.
>>>
>>> If you look at the Linking Open Data project page
>>> http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
>>>
>>> you will see that there are over one billion triples and I would guess
>>> that the different servers within the project serve around 30 million
>>> RDF documents to the Web.
>>>
>>> So my guess would be that:
>>>
>>> - your dataset covers less than 1% of the Semantic Web
>>> - Swoogle covers about 4 % of the Semantic Web
>>>
>>> as of June 2007.
>>>
>>> So I think it would be important that people who claim to cover the
>>> whole Semantic Web would give some details about the crawling
>>> algorithms they use to get their datasets so that it is possible to
>>> judge the accuracy of their results.
>>>
>>> The datasources in the Linking Open Data project are all interlinked
>>> with RDF links. So it is possible to crawl all 30 million documents by
>>> following these links. Good starting points for a crawl are URIs
>>> identifying concepts from different domains within DBpedia, as they
>>> are interlinked with many other data sets.
>>>
>>> Some background information about the idea of RDF links and how
>>> crawlers can follow these links can be found in
>>> http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/
>>>
>>> Cheers
>>>
>>> Chris
>>>
>>>
>>> -- 
>>> Chris Bizer
>>> Freie Universität Berlin
>>> +49 30 838 54057
>>> chris@bizer.de
>>> www.bizer.de
>>> ----- Original Message ----- From: "Golda Velez" <w3@webglimpse.org>
>>> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
>>> Cc: <semantic-web@w3.org>
>>> Sent: Saturday, July 28, 2007 5:31 AM
>>> Subject: Re: Semantic Web Ontology Map
>>>
>>>
>>>
>>> Very cool!  Is there also a text representation of this graph
>>> available?
>>>
>>> Thanks!
>>>
>>> --Golda
>>>
>>>>
>>>> Dear all,
>>>>
>>>> For those of you who are interested in seeing what type of RDF data is
>>>> available on the web as of now, we provide a current overview on the
>>>> state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/
>>>>
>>>> Here you can see a graphical representation which Andreas Harth and I
>>>> have created showing the most commonly occurring classes and the
>>>> frequency of links between them.
>>>>
>>>> Any feedback or questions are welcome.
>>>>
>>>> Sheila
>>>>
>>>>
>>>
>>>
>>>
>>
Received on Monday, 30 July 2007 15:23:44 UTC