- From: Andreas Harth <andreas.harth@deri.org>
- Date: Mon, 30 Jul 2007 16:23:07 +0100
- To: Chris Bizer <chris@bizer.de>
- CC: gv@btucson.com, "Kinsella, Sheila" <sheila.kinsella@deri.org>, tim finin <finin@cs.umbc.edu>, lushan1@umbc.edu, semantic-web@w3.org, juergen@umbrich.net
Hi Chris,

let's assume I want to include the UniProt dataset in my index. I have the choice between

- millions of lookups, downloading individual pages over the course of a month and causing high server load
- 12 lookups, downloading the dumps in a day with minimal server load

I'd go for the latter and save myself and the data providers a lot of bandwidth, CPU time, and headaches.

From running a crawler for quite some time, I've learned that there is considerable manual effort and fine-tuning involved. In fact, Web search engines provide quite extensive configuration parameters for their crawlers, see [1] or [2]. I wouldn't be surprised if the big commercial sites have a dedicated person employed to hand-hold Googlebot and co.

So, I too would like to see established crawling methods such as robots.txt and sitemap.xml adopted for the Semantic Web. By adopted I mean adopted with a slight extension, so that crawlers can batch-download data from large sites, and otherwise do URI-by-URI lookups, possibly cutting down the number of URIs per site so that the crawler terminates in a decent amount of time.

That's all. I don't see any disagreement here.

Regards,
Andreas.

[1] http://siteexplorer.search.yahoo.com/
[2] http://www.google.com/webmasters/
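As a rough illustration of the trade-off described above, here is a small back-of-the-envelope sketch in Python. The 10-second politeness delay matches the "9k pages per day" figure quoted further down in this thread; the five-million-URI site size is an assumed stand-in for a large database-backed dataset such as Geonames or UniProt, not a measured number.

    # Back-of-the-envelope numbers behind the choice above.
    POLITE_DELAY_SECONDS = 10
    fetches_per_day = 24 * 60 * 60 // POLITE_DELAY_SECONDS
    print(fetches_per_day)              # 8640 -- roughly the "9k pages per day"

    uris_on_site = 5_000_000            # assumed size of one large db-backed site
    days_to_crawl = uris_on_site / fetches_per_day
    print(round(days_to_crawl))         # ~579 days, i.e. well over a year per site

    dump_downloads = 12                 # the UniProt dumps: done within a day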
Chris Bizer wrote:
> Hi Andreas,

>> Hi Chris,

>> all algorithms operating on the Web are incomplete in practice, since you can never assume complete information on the Web. Apparently this is not clear, so I've changed the sentence to "a map of a part of the Semantic Web", which should be more precise and address your concern.

>> Joshua already indicated that crawling large database-backed sites one URI at a time is awkward.

> I have to violently disagree with you on this point.

> Look at Google or any other search engine on this planet. They are doing exactly this, with the only difference that they do not request the RDF representation of a database record, but its HTML representation. They do this for millions of database-backed sites and it works fine.

> The Semantic Web is just another development step in the overall development of the Web. Michael K. Bergman has put this nicely in the "Web in Transition" picture in one of his recent blog posts (http://www.mkbergman.com/?p=391), where he distinguishes between the Document Web, the Structured Web, Linked Data and the Semantic Web.

> Therefore I think that the Semantic Web should mirror successful techniques from the classic document Web. Having hyperlinks between Web documents and having crawlers follow these hyperlinks is clearly one of the more successful techniques of the classic Web, and I therefore do not see any reason why it should not work for the Semantic Web.

>> It puts a high load on the servers publishing data. Also, complete crawling of db-backed sites takes an unacceptable amount of time. A polite crawler can fetch around 9k pages per day (let's say with 10 seconds wait time between requests), which means crawling sites such as Geonames or UniProt serving millions of URIs requires years.

> I'm quite happy that the times when the complete Semantic Web fitted on a memory stick are over, even if this means that people who publish larger datasets and people who crawl these datasets have to buy proper hardware.

> Scalability really should not be the issue when we discuss best practices for the Semantic Web. You claimed that your YARS store can handle 8 billion triples. Orri from OpenLink is currently working on cluster features for Virtuoso which will also enable queries over billions of triples. At WWW2007, the Freebase guys were completely relaxed when I asked them whether they can store billions of triples.

> The average size of a Google database is 3 petabytes today, and they are currently working on bringing together about 100 of these databases, see http://video.google.com/videoplay?docid=-2727172597104463277

> I think that for the Semantic Web to be relevant for the average user, we clearly have to aim at such dimensions and should not be scared by 30 million documents and the time it would take to crawl them. Maybe we should rather have a look at how Google's robots manage to crawl billions of documents, including a fair portion that is served by slow servers.

>> For these reasons, we currently follow rdfs:seeAlso links and thus do not yet include complete "Linking Open Data" sites in the map.

>> I believe that crawlers have slightly different sets of requirements than visual RDF browsers when it comes to sites that dump huge amounts of data to the Web. This is true for both linked data and rdfs:seeAlso based approaches. The sitemap extension [1] is one potential way of helping crawlers operate more efficiently on the Semantic Web, but I'm sure there are other solutions to the problem as well.

> Yes, I think Giovanni's work on the sitemap extension is very important and can provide a valuable shortcut for crawlers, but I also think that the classic document Web would not be where it is today if the crawlers were scared of hitting some data and getting it piece by piece.

> Andreas, don't take all of this as personal criticism. I really appreciate your work on YARS and SWSE and think that with these components you are in the position to be one of the first guys who could build a proper Semantic Web search engine.

> Cheers

> Chris

>> Regards,
>> Andreas.

>> [1] http://sw.deri.org/2007/07/sitemapextension/

>> Chris Bizer wrote:
>>> Hi Sheila and all,

>>> it is a great idea to try to draw a map of the Semantic Web and to provide people with a place to refer to in order to see the Semantic Web grow. So, great idea!

>>> But what confuses me a bit is your claim that this is a map of THE Semantic Web as of June 2007.

>>> You have got 200,000 RDF files.

>>> If you look at Swoogle's statistics page http://swoogle.umbc.edu/index.php?option=com_swoogle_stats&Itemid=8, you see that they have 1.2 million files amounting to 436 million triples.

>>> If you look at the Linking Open Data project page http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData you will see that there are over one billion triples, and I would guess that the different servers within the project serve around 30 million RDF documents to the Web.

>>> So my guess would be that:

>>> - your dataset covers less than 1% of the Semantic Web
>>> - Swoogle covers about 4% of the Semantic Web

>>> as of June 2007.

>>> So I think it would be important that people who claim to cover the whole Semantic Web give some details about the crawling algorithms they use to gather their datasets, so that it is possible to judge the accuracy of their results.
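As a quick check on the coverage estimates quoted above, the arithmetic works out as follows (a small sketch; the 30 million document count is Chris's own guess for the Linking Open Data servers, not a measured figure):

    # The coverage estimates above, spelled out.
    lod_documents  = 30_000_000    # estimated RDF documents served by the LOD project
    deri_map_files = 200_000       # RDF files behind the DERI ontology map
    swoogle_files  = 1_200_000     # files reported on Swoogle's statistics page

    print(f"{deri_map_files / lod_documents:.1%}")   # 0.7% -> "less than 1%"
    print(f"{swoogle_files / lod_documents:.1%}")    # 4.0% -> "about 4%"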
>>> The datasources in the Linking Open Data project are all interlinked with RDF links. So it is possible to crawl all 30 million documents by following these links. Good starting points for a crawl are URIs identifying concepts from different domains within DBpedia, as they are interlinked with many other data sets.

>>> Some background information about the idea of RDF links and how crawlers can follow them can be found at http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/

>>> Cheers

>>> Chris

>>> --
>>> Chris Bizer
>>> Freie Universität Berlin
>>> +49 30 838 54057
>>> chris@bizer.de
>>> www.bizer.de

>>> ----- Original Message -----
>>> From: "Golda Velez" <w3@webglimpse.org>
>>> To: "Kinsella, Sheila" <sheila.kinsella@deri.org>
>>> Cc: <semantic-web@w3.org>
>>> Sent: Saturday, July 28, 2007 5:31 AM
>>> Subject: Re: Semantic Web Ontology Map

>>> Very cool! Is there also a text representation of this graph available?

>>> Thanks!

>>> --Golda

>>>> Dear all,

>>>> For those of you who are interested in seeing what type of RDF data is available on the web as of now, we provide an overview of the current state of the Semantic Web at http://sw.deri.org/2007/06/ontologymap/

>>>> Here you can see a graphical representation which Andreas Harth and I have created showing the most commonly occurring classes and the frequency of links between them.

>>>> Any feedback or questions are welcome.

>>>> Sheila
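To make the crawling discussion above concrete, the following is a minimal sketch of a polite RDF crawler: it checks robots.txt, waits ten seconds between requests, caps the number of URIs fetched per site and follows rdfs:seeAlso links. It assumes Python with rdflib; the agent name, limits and error handling are illustrative assumptions, and it is not how SWSE, YARS or any other system mentioned in this thread is actually implemented.

    # A minimal sketch of a polite RDF crawler along the lines discussed in
    # this thread: check robots.txt, wait between requests, cap the number of
    # URIs per site, and follow rdfs:seeAlso links. Illustrative only.
    import time
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    from rdflib import Graph
    from rdflib.namespace import RDFS

    USER_AGENT = "example-rdf-crawler/0.1"   # hypothetical agent name
    DELAY_SECONDS = 10                       # politeness delay from the thread
    MAX_URIS_PER_SITE = 1000                 # per-site cap, as suggested above

    _robots = {}                             # per-host robots.txt cache


    def allowed(url):
        """Return True if robots.txt for the URL's host permits fetching it."""
        parts = urlparse(url)
        host = parts.scheme + "://" + parts.netloc
        if host not in _robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(urljoin(host, "/robots.txt"))
            try:
                rp.read()
            except OSError:
                pass  # robots.txt unreachable: can_fetch() then refuses this host
            _robots[host] = rp
        return _robots[host].can_fetch(USER_AGENT, url)


    def crawl(seed_uris, limit=100):
        """Breadth-first crawl from seed_uris, returning the set of fetched URIs."""
        seen, per_site, queue = set(), {}, list(seed_uris)
        while queue and len(seen) < limit:
            uri = queue.pop(0)
            host = urlparse(uri).netloc
            if uri in seen or per_site.get(host, 0) >= MAX_URIS_PER_SITE:
                continue
            if not allowed(uri):
                continue
            seen.add(uri)
            per_site[host] = per_site.get(host, 0) + 1
            request = urllib.request.Request(
                uri, headers={"Accept": "application/rdf+xml",
                              "User-Agent": USER_AGENT})
            try:
                with urllib.request.urlopen(request, timeout=30) as response:
                    data = response.read()
            except OSError:
                continue
            graph = Graph()
            try:
                graph.parse(data=data, format="xml")  # assumes RDF/XML responses
            except Exception:
                continue
            # Follow rdfs:seeAlso links, as the DERI crawl described above does.
            for target in graph.objects(predicate=RDFS.seeAlso):
                queue.append(str(target))
            time.sleep(DELAY_SECONDS)  # ~8,640 fetches per day at 10 s per request
        return seen

A real crawler would add the batch-download shortcut discussed in this thread, preferring a site's published data dumps (e.g. as advertised via the sitemap extension) over fetching its URIs one by one.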
Received on Monday, 30 July 2007 15:23:44 UTC