- From: Dare Obasanjo <dareo@microsoft.com>
- Date: Wed, 8 Jun 2005 09:46:52 -0700
- To: "Dan Connolly" <connolly@w3.org>, <www-tag@w3.org>
Google Sitemaps also supports RSS and Atom 0.3, so they did consider supporting other syndication formats[0]. According to Greg Stein[1], they wanted a format that could scale to millions of URLs, which he implied the XML syndication formats (the various flavors of RSS and Atom) could not.

[0] http://www.google.com/webmasters/sitemaps/docs/en/faq.html#s8
[1] http://www.imc.org/atom-syntax/mail-archive/msg15904.html

--
PITHY WORDS OF WISDOM
A meeting is an event at which the minutes are kept and the hours are lost.

________________________________

From: www-tag-request@w3.org on behalf of Dan Connolly
Sent: Wed 6/8/2005 9:17 AM
To: www-tag@w3.org
Subject: google sitemaps and some history of sitemaps [siteData-36]

So a few days ago, this crossed my desktop from umpteen sources...

  "Google Sitemaps is an experiment in web crawling. Using Sitemaps to
  inform and direct our crawlers, we hope to expand our coverage of the
  web and improve the time to inclusion in our index."
  -- https://www.google.com/webmasters/sitemaps/docs/en/about.html

It's clearly relevant to issue siteData-36
http://www.w3.org/2001/tag/issues.html?type=1#siteData-36

It seems very ironic, to me; W3C held a workshop a while ago...

  Distributed Indexing/Searching Workshop
  May 28-29, 1996 in Cambridge, Massachusetts
  http://www.w3.org/Search/9605-Indexing-Workshop/

Going into that workshop, my sense was that we needed a simple format for sites to summarize their contents so that search engines wouldn't have to crawl the whole thing to figure out what's there. There was a whole session on this idea...

  "The third breakout/writeup session focused on mechanisms to allow
  information servers to notify indexers when content changes."
  -- http://www.w3.org/Search/9605-Indexing-Workshop/ExecSummary.html

What I learned at the workshop was: search engines don't care what you think is interesting about your site; they have their own idea about what's interesting, mostly based on links from other parts of the web. They don't crawl your whole site just because it's there; they focus on pages that have lots of incoming links and such.

So now to see google making use of a sitemap format 10 years later kinda blows my mind.

Note that RSS once stood for Rich Site Summary... interesting... according to Robin Cover, it still does:
http://www.oasis-open.org/cover/rss.html

I wonder if google considered something RDF-based like RSS and decided against it or if they just didn't think about it.

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E
Received on Wednesday, 8 June 2005 16:47:02 UTC