All,
the sitemap.xml solution works IF everybody (or at least most sites) publishes robots.txt or sitemap.xml at the root directory. So, conceptually speaking, it should be the way to go.

But a quick test on the LOD cloud returned 404 for both sitemap.xml and robots.txt on many if not most sites...
Curiously, for many of those without a sitemap.xml, the <c-name>/sparql URI convention for accessing the SPARQL endpoint DOES work...
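
FWIW, such a probe is easy to script. A minimal sketch in Python
(standard library only; the example host is hypothetical, and treating
any non-404 reply on /sparql as a hint of an endpoint is just my
assumption):

import urllib.request
import urllib.error

def probe(url, timeout=10):
    # Return the HTTP status code, or None if the request fails outright.
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 404 when sitemap.xml is missing
    except (urllib.error.URLError, OSError):
        return None

def discovery_report(root):
    # Check the three discovery conventions discussed above for one site.
    root = root.rstrip("/")
    return {path: probe(root + path)
            for path in ("/sitemap.xml", "/robots.txt", "/sparql")}

# Hypothetical host; note that many SPARQL endpoints answer 400 (not
# 404) to a query-less request, so any non-404 reply is suggestive.
print(discovery_report("http://example.org"))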

So something is still missing. Either each dataspace maintainer willing to provide a SPARQL endpoint also provides a sitemap.xml (even a minimal one) or a voiD description, or at least follows this convention.
This would greatly enhance the accessibility of the data, and enable tools to find it automatically as needed...
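
To give an idea of how small "minimal" can be, a semantic sitemap along
the lines of [2] might look like this (element names as I read the DERI
extension; the sc: namespace URI and the URLs are illustrative
assumptions):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema#">
  <sc:dataset>
    <sc:datasetLabel>Example dataspace</sc:datasetLabel>
    <sc:sparqlEndpointLocation>http://example.org/sparql</sc:sparqlEndpointLocation>
  </sc:dataset>
</urlset>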

Cheers
D


Sergio Fernández wrote:
On Sat, 2009-03-07 at 00:36 -0300, Daniel Schwabe wrote:
I could query the site for its sitemap extension (would it always be
<home url>/sitemap.xml?)

Yes, you can do it programmatically. But that URL (/sitemap.xml), even
though it's commonly used, is not mandatory, so you can't rely on it as
a constant. There is one way, though, that is less direct but at least
standard:

1) From /robots.txt you can take the Sitemap's URL (via the "Sitemap:"
directive, as [1] specifies).
2) According to the extension proposed by DERI [2], you can check
whether the sitemap points to a SPARQL endpoint by looking for the
sc:sparqlEndpointLocation element (see the sketch below).
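
A minimal sketch of both steps (Python standard library only; the sc:
namespace URI is my reading of [2] and should be treated as an
assumption, as should the example host):

import urllib.request
import xml.etree.ElementTree as ET

# Assumed namespace URI for the DERI semantic sitemap extension [2].
SC_NS = "http://sw.deri.org/2007/07/sitemapextension/scschema#"

def sitemaps_from_robots(site):
    # Step 1: collect the URLs named by "Sitemap:" lines in /robots.txt.
    with urllib.request.urlopen(site.rstrip("/") + "/robots.txt",
                                timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith("sitemap:")]

def sparql_endpoints(sitemap_url):
    # Step 2: look for sc:sparqlEndpointLocation elements in the sitemap.
    with urllib.request.urlopen(sitemap_url, timeout=10) as resp:
        tree = ET.parse(resp)
    return [el.text.strip()
            for el in tree.iter("{%s}sparqlEndpointLocation" % SC_NS)
            if el.text]

for sm in sitemaps_from_robots("http://example.org"):  # hypothetical host
    print(sm, sparql_endpoints(sm))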

Hope that helps.

Best,

[1] http://www.sitemaps.org/protocol.php
[2] http://sw.deri.org/2007/07/sitemapextension/


--
Daniel Schwabe
Tel:+55-21-3527 1500 r. 4356
Fax: +55-21-3527 1530
http://www.inf.puc-rio.br/~dschwabe
Dept. de Informatica, PUC-Rio
R. M. de S. Vicente, 225
Rio de Janeiro, RJ 22453-900, Brasil