sitemap.py -- crawl web site, record links, make diagram

I've started working on my church's web site. Since I don't know
everything that's on it, I'd like an overview of its contents -- a
site map, say.

Surely somebody has done this before, but I couldn't
find it, and it was such an obvious hack that I just
wrote it:

 http://www.w3.org/2000/10/swap/util/sitemap.py
 sitemap.py,v 1.3 2003/01/03 04:18:32

It's 185 lines, including comments and debug print statements
(not counting what it builds on: the Python standard urllib stuff,
DV's HTML parser and XPath implementation,
and the swap RDF store and serializer).
It took just a few hours to develop. Fun stuff!

You invoke it like this:

  python sitemap.py http://www.fellowshipofgrace.org/ 100 >sitemap.rdf
  (you need the swap stuff in your PYTHONPATH)

and it crawls the site (up to 100 pages) and records
the titles of the pages (using dc:title) and
the links (using dc:relation). For example:

    <rdf:Description rdf:about="http://www.fellowshipofgrace.org/about_us.html">
        <dc:relation rdf:resource="http://www.efca.org"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/about_us.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/contact.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/god_s_plan.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/index.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/jan1.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/ministries.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/pastors.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/statement.html"/>
        <dc:title>About Us</dc:title>
        <dc:type>text/html</dc:type>
        <label>about_us</label>
    </rdf:Description>

That's an excerpt from
  http://www.fellowshipofgrace.org/2003/maint/sitemap.rdf
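
For readers who just want the shape of the crawl, here is a minimal
sketch in modern Python using only the standard library. It is not the
code in sitemap.py (which uses DV's parser and the swap RDF store); the
names LinkTitleParser, fetch, and crawl are made up for illustration,
and the breadth-first order and the rule of recording but not following
off-site links are assumptions on my part.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkTitleParser(HTMLParser):
    """Collect the <title> text and every <a href> from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Hypothetical default fetcher: return the page body as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start, limit, fetch=fetch):
    """Breadth-first crawl from `start`, visiting at most `limit` pages.

    Returns {page_url: (title, sorted list of absolute link targets)}.
    Off-site links are recorded but not followed (an assumption).
    """
    site = urlparse(start).netloc
    queue = deque([start])
    seen = {start}
    pages = {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        parser = LinkTitleParser()
        try:
            parser.feed(fetch(url))
        except OSError:
            continue  # unreachable page: skip it
        # Resolve relative links and drop #fragments so each page is one node.
        links = sorted({urldefrag(urljoin(url, h)).url for h in parser.hrefs})
        pages[url] = (parser.title.strip(), links)
        for link in links:
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Each entry in the result corresponds to one rdf:Description above: the
title would become dc:title and each link target a dc:relation.
Passing fetch as a parameter makes it possible to try the crawl
against canned pages without touching the network.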

Then I used the circles and arrows tools
  http://www.w3.org/2001/02pd/
specifically, these rules
  http://www.w3.org/2001/02pd/sitemap-style.n3
to produce a diagram
  http://www.fellowshipofgrace.org/2003/maint/sitemapFig.svg
  http://www.fellowshipofgrace.org/2003/maint/sitemapFig.ps
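
To give a rough idea of what going from the RDF to a diagram involves,
here is a small sketch that reads the dc:relation statements and
writes a Graphviz digraph as text. It is not the sitemap-style.n3
rules; the function name rdf_to_dot is made up, Graphviz DOT is my own
choice of output format for the sketch, and the sketch assumes the
file uses the usual RDF and Dublin Core 1.1 namespace URIs and plain
rdf:Description elements like the excerpt above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the excerpt shows only the prefixes.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"


def rdf_to_dot(rdf_xml):
    """Turn dc:relation statements into Graphviz digraph source text."""
    root = ET.fromstring(rdf_xml)
    titles = {}
    edges = set()
    for desc in root.iter("{%s}Description" % RDF):
        page = desc.get("{%s}about" % RDF)
        if page is None:
            continue
        title = desc.findtext("{%s}title" % DC)
        if title:
            titles[page] = title
        for rel in desc.findall("{%s}relation" % DC):
            target = rel.get("{%s}resource" % RDF)
            # Skip links from a page to itself; they only add clutter.
            if target and target != page:
                edges.add((page, target))

    lines = ["digraph sitemap {"]
    for page, title in sorted(titles.items()):
        label = title.replace('"', '\\"')
        lines.append('  "%s" [label="%s"];' % (page, label))
    for src, dst in sorted(edges):
        lines.append('  "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be saved to a file and handed to a Graphviz
layout program. Dropping self-links is a design choice of the sketch;
the excerpt above shows about_us.html linking to itself, for instance.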

Even using short labels, the diagram is busier than I had
hoped/expected, but that's just because there are, in fact, a lot of
links.  This is a pretty small web site; we'd clearly need better
visualization tools for anything larger.

Bonus points to anybody who can make a nicer picture
from the sitemap.rdf file.

p.s. I'm using a mailer I don't usually use, so
apologies for weird From: headers and such.
Also note that I'm not subscribed to www-rdf-interest,
so please copy me on replies.

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Thursday, 2 January 2003 23:39:29 UTC