
sitemap.py -- crawl web site, record links, make diagram

From: <connolly@w3.org>
Date: Thu, 02 Jan 2003 22:39:07 -0600
To: www-rdf-interest@w3.org
cc: connolly@w3.org, em@w3.org
Message-Id: <E18UJbv-0001Ms-00@jammer.dm93.org>


I started working on my church's web site. Since I don't know everything
that's there, I'd like to get an overview -- a site map, say.

Surely somebody has done this before, but I couldn't
find it, and it was such an obvious hack that I just
wrote it:

 http://www.w3.org/2000/10/swap/util/sitemap.py
 sitemap.py,v 1.3 2003/01/03 04:18:32

It's 185 lines, including comments and debug print statements
(not counting the Python standard urllib stuff,
DV's HTML parser and XPath implementation,
and the swap RDF store and serializer).
It took just a few hours to develop. Fun stuff!

You invoke it like this:

  python sitemap.py http://www.fellowshipofgrace.org/ 100 >sitemap.rdf
  (you need the swap stuff in your PYTHONPATH)

and it crawls the site (up to 100 pages) and records
the titles of the pages (using dc:title) and
the links (using dc:relation). For example:

    <rdf:Description rdf:about="http://www.fellowshipofgrace.org/about_us.html">
        <dc:relation rdf:resource="http://www.efca.org"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/about_us.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/contact.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/god_s_plan.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/index.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/jan1.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/ministries.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/pastors.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/statement.html"/>
        <dc:title>About Us</dc:title>
        <dc:type>text/html</dc:type>
        <label>about_us</label>
    </rdf:Description>

That's an excerpt from
  http://www.fellowshipofgrace.org/2003/maint/sitemap.rdf
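
For readers who want to see the shape of the per-page step, here is a
minimal sketch in modern Python using only the standard library. It is
not the code in sitemap.py (which uses DV's HTML parser, XPath, and the
swap RDF store); the names PageScanner and describe_page are made up for
illustration, and it formats the RDF/XML by hand rather than through a
serializer. It only covers one page: title and links in, one
rdf:Description out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from xml.sax.saxutils import escape, quoteattr


class PageScanner(HTMLParser):
    """Hypothetical helper: collect the <title> text and <a href> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.links = set()
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL, drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.add(absolute)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_page(url, html):
    """Return one rdf:Description element (as text) for a fetched page."""
    scanner = PageScanner(url)
    scanner.feed(html)
    lines = ["<rdf:Description rdf:about=%s>" % quoteattr(url)]
    for link in sorted(scanner.links):
        lines.append("    <dc:relation rdf:resource=%s/>" % quoteattr(link))
    lines.append("    <dc:title>%s</dc:title>" % escape(scanner.title.strip()))
    lines.append("</rdf:Description>")
    return "\n".join(lines)


# A tiny stand-in page, so the sketch runs without touching the network.
sample = """<html><head><title>About Us</title></head>
<body><a href="contact.html">Contact</a>
<a href="http://www.efca.org">EFCA</a></body></html>"""
print(describe_page("http://www.fellowshipofgrace.org/about_us.html", sample))
```

A crawler along these lines would fetch each URL, call something like
describe_page on the result, and queue the on-site links it has not
seen yet until the page limit is reached.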

Then I used the circles and arrows tools
  http://www.w3.org/2001/02pd/
specifically, these rules
  http://www.w3.org/2001/02pd/sitemap-style.n3
to produce a diagram
  http://www.fellowshipofgrace.org/2003/maint/sitemapFig.svg
  http://www.fellowshipofgrace.org/2003/maint/sitemapFig.ps

Even using short labels, the diagram is busier than I had
hoped/expected, but that's just because there are, in fact, a lot of
links.  This is a pretty small web site; we'd clearly need better
visualization tools for anything larger.

Bonus points to anybody who can make a nicer picture
from the sitemap.rdf file.
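
One possible starting point for a different picture, sketched here under
stated assumptions: read the RDF/XML with the Python standard library and
write a Graphviz DOT graph, dropping self-links and off-site links to cut
the clutter. This is not the circles and arrows tooling mentioned above.
The function names rdf_to_dot and short_label are hypothetical, the dc:
prefix is assumed to be bound to http://purl.org/dc/elements/1.1/, and
the file is assumed to use plain rdf:Description elements as in the
excerpt.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"  # assumed binding for the dc: prefix


def short_label(url):
    """Last path segment without its extension; fall back to the full URL."""
    name = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return name.rsplit(".", 1)[0] or url


def rdf_to_dot(rdf_xml, site_host):
    """Turn dc:relation links between pages on site_host into DOT text."""
    root = ET.fromstring(rdf_xml)
    edges = []
    for desc in root.iter("{%s}Description" % RDF):
        source = desc.get("{%s}about" % RDF)
        if source is None:
            continue
        for rel in desc.findall("{%s}relation" % DC):
            target = rel.get("{%s}resource" % RDF)
            if target is None or target == source:
                continue  # skip malformed entries and self-links
            if urlparse(target).netloc != site_host:
                continue  # skip off-site links
            edges.append((source, target))

    nodes = sorted({url for edge in edges for url in edge})
    out = ["digraph sitemap {"]
    for url in nodes:
        out.append('    "%s" [label="%s"];' % (url, short_label(url)))
    for source, target in edges:
        out.append('    "%s" -> "%s";' % (source, target))
    out.append("}")
    return "\n".join(out)


# A cut-down copy of the excerpt, wrapped in rdf:RDF so it parses on its own.
sample_rdf = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.fellowshipofgrace.org/about_us.html">
    <dc:relation rdf:resource="http://www.efca.org"/>
    <dc:relation rdf:resource="http://www.fellowshipofgrace.org/about_us.html"/>
    <dc:relation rdf:resource="http://www.fellowshipofgrace.org/contact.html"/>
    <dc:title>About Us</dc:title>
  </rdf:Description>
</rdf:RDF>"""
print(rdf_to_dot(sample_rdf, "www.fellowshipofgrace.org"))
```

The output can be fed to a Graphviz layout program such as dot or neato.
Whether filtering links this way gives a nicer picture for this site is
untested here; it only shrinks the graph.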

p.s. I'm using a mailer I don't usually use, so
apologies for weird From: headers and such.
Also note that I'm not subscribed to www-rdf-interest,
so please copy me on replies.

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Thursday, 2 January 2003 23:39:29 GMT
