
sitemap.py -- crawl web site, record links, make diagram

From: <connolly@w3.org>
Date: Thu, 02 Jan 2003 22:39:07 -0600
To: www-rdf-interest@w3.org
cc: connolly@w3.org, em@w3.org
Message-Id: <E18UJbv-0001Ms-00@jammer.dm93.org>

I started working on my church's web site. Since I don't know everything
that's on it, I'd like to get a feel for its contents -- a site map, say.

Surely somebody has done this before, but I couldn't
find it, and it was such an obvious hack that I just
wrote it:

 sitemap.py,v 1.3 2003/01/03 04:18:32

It's 185 lines, including comments and debug print statements.
(That's in addition to the Python standard urllib stuff,
DV's HTML parser and XPath implementation,
and the swap RDF store and serializer.)
It took just a few hours to develop. Fun stuff!

You invoke it like so...

  python sitemap.py http://www.fellowshipofgrace.org/ 100 >sitemap.rdf
  (you need the swap stuff in your PYTHONPATH)

and it crawls the site (up to 100 pages) and records
the titles of the pages (using dc:title) and
the links (using dc:relation). For example:

    <rdf:Description rdf:about="http://www.fellowshipofgrace.org/about_us.html">
        <dc:relation rdf:resource="http://www.efca.org"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/about_us.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/contact.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/god_s_plan.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/index.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/jan1.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/ministries.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/pastors.html"/>
        <dc:relation rdf:resource="http://www.fellowshipofgrace.org/statement.html"/>
        <dc:title>About Us</dc:title>
    </rdf:Description>

That's an excerpt from the resulting sitemap.rdf.
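
The crawl-and-record step described above can be sketched roughly as
follows. This is not sitemap.py itself: it assumes Python 3 and uses only
the standard library (html.parser instead of DV's parser and xpath, and
hand-written RDF/XML instead of the swap store and serializer). The names
PageScanner, scan, fetch, crawl and to_rdfxml are made up for
illustration, and error handling for pages that fail to load is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen
from xml.sax.saxutils import escape, quoteattr


class PageScanner(HTMLParser):
    """Collect the <title> text and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.hrefs = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scan(base, html):
    """Return (title, sorted absolute link targets) for one page."""
    scanner = PageScanner()
    scanner.feed(html)
    links = {urldefrag(urljoin(base, h)).url for h in scanner.hrefs}
    return scanner.title.strip(), sorted(links)


def fetch(url):
    """Illustrative default fetcher: read a page over HTTP as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start, limit, fetch=fetch):
    """Breadth-first crawl of at most `limit` pages whose URL starts with `start`."""
    site = {}  # url -> (title, links)
    queue = [start]
    while queue and len(site) < limit:
        url = queue.pop(0)
        if url in site:
            continue
        site[url] = scan(url, fetch(url))
        # Record every link, but only follow the ones that stay on the site.
        queue.extend(l for l in site[url][1] if l.startswith(start))
    return site


def to_rdfxml(site):
    """Write one rdf:Description per page, using dc:relation and dc:title."""
    out = ['<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
           '         xmlns:dc="http://purl.org/dc/elements/1.1/">']
    for url, (title, links) in sorted(site.items()):
        out.append("    <rdf:Description rdf:about=%s>" % quoteattr(url))
        for link in links:
            out.append("        <dc:relation rdf:resource=%s/>" % quoteattr(link))
        out.append("        <dc:title>%s</dc:title>" % escape(title))
        out.append("    </rdf:Description>")
    out.append("</rdf:RDF>")
    return "\n".join(out)
```

With these definitions, print(to_rdfxml(crawl(start_url, 100))) would
play the role of the command line shown earlier. Passing a different
fetch function makes it possible to try the crawl on canned pages
without touching the network.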

Then I used the circles and arrows tools (with a specific set of rules)
to produce a diagram.
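
For readers without the circles and arrows tools at hand, here is a
rough stand-in for the diagram step. It is only a sketch under stated
assumptions: it takes the links as a plain Python dict (page URL mapped
to a list of linked URLs) rather than reading the RDF, and it assumes
Graphviz DOT text as the output format. The function names short_label
and to_dot are invented for this example. Self-links, like the one from
about_us.html to itself in the excerpt, are dropped to reduce clutter.

```python
from urllib.parse import urlsplit


def short_label(url):
    """Last path segment of a URL, or the host name when the path is empty."""
    parts = urlsplit(url)
    tail = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return tail or parts.netloc


def to_dot(links):
    """Render {page_url: [linked_url, ...]} as Graphviz digraph source."""
    lines = ["digraph sitemap {", "    rankdir=LR;"]
    # Every page and every link target becomes a node with a short label.
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    ids = {url: "n%d" % i for i, url in enumerate(nodes)}
    for url in nodes:
        label = short_label(url).replace("\\", "\\\\").replace('"', '\\"')
        lines.append('    %s [label="%s"];' % (ids[url], label))
    for src in sorted(links):
        for dst in sorted(set(links[src])):
            if dst != src:  # skip links from a page to itself
                lines.append("    %s -> %s;" % (ids[src], ids[dst]))
    lines.append("}")
    return "\n".join(lines)
```

Saving the returned text to a file and running it through Graphviz
(for example "dot -Tpng sitemap.dot -o sitemap.png") should give a
picture of the link structure.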

Even using short labels, the diagram is busier than I had
hoped/expected, but that's just because there are, in fact, a lot of
links.  This is a pretty small web site; we'd clearly need better
visualization tools for anything larger.

Bonus points to anybody who can make a nicer picture
from the sitemap.rdf file.

p.s. I'm using a mailer I don't usually use, so
apologies for weird From: headers and such.
Also note that I'm not subscribed to www-rdf-interest,
so please copy me on replies.

Dan Connolly, W3C http://www.w3.org/People/Connolly/
Received on Thursday, 2 January 2003 23:39:29 UTC
