
Resource discovery

From: Lars Aronsson <lars@aronsson.se>
Date: Thu, 26 Feb 2004 00:45:51 +0100 (CET)
To: www-talk@w3.org
Message-ID: <Pine.LNX.4.44.0402260030190.995-100000@bryant.aronsson.se>


Lately I have been running a crontab and wget script that pulls RSS
feeds from various sources, storing the items in a database with the
<link> as the primary key, and adding a timestamp on insert.  This
means I never store the same link twice.
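To illustrate the dedup idea (not my actual script): with the <link> as primary key, an insert of an already-seen link can simply be ignored. A minimal sketch in Python with SQLite, using hypothetical table and column names:

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: the item <link> is the primary key,
# and a timestamp is recorded when the row is first inserted.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        link  TEXT PRIMARY KEY,
        title TEXT,
        seen  TEXT
    )
""")

def store_item(link, title):
    # INSERT OR IGNORE: a link that is already stored is silently
    # skipped, so the same link is never stored twice.
    conn.execute(
        "INSERT OR IGNORE INTO items (link, title, seen) VALUES (?, ?, ?)",
        (link, title, datetime.now(timezone.utc).isoformat()),
    )

store_item("http://example.com/a", "First headline")
store_item("http://example.com/a", "Duplicate, ignored")
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]  # 1
```

Any database with a unique-key constraint gives the same effect; the timestamp then records when each item was first seen in a feed.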

Since some sources don't provide RSS feeds, I also pull HTML pages and
use regexps to dig out the <a href="...">...</a> anchors, which give me
the <link> and <title> elements.  This is quite useful with sources such
as the front page of http://www.washingtonpost.com/
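The regexp approach looks roughly like this sketch (Python rather than Perl, and a made-up sample page; real-world HTML often defeats regexps, but this matches the spirit of it):

```python
import re

# Crude regexp extraction of <a ...>...</a> anchors: the href becomes
# the <link> and the anchor text becomes the <title>.
html = '<a href="http://example.com/story">Big Headline</a>'
pattern = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)

items = [{"link": m.group(1), "title": m.group(2).strip()}
         for m in pattern.finditer(html)]
```

An HTML parser would be more robust against nested tags and unquoted attributes, but for a front page with a stable layout a regexp gets the job done.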

Now, on that newspaper's web page it would also be possible to extract
the two lines of plain text following the headline link, for use as a
<description>.  In this case an <hr> separates the items, and the
<description> text follows the <link> and <title>.  For other sources,
the HTML might be different.
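For that particular layout, splitting on <hr> and taking the text after each headline link would work. A sketch, again with invented markup modeled on the description above:

```python
import re

# Hypothetical markup: items separated by <hr>, each with a headline
# link followed by plain-text description.
page = ('<a href="http://example.com/one">One</a> First description text.'
        '<hr>'
        '<a href="http://example.com/two">Two</a> Second description text.')

link_re = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>\s*(.*)', re.DOTALL)
items = []
for chunk in page.split("<hr>"):
    m = link_re.search(chunk)
    if m:
        # Text trailing the anchor becomes the <description>.
        items.append({"link": m.group(1),
                      "title": m.group(2),
                      "description": m.group(3).strip()})
```

Each source would need its own splitting rule, which is exactly why a configurable tool would be handy.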

Is there any good utility to extract an RSS feed from an HTML page?
With open source code?  In Perl?

(I just rejoined www-talk after 10 years of absence.  I notice this
list is still run by Smartlist, and that spam occasionally hits the
list.  Would it be possible to switch to Mailman and Pipermail?  I see
less spam on lists that are run by Mailman.)


-- 
  Lars Aronsson (lars@aronsson.se)
  Aronsson Datateknik - http://aronsson.se/

  The world's 3rd biggest wiki is in Swedish - http://susning.nu/
Received on Sunday, 29 February 2004 21:29:40 UTC

This archive was generated by hypermail 2.3.1 : Wednesday, 5 February 2014 07:14:03 UTC