- From: Lars Aronsson <lars@aronsson.se>
- Date: Thu, 26 Feb 2004 00:45:51 +0100 (CET)
- To: www-talk@w3.org
Lately I have been running a crontab and wget script that pulls RSS feeds from various sources, storing the items in a database with the <link> as the primary key and adding a timestamp on insert. This means I never store the same link twice. Since some sources don't provide RSS feeds, I also pull HTML pages and use regexps to dig out the <a href="..." >...</a> parts, which give me the <link> and <title> elements. This is quite useful with sources such as the front page of http://www.washingtonpost.com/

On that newspaper's web page it would also be possible to extract the two lines of plain text following each headline link, for use as a <description>. In this case an <hr> separates the items, and the <description> text follows the <link> and <title>. For other sources, the HTML might be different.

Is there any good utility to extract an RSS feed from an HTML page? With open source code? In Perl?

(I just rejoined www-talk after 10 years of absence. I notice this list is still run by Smartlist, and that spam occasionally hits the list. Would it be possible to switch to Mailman and Pipermail? I see less spam on lists that are run by Mailman.)

--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se/
The world's 3rd biggest wiki is in Swedish - http://susning.nu/
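[Editor's note: the workflow described above — regexp extraction of <a href="...">...</a> pairs for <link>/<title>, plus a database keyed on <link> so duplicates are never stored twice — could be sketched roughly as follows. This is in Python rather than Perl, purely for illustration; the HTML snippet, table schema, and column names are made-up assumptions, not the author's actual setup.]

```python
import re
import sqlite3

# Hypothetical scraped front page, standing in for real HTML.
# Note the deliberate duplicate of story1.
html = """
<a href="http://example.com/story1">First headline</a>
Two lines of summary text. <hr>
<a href="http://example.com/story2">Second headline</a>
More summary text. <hr>
<a href="http://example.com/story1">First headline</a>
"""

# Regexp in the spirit of the post: dig out <a href="...">...</a>
# pairs as (link, title). A bare regexp breaks on extra attributes,
# nesting, and entities; a real scraper would want an HTML parser.
link_re = re.compile(r'<a\s+href="([^"]+)"\s*>([^<]+)</a>', re.I)
items = [(m.group(1), m.group(2).strip())
         for m in link_re.finditer(html)]

# Store items with <link> as the primary key and a timestamp on
# insert, so the same link is never stored twice.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE items (
                  link  TEXT PRIMARY KEY,
                  title TEXT,
                  seen  TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")
for link, title in items:
    # INSERT OR IGNORE silently skips links already present.
    db.execute("INSERT OR IGNORE INTO items (link, title) VALUES (?, ?)",
               (link, title))
db.commit()

rows = db.execute("SELECT link, title FROM items ORDER BY link").fetchall()
print(rows)  # the duplicate story1 is stored only once
```

Keying on <link> pushes the dedup work into the database, so the crontab run can be fired repeatedly without any client-side bookkeeping.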
Received on Sunday, 29 February 2004 21:29:40 UTC