minimalist regular-expression feed triplr

Haven't tested on all feeds yet. A qname resolver for absolute URIs on the predicates is another line, but I'm lazy. Later versions at http://repo.or.cz/w/element.git?a=blob;f=ruby/W/simple.rb
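
For illustration, the qname resolver could look roughly like the sketch below. This is not the code in the repo: `namespaces` and `resolve_qname` are hypothetical names, and the sketch assumes the prefixes are declared as xmlns attributes somewhere in the feed text and that the predicate is available as a full qname (the posted parseFeed keeps only the local part after the colon).

```ruby
# Hypothetical helpers, not part of the posted code.

# Collect prefix => namespace URI from xmlns declarations anywhere in the feed.
# The default namespace (plain xmlns="...") is stored under the empty prefix.
def namespaces(feed)
  ns = {}
  feed.scan(/xmlns(?::([\w.-]+))?=["']([^"']+)["']/) { |prefix, uri| ns[prefix || ''] = uri }
  ns
end

# Expand a qname such as "dc:creator" to an absolute URI by plain concatenation.
# Qnames with an undeclared prefix are returned unchanged.
def resolve_qname(qname, ns)
  prefix, local = qname.include?(':') ? qname.split(':', 2) : ['', qname]
  ns.key?(prefix) ? ns[prefix] + local : qname
end

# Made-up RSS 1.0 root element, used only to exercise the helpers.
feed = %q{<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/">}
ns = namespaces(feed)
puts resolve_qname('dc:creator', ns) # http://purl.org/dc/elements/1.1/creator
puts resolve_qname('title', ns)      # http://purl.org/rss/1.0/title
```

Declarations are collected feed-wide rather than per element, so a prefix that is rebound deeper in the document would resolve to whichever declaration comes last.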

require 'open-uri'

class String
  # yields subject, predicate, object strings for every item/entry in the feed
  def parseFeed
    scan(%r{<(rss:|atom:)?(item|entry)([\s][^>]*)?>(.*?)</\1?\2>}mi){|m| # item: m[2] = attributes, m[3] = body
      u = m[2] && (u=m[2].match(/about=["']?([^'"]+)/)) && u[1] || m[3].match(/id>([^<]+)/)[1] # URI: rdf:about, else id/guid
      m[3].scan(%r{<([a-z:]+)?link ([^>]+)>}mi){|e|yield u,e[1].match(/rel=['"]?([^'"\s]+)/)[1],e[1].match(/href=['"]?([^'"\s]+)/)[1]} # link
      m[3].scan(%r{<([a-z:]+)([\s][^>]*)?>(.*?)</\1>}mi){|e|yield u,e[0].split(/:/)[-1],e[2][0..64]} # element, content cut to 65 chars
    }
  end
end
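
The link line assumes every link tag carries both rel and href. On a tag missing either one, `match` returns nil and the `[1]` raises NoMethodError. A more forgiving variant might look like the sketch below. `each_link` is a hypothetical name, the example.org URLs are made up, and defaulting a missing rel to "alternate" follows the Atom spec (RFC 4287) rather than anything in the posted code.

```ruby
# Hypothetical helper, not part of the posted code.
# Yields rel, href for every <link ...> tag in an entry body, skipping tags
# without an href and treating a missing rel as "alternate".
def each_link(entry)
  entry.scan(%r{<(?:[a-z]+:)?link\s([^>]+)>}mi) do |(attrs)|
    href = attrs[/\bhref=['"]?([^'"\s>]+)/i, 1]
    next unless href
    rel = attrs[/\brel=['"]?([^'"\s>]+)/i, 1] || 'alternate'
    yield rel, href
  end
end

# Made-up entry body: one full link, one without rel, one without href.
entry = %q{<link rel='self' href='http://example.org/feed/1'/><link href="http://example.org/post/1"/><link rel="edit"/>}
each_link(entry) { |rel, href| puts [rel, href].join("\t") }
```

On the sample string this prints the self link and the rel-less link (as alternate), and silently drops the tag with no href.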

irb(main):175:0> open('http://mt-shortwave.blogspot.com/feeds/posts/default').read.parseFeed{|s,p,o|puts [s,p,o[0..64]].join "\t"}
tag:blogger.com,1999:blog-28878961.post-792620716806050191  alternate  http://mt-shortwave.blogspot.com/2008/04/bbc-radio-chief-rejects-
tag:blogger.com,1999:blog-28878961.post-792620716806050191  replies    http://mt-shortwave.blogspot.com/feeds/792620716806050191/comment
tag:blogger.com,1999:blog-28878961.post-792620716806050191  self       http://www.blogger.com/feeds/28878961/posts/default/7926207168060
tag:blogger.com,1999:blog-28878961.post-792620716806050191  edit       http://www.blogger.com/feeds/28878961/posts/default/7926207168060
tag:blogger.com,1999:blog-28878961.post-792620716806050191  id         tag:blogger.com,1999:blog-28878961.post-792620716806050191
tag:blogger.com,1999:blog-28878961.post-792620716806050191  published  2008-04-28T08:33:00.000-07:00
tag:blogger.com,1999:blog-28878961.post-792620716806050191  updated    2008-04-28T08:38:24.327-07:00
tag:blogger.com,1999:blog-28878961.post-792620716806050191  title      BBC radio chief rejects calls to privatise Radio 1 and Radio 2
tag:blogger.com,1999:blog-28878961.post-792620716806050191  content    &lt;a href="http://bp0.blogger.com/_eFGtrBi5YL8/SBXvNzdm8pI/AAAAA
tag:blogger.com,1999:blog-28878961.post-792620716806050191  author     <name>Gayle</name>



* had some problems with the existing feed libs
- Nothing works on Ruby 1.9 except Simple-RSS and Raptor via Redland-bindings. The Ruby port of Mark Pilgrim's feed parser is close to 200K of source excluding tests, so it's no surprise something in it is broken.
- Raptor/Redland-bindings is a pain to build on shared hosts. I'm also getting symbol-resolution errors linking the latest release versions even on a nice box, it segfaults and/or screws up on some nasty feeds, it doesn't work on JRuby or Rubinius, and it requires SWIG, a compiler and -dev libs.
- Simple-RSS does things I don't want or need: it creates an intermediary hash from the found 'triples', which I'd have to deconstruct back into the triples, and it turns the strings into Ruby objects (requiring more libs and clobbering the original content) when I just wanted the strings to begin with. It also has a hardcoded set of tags to look for, and it misses the <link rel= tags from Atom feeds.
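
Going the other way is cheap: if a per-entry hash is wanted after all, it can be built from the triples in a few lines. A sketch only, with sample triples copied from the irb session above; in real use they would come from the parseFeed block, and the `entries` layout is just one possible shape.

```ruby
# Sample triples copied from the irb session; normally yielded by parseFeed.
s = 'tag:blogger.com,1999:blog-28878961.post-792620716806050191'
triples = [
  [s, 'published', '2008-04-28T08:33:00.000-07:00'],
  [s, 'updated',   '2008-04-28T08:38:24.327-07:00'],
  [s, 'title',     'BBC radio chief rejects calls to privatise Radio 1 and Radio 2'],
]

# subject => { predicate => [objects] }; the objects stay plain strings.
entries = Hash.new { |h, k| h[k] = Hash.new { |h2, k2| h2[k2] = [] } }
triples.each { |subj, pred, obj| entries[subj][pred] << obj }

puts entries[s]['title'].first
```

Each predicate maps to an array because a subject can repeat a predicate, as the link rels in the session show.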

Received on Sunday, 4 May 2008 10:29:40 UTC