You may have seen:

  A new Basketball season brings a new episode in the personal information disaster
  by connolly on Thu, 2006-11-16 12:39
  tags: calendar | GRDDL | microformats | RDF | XHTML
  http://dig.csail.mit.edu/breadcrumbs/node/172

A new schedule came this week, and it had an unexpected line break, so I upgraded from tidy and regular expressions to html5lib and ElementTree.

This message has most of the raw materials for another breadcrumbs episode...

    import html5lib # http://code.google.com/p/html5lib/
    from html5lib import HTMLParser, treebuilders
    from xml.etree import cElementTree

    def parseHTML(fn="bball-practice.html"):
        """
        >>> e = parseHTML()
        >>> e.tag
        'html'
        >>> rows = e.getiterator('tr')
        >>> len(list(rows))
        24
        """
        f = open(fn)
        # build a cElementTree tree from the (possibly messy) HTML
        parser = HTMLParser(tree=treebuilders.getTreeBuilder("etree", cElementTree))
        return parser.parse(f)

    ...

    def eachEvent(...):
        ...
        # find the table whose first cell has a bold 'First Name' heading
        for t in elt.getiterator('table'):
            cell = t.find('tbody/tr/td')
            # an Element with no children tests as false, so compare with None
            if cell is None: continue
            hd = cell.findtext('b')
            if not hd: continue
            if 'First Name' in hd: break
        else:
            # the loop finished without a break: no matching table
            raise ValueError(elt)

For my reference, some hg logs:

    16:ae65b101cf4c 2007-11-18 got html5lib talking with etree
    17:5f81574c79fb 2007-11-18 - use html5lib and ElementTree rather than tidy and regular expressions

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E

Received on Sunday, 18 November 2007 15:16:20 UTC
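
For anyone who wants to try the table search from eachEvent without the schedule file, here is a minimal sketch of the same for/else lookup. It is not the code from the message: it assumes a small well-formed sample (SAMPLE is made up, not the real schedule), uses the standard library's xml.etree.ElementTree in place of html5lib, and the name find_roster_table is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the schedule page (an assumption,
# not the real bball-practice.html). It is well-formed, so no html5lib is needed.
SAMPLE = """<html><body>
<table><tbody><tr><td>no header here</td></tr></tbody></table>
<table><tbody>
<tr><td><b>First Name</b></td><td><b>Last Name</b></td></tr>
<tr><td>Pat</td><td>Smith</td></tr>
</tbody></table>
</body></html>"""


def find_roster_table(root, marker="First Name"):
    """Return the first table whose first cell has a <b> heading containing marker."""
    for table in root.iter("table"):
        cell = table.find("tbody/tr/td")
        # find() returns None when nothing matches; an Element with no
        # children also tests as false, so compare against None explicitly.
        if cell is None:
            continue
        heading = cell.findtext("b")
        if not heading:
            continue
        if marker in heading:
            break
    else:
        # The loop ran to completion without a break: no table matched.
        raise ValueError("no table with a %r heading" % marker)
    return table


root = ET.fromstring(SAMPLE)
table = find_roster_table(root)
rows = table.findall("tbody/tr")
print(len(rows))
```

The else clause belongs to the for loop, not to any of the if statements: it runs only when the loop finishes without hitting break, which is what turns "no matching table" into an error.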