- From: Dan Brickley <danbri@danbri.org>
- Date: Fri, 25 Nov 2011 13:49:06 +0100
- To: Peter Williams <home_pw@msn.com>
- Cc: "public-xg-webid@w3.org" <public-xg-webid@w3.org>
[snip]

Re dirty HTML, this is a very real issue. HTML documents are usually pretty crappy, standards-wise. I'd suggest looking into HTML5's approach. It has a much more liberal parsing regime than XML (this was one of the major drivers for the original WHATWG/XHTML fork).

So http://www.w3.org/TR/html5/parsing.html#parsing and nearby sections define ways of turning ugly real-world documents into a parsed structure.

There are parsers at http://code.google.com/p/html5lib/ and http://about.validator.nu/htmlparser/

See also http://ejohn.org/blog/html-5-parsing/

cheers,

Dan
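
To illustrate what a liberal parsing regime buys you, here is a minimal sketch that pulls links out of malformed markup. It uses Python's standard-library html.parser purely as a stand-in: that module is tolerant of broken markup but does not implement the HTML5 parsing algorithm, so a spec-following parser such as html5lib would be the better fit for real use. The names LinkCollector, extract_links and DIRTY_HTML, and the example.org URLs, are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags without requiring well-formed input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical input: unclosed <p> and <div>, an unquoted attribute value,
# misnested <b>/<a>, and a missing </html>. An XML parser would reject this.
DIRTY_HTML = """<html><body>
<p>Unclosed paragraph
<a href=http://example.org/card#me>profile<b>bold</a></b>
<div><a href='http://example.org/other'>other
</body>"""


def extract_links(markup):
    """Return every link found in the markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    for link in extract_links(DIRTY_HTML):
        print(link)
```

The event-based approach here sidesteps tree construction entirely, which is where the HTML5 spec does most of its error-recovery work; if you need a repaired DOM rather than a stream of tags, that is an argument for one of the parsers linked above.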
Received on Friday, 25 November 2011 12:49:35 UTC