Re: how dirty can the HTML be, and still be RDFa? from Dan Brickley on 2011-11-25 (public-xg-webid@w3.org from November 2011)

From: Dan Brickley <danbri@danbri.org>
Date: Fri, 25 Nov 2011 13:49:06 +0100
To: Peter Williams <home_pw@msn.com>
Cc: "public-xg-webid@w3.org" <public-xg-webid@w3.org>
Message-ID: <CAFNgM+ZYL8vxALHPgRpoXiDRAsarb012oE+BBf+0ByUZHNWSKw@mail.gmail.com>

[snip]

Re dirty HTML, this is a very real issue. HTML documents are usually
pretty crappy, standards-wise.

I'd suggest looking into HTML5's approach. They have a much more
liberal parsing regime than XML (this was one of the major drivers for
the original WHATWG/XHTML fork).

So http://www.w3.org/TR/html5/parsing.html#parsing and the nearby
sections define ways of turning ugly real-world documents into a parsed
structure. There are parsers at http://code.google.com/p/html5lib/ and
http://about.validator.nu/htmlparser/
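
To make the idea concrete, here is a rough sketch of pulling RDFa-style
attributes out of deliberately broken markup. Big caveat: it uses
Python's standard-library html.parser, which is tolerant of bad markup
but is not an implementation of the HTML5 parsing algorithm (html5lib
is the one that aims at that), and it does no tree building or real
RDFa processing. The class name, the attribute subset and the sample
markup are all made up for illustration.

```python
from html.parser import HTMLParser

# Attribute names treated as RDFa-relevant in this sketch (an illustrative subset).
RDFA_ATTRS = {"about", "property", "typeof", "rel", "resource"}


class RdfaAttrCollector(HTMLParser):
    """Hypothetical helper: records RDFa-style attributes seen on start tags."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (tag, attribute, value) tuples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in RDFA_ATTRS:
                self.found.append((tag, name, value))


# Deliberately dirty input: an unquoted attribute value, unclosed <p> and
# <span> elements, and no <html> or <body> wrapper.
dirty = '<div about="#me" typeof=foaf:Person><p>Name: <span property="foaf:name">Dan</div>'

collector = RdfaAttrCollector()
collector.feed(dirty)
collector.close()
print(collector.found)
```

An XML parser would reject that input outright; a tolerant parser still
reports the attributes. Working out which element is nested inside
which, so that the triples come out right, is the part the HTML5 tree
construction rules cover.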

See also http://ejohn.org/blog/html-5-parsing/

cheers,

Dan

Received on Friday, 25 November 2011 12:49:35 UTC