Re: how dirty can the HTML be, and still be RDFa?

From: Dan Brickley <danbri@danbri.org>
Date: Fri, 25 Nov 2011 13:49:06 +0100
Message-ID: <CAFNgM+ZYL8vxALHPgRpoXiDRAsarb012oE+BBf+0ByUZHNWSKw@mail.gmail.com>
To: Peter Williams <home_pw@msn.com>
Cc: "public-xg-webid@w3.org" <public-xg-webid@w3.org>
[snip]

Re dirty HTML, this is a very real issue. HTML documents are usually
pretty crappy, standards-wise.

I'd suggest looking into HTML5's approach. They have a much more
liberal parsing regime than XML (this was one of the major drivers for
the original WHATWG/XHTML fork).

So http://www.w3.org/TR/html5/parsing.html#parsing and nearby define
ways of turning ugly real-world documents into a parsed structure.
There are parsers at http://code.google.com/p/html5lib/ and
http://about.validator.nu/htmlparser/
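
To give a feel for what lenient parsing buys you, here is a rough
sketch. Note that it does NOT use html5lib or the HTML5 algorithm: it
uses Python's standard-library html.parser, which is merely a tolerant
tokenizer and builds no tree. The names RDFA_ATTRS, RDFaAttrCollector,
collect_rdfa_attrs and the DIRTY sample document are all made up for
illustration, and the attribute list is only a subset of what RDFa
defines.

```python
from html.parser import HTMLParser

# Illustrative subset of RDFa attribute names (an assumption, not a full list).
RDFA_ATTRS = {"about", "property", "typeof", "rel", "rev",
              "resource", "content", "datatype"}


class RDFaAttrCollector(HTMLParser):
    """Collect RDFa-looking attributes from start tags, ignoring nesting."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.found = []  # list of (tag name, {attribute: value}) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with names lowercased.
        rdfa = {name: value for name, value in attrs if name in RDFA_ATTRS}
        if rdfa:
            self.found.append((tag, rdfa))


def collect_rdfa_attrs(html_text):
    """Return the RDFa-looking attributes found in html_text, in order."""
    parser = RDFaAttrCollector()
    parser.feed(html_text)
    parser.close()
    return parser.found


# Hypothetical tag soup: unquoted attribute values, and p, span and a
# elements that are never closed. An XML parser would reject this.
DIRTY = """<html><body>
<div about="#me" typeof=foaf:Person>
<p>Name: <span property="foaf:name">Alice<br>
<a rel='foaf:knows' href=http://example.org/bob#me>Bob
</div>"""

if __name__ == "__main__":
    for tag, attrs in collect_rdfa_attrs(DIRTY):
        print(tag, attrs)
```

The attributes survive the mess, but working out which subject each
property belongs to needs a real tree, and that is where the HTML5
parsing rules (and parsers that implement them) come in.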

See also http://ejohn.org/blog/html-5-parsing/

cheers,

Dan
Received on Friday, 25 November 2011 12:49:35 GMT