- From: Edward Z. Yang <edwardzyang@thewritingpot.com>
- Date: Sun, 14 Dec 2008 16:37:40 -0500
Hello all,

I was curious to know how stable/complete HTML 5's tokenizing and DOM algorithms are (specifically section 8). A cursory glance through the section reveals a few red warning boxes, but these are largely questions of whether or not the specification should follow browser implementations, not actual errors in the specification.

The reason I'd like to know this is that I am the author of a tool named HTML Purifier, which takes user-input HTML and cleans it for standards compliance as well as against XSS. We insist on standards-compliant output because the result is unambiguous. As far as I can tell, this is quite unlike the tools that HTML5 is geared towards: compliance checkers, user agents and data miners. There certainly is overlap: we have our own parsing and DOM-building algorithms which work decently well, although they do trip up on a number of edge cases (active formatting elements being one notable example). However, using the HTML5 algorithm wholesale is not possible for several reasons:

1. Users input HTML fragments, not actual HTML documents. A parser I would use needs to be able to enter parsing in a specific state, and has to ignore any requests by the user to exit that state (e.g. a </body> tag); see the sketch at the end of this message.

2. No one actually codes their HTML in HTML5 (yet), so the only parts of the algorithm I want to use are the ones that emulate browser behavior with HTML4. However, HTML5 interweaves its additions with the browser research it has done.

I'd be really interested to hear what you all have to say about this matter. Thanks!

Cheers,
Edward
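P.S. For concreteness, here is a minimal sketch of the fragment-parsing behavior I mean, using the Python html5lib library (one implementation of the spec's fragment parsing algorithm). The input string and container element are just illustrative:

    import html5lib

    # parseFragment() starts the tree builder in the insertion mode
    # implied by the "container" context element, instead of at the
    # start of a full document.
    fragment = html5lib.parseFragment(
        "<b>user input</b></body><p>more",  # stray </body> from the user
        container="div",                    # parse as if inside a <div>
        namespaceHTMLElements=False,
    )

    # The stray </body> end tag is a parse error and is ignored; the
    # resulting fragment still contains both the <b> and <p> elements.
    for child in fragment:
        print(child.tag)  # b, then p

This is the behavior I would want from a reusable parser: start in the fragment state implied by a context element, and refuse to let user input escape it.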
Received on Sunday, 14 December 2008 13:37:40 UTC