[whatwg] HTML 5 parsing - not just for browsers?

A site I use has recently had a number of holes in its filters that 
allowed users to upload JavaScript, an obvious security risk. In the 
time I've spent investigating, helping users protect themselves, and 
helping the website close the holes, it's become pretty obvious that 
string-based filtering (i.e. disallowing certain strings of text) is 
futile. It's impossible to enumerate all the strings that would need to 
be blocked, and even if it were done, new ones would keep appearing.

A better way to filter user-uploaded HTML is to parse it, filter the 
resulting DOM by removing all elements and attributes not on a 
whitelist, and then re-serialize and output the result. That way you can 
be assured that what's output is valid, proven-safe HTML whose non-empty 
elements contain only plaintext content.
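To make the idea concrete, here is a minimal sketch of that 
parse/whitelist/re-serialize approach in Python. It uses the standard 
library's html.parser (a lenient tokenizer, not a true HTML5 parser, so 
it only approximates what a spec-conformant implementation would do), 
and the ALLOWED and DROP_CONTENT tables are illustrative placeholders, 
not a recommended policy:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: allowed tags, and per tag the allowed attributes.
ALLOWED = {
    "p": set(),
    "b": set(),
    "i": set(),
    "a": {"href"},
}

# Elements whose *text content* is also dangerous and must be dropped,
# not just their tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    """Re-emit input, keeping only whitelisted elements and attributes;
    everything else is reduced to escaped plaintext or dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # depth inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED:
            return  # drop the tags; any text content still comes through
        kept = [(k, v) for k, v in attrs if k in ALLOWED[tag] and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
            return
        if not self.skip and tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data))


def sanitize(fragment):
    """Parse a fragment and return the filtered re-serialization."""
    f = WhitelistFilter()
    f.feed(fragment)
    f.close()
    return "".join(f.out)
```

For example, sanitize('<a href="/x" onmouseover="y()">link</a>') keeps 
the <a> and its href but strips the event handler. A real 
implementation would of course want spec-defined error recovery, which 
is exactly where a reusable HTML 5 parsing algorithm would help.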

So, will the HTML 5 parsing section be of use here? Will it be of use to 
things other than browsers? Are there small differences needed because 
what's being parsed is a document fragment rather than a complete 
document? And when it's re-serialized, how closely will today's 
browsers' interpretation of the output match their interpretation of the 
original?


Received on Sunday, 12 February 2006 23:50:57 UTC