- From: dolphinling <dolphinling@myrealbox.com>
- Date: Mon, 13 Feb 2006 02:50:57 -0500
A site I use has recently had a number of holes in its filters that allowed users to upload JavaScript, an obvious security risk. In the time I've spent investigating, helping users protect themselves, and helping the website close the holes, it's become pretty obvious that string-based filtering (i.e. disallowing certain strings of text) is futile. It's impossible to enumerate all the strings that would need to be blocked, and even if it were done, new ones would keep appearing.

A better way to filter user-uploaded HTML is to parse it, filter the DOM by removing all elements and attributes not on a whitelist, and then re-serialize it and output that. That way you can be assured that what's output is (valid) proven-safe HTML, with plain-text contents in its non-empty elements.

So, will the HTML 5 parsing section be of use here? Will it be of use to things other than browsers? Are there small differences needed because what's being parsed is a document fragment instead of a document? And when it's re-serialized, how closely will today's browsers interpret the original and the new markup?

--
dolphinling
<http://dolphinling.net/>
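A minimal sketch of the whitelist approach described above, assuming Python and its standard-library `html.parser` rather than an HTML5-conforming parser; the tag/attribute whitelists, the `WhitelistSanitizer` class, and the `sanitize()` function are illustrative names, not part of any existing library:

```python
# Sketch: parse user-supplied HTML, keep only whitelisted elements/attributes,
# escape everything else as plain text, and re-serialize the result.
from html import escape
from html.parser import HTMLParser

# Illustrative whitelists; a real filter would choose these per site policy.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a", "ul", "ol", "li", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
VOID_TAGS = {"br"}


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the element's tags; its text content is still emitted
        parts = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                continue  # attribute not on the whitelist for this element
            value = (value or "").strip()
            if value.lower().startswith("javascript:"):
                continue  # block javascript: URLs even in allowed attributes
            parts.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(parts)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data))  # all character data becomes plain text


def sanitize(fragment):
    """Return a whitelist-filtered, re-serialized version of an HTML fragment."""
    parser = WhitelistSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = '<p onclick="evil()">Hi <script>alert(1)</script><a href="javascript:x">click</a></p>'
    print(sanitize(dirty))
    # -> <p>Hi alert(1)<a>click</a></p>
```

Note that in this sketch the text inside a disallowed element (e.g. the body of a `<script>`) is kept as escaped plain text; a production filter might drop such subtrees entirely, and it would want a parser implementing the HTML 5 parsing algorithm so that its error handling matches what browsers actually do with malformed input.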