Re: "cleaning HTML for security"

For a generic API, I would say the main thing is to remove all forms of
scripting (<script> tags, onX attributes, etc.) and all references to
external plugins; these are security holes if they are allowed to run
in the context of your application and are impossible to reliably
sanitise. Browsers implementing this have a huge advantage in that they
already have a full HTML5 parser. If they need to sanitise an HTML
string they can parse it into a tree; it is then trivial to walk the
tree, use a whitelist of allowed tags and attributes to strip out any
scripting elements, and serialise the result again. External systems
doing this need to be wary of all the ways they can be tricked into
parsing something one way, only for the browser to interpret it
differently, so that what you thought was a safe string suddenly ends
up inside a script tag, etc.
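
To make the parse / walk / re-serialise idea concrete, here is a
minimal sketch in Python using the standard library's html.parser. It
is not a full HTML5 parser, so it is exactly the kind of "external
system" that can disagree with a browser; the whitelist contents, the
URL prefix check and the name sanitise() are assumptions made up for
illustration, not part of any proposed API.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; a real one depends on the application.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br", "ul", "ol", "li"}
ALLOWED_ATTRS = {"a": {"href", "title"}}
VOID_TAGS = {"br"}
# Elements whose contents are discarded along with the tags.
DROP_WITH_CONTENT = {"script", "style", "object", "iframe"}
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")


class WhitelistSanitiser(HTMLParser):
    """Re-serialises only whitelisted tags and attributes; text is re-escaped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.drop_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.drop_depth += 1
            return
        if self.drop_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags are removed, their text content is kept
        kept = []
        for name, value in attrs:
            # Anything not whitelisted (including every onX handler) is dropped.
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            value = value or ""
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue  # rejects javascript: and other unexpected schemes
            kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
            return
        if self.drop_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.drop_depth:
            self.out.append(escape(data, quote=False))


def sanitise(html_text):
    parser = WhitelistSanitiser()
    parser.feed(html_text)
    parser.close()  # flushes any buffered text
    return "".join(parser.out)
```

The output is built only from whitelisted names and re-escaped text,
rather than by deleting "bad" pieces from the input string, so a
construct the parser misunderstands tends to come out as inert text
rather than as live markup.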

One thing I would say is <noscript> contents should be *included* (since
you're removing the scripting) but the tags themselves removed (since
the result will probably be injected into a context that *does* support
scripting). We once had a security bug due to the unexpected interaction
between <noscript> and HTML comments.
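
A rough sketch of "keep the contents, drop the tags", again in Python
with html.parser. It assumes the parser treats <noscript> contents as
ordinary markup (html.parser's default), discards comments entirely,
and drops all attributes for brevity; unwrap_noscript() is a
hypothetical name. The comment case in the usage note is only one
plausible shape of such an interaction, not necessarily the bug
mentioned above.

```python
from html import escape
from html.parser import HTMLParser

UNWRAP_TAGS = {"noscript"}  # tags removed while their children are kept


class NoscriptUnwrapper(HTMLParser):
    """Re-serialises markup without <noscript> tags and without comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in UNWRAP_TAGS:
            # Attributes are dropped in this sketch; a real sanitiser
            # would filter them through a whitelist instead.
            self.out.append("<{}>".format(tag))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit the start tag only.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in UNWRAP_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        pass  # comments never reach the output


def unwrap_noscript(html_text):
    parser = NoscriptUnwrapper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, unwrap_noscript('<noscript><p>Enable JS</p></noscript>')
gives '<p>Enable JS</p>'. An input such as
'<noscript><!-- </noscript><b>x</b> --></noscript>' is a single comment
to this parser, while a scripting-enabled browser ends the <noscript>
at the first closing tag and sees a live <b> element; dropping comments
and re-serialising removes that ambiguity from the output.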

CSS is a tricky one; what should be allowed depends on the context
it will be inserted into. In a webmail system you have to be careful
to avoid conflict with existing styles and not allow the content to
draw over your UI, which could lead to phishing attacks. iframes can
be used to sandbox content, but unfortunately bring a whole world of
other problems. However, the nice thing about CSS is that
the JS of the application handling the paste can be used to sanitise
the CSS as required, so I don't think it's necessary to do this at
the browser level.
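
As an illustration of the kind of application-level CSS filtering I
mean, here is a small sketch for inline style attributes. It is written
in Python to match the sketches above, though in practice the same
logic would live in the application's JS. The property whitelist and
the rejected value patterns are assumptions suited to something like a
webmail UI, and sanitise_style() is a hypothetical helper.

```python
import re

# Hypothetical whitelist: nothing that positions content or loads resources.
ALLOWED_PROPERTIES = {
    "color", "background-color", "font-weight",
    "font-style", "text-align", "text-decoration",
}
# Values containing any of these are rejected outright rather than repaired.
FORBIDDEN_VALUE = re.compile(
    r"url\s*\(|expression\s*\(|@import|/\*|[\\<>\"']", re.IGNORECASE
)


def sanitise_style(style):
    """Return only the whitelisted declarations from an inline style string."""
    kept = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep or not value or name not in ALLOWED_PROPERTIES:
            continue  # e.g. position, z-index, top: could draw over the UI
        if FORBIDDEN_VALUE.search(value):
            continue
        kept.append("{}: {}".format(name, value))
    return "; ".join(kept)
```

With this, sanitise_style("color: red; position: fixed; top: 0")
returns "color: red". Splitting on ";" is naive, which is why quotes,
backslashes and comments in values are simply refused.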

Neil.

Received on Monday, 17 November 2014 06:45:29 UTC