
Re: HTML 5

From: Michael A. Peters <mpeters@mac.com>
Date: Wed, 07 Apr 2010 10:45:55 -0700
Message-id: <4BBCC4D3.6000205@mac.com>
To: Eduard Pascual <herenvardo@gmail.com>
Cc: "T.J. Crowder" <tj@crowdersoftware.com>, gesteehr@googlemail.com, public-html-comments@w3.org
Eduard Pascual wrote:
> (Note: I made a recurrent typo on my previous e-mails: XML's CDATA tag
> is spelled <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The "<!"
> sequence is a legacy from SGML's obscure features. My apologies if
> those mistakes caused any issue; although I hope the idea behind my
> posts was clear enough.)
> On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
>>> <[CDATA[ ... ]]>.  This is far easier to
>>> sanitize (you just need to ensure that the input doesn't include the
>>> "]]>" sequence), thus being more usable on user-provided content.
>> What makes ]]> easier to defend against than </code>?
> As I said, with <![CDATA[ ... ]]> you only need to care about the
> exact sequence "]]>": if it's found within an input, get rid of it or
> somehow fix it (string replacement "]]>" => "]]>]]&gt;<![CDATA[" gets
> the job done safely). With </code> (or even with Arthur's <cdata>
> suggestion, to some degree), things are quite more complex:
> 1) an instance of the "</code>" string may be legitimate within the
> content (if it closes a matching <code ...> within the content).
> 2) due to HTML5's error-handling rules, something other than "</code>"
> may end up closing the initial <code ...>, so a sanitizer would have
> to implement the error-handling rules and play really smart to handle
> those cases. I don't know the rules down to the detail, but IIRC
> something like this: <div> <code> </div> would have the <code> element
> implicitly closed just before the </div>.
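
The "]]>" replacement quoted above can be sketched in a few lines. This is only an illustration in Python, not code from the thread; the function name wrap_in_cdata is hypothetical, and it assumes the text contains only characters that are legal in XML.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded "]]>".

    Hypothetical helper: each "]]>" closes the current section, is emitted
    as the escaped text "]]&gt;", and a new CDATA section is then opened.
    """
    safe = text.replace("]]>", "]]>]]&gt;<![CDATA[")
    return "<![CDATA[" + safe + "]]>"


# Round trip: an XML parser should hand back exactly the original text.
user_input = "if (a[b[0]]>1) { print('</code>') }"
element = ET.fromstring("<code>" + wrap_in_cdata(user_input) + "</code>")
recovered = element.text
```

Only the one terminator sequence needs handling here; a "</code>" inside the input passes through as ordinary character data.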

That's why I just use DOMDocument (libxml2) for all dynamically 
generated code. I don't have to worry about that kind of thing.

User input where markup is allowed is sent through a filter first (HTML 
Tidy in XML mode, followed by HTML Purifier) that fixes it for XML sanity. 
It is then imported into a DOM of its own before the node is 
imported into the DOM that is served to the requesting client.
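
A rough sketch of the "DOM of its own, then import the node" step follows. It uses Python's xml.dom.minidom as a stand-in for PHP's DOMDocument, so it is an assumption about the shape of the approach rather than the actual setup described here. The filtering step is not shown: cleaned_xml is assumed to have already been through it, and import_fragment is a hypothetical name.

```python
from xml.dom.minidom import getDOMImplementation, parseString
from xml.parsers.expat import ExpatError


def import_fragment(page, parent, cleaned_xml):
    """Parse cleaned_xml in its own document, then import its root into page.

    Hypothetical helper. parseString raises ExpatError when the input is
    not well-formed, so broken markup never reaches the served document.
    """
    fragment_doc = parseString(cleaned_xml)
    node = page.importNode(fragment_doc.documentElement, True)  # deep copy
    parent.appendChild(node)
    return node


# The document that would be served to the client (root element assumed).
page = getDOMImplementation().createDocument(None, "div", None)

# A stray "</code>" in the user's text is character data, not markup.
import_fragment(page, page.documentElement,
                "<p>a &lt;/code&gt; b <code>x</code></p>")

# Mis-nested input is rejected at parse time instead of being guessed at.
try:
    import_fragment(page, page.documentElement, "<div><code></div>")
    rejected = False
except ExpatError:
    rejected = True

served = page.documentElement.toxml()
```

Because the output is serialized from the tree rather than concatenated as strings, the served markup is well-formed by construction.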

Code injection is a non-issue for me.

It's a little slower, but you can cache it once it has been done that 
way, making performance an issue only the first time it is assembled or 
Received on Wednesday, 7 April 2010 17:46:37 UTC
