- From: Eduard Pascual <herenvardo@gmail.com>
- Date: Wed, 7 Apr 2010 17:19:18 +0200
- To: "T.J. Crowder" <tj@crowdersoftware.com>
- Cc: gesteehr@googlemail.com, public-html-comments@w3.org
(Note: I made a recurrent typo in my previous e-mails: XML's CDATA section is written <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The "<!" sequence is a legacy of SGML's more obscure features. My apologies if those mistakes caused any confusion; I hope the idea behind my posts was clear enough.)

On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
>> <[CDATA[ ... ]]>. This is far easier to
>> sanitize (you just need to ensure that the input doesn't include the
>> "]]>" sequence), thus being more usable on user-provided content.
>
> What makes ]]> easier to defend against than </code>?

As I said, with <![CDATA[ ... ]]> you only need to care about the exact sequence "]]>": if it's found within an input, get rid of it or somehow fix it (string replacement "]]>" => "]]>]]><![CDATA[" gets the job done safely). With </code> (or even with Arthur's <cdata> suggestion, to some degree), things are considerably more complex:
1) An instance of the "</code>" string may be legitimate within the content (if it closes a matching <code ...> within the content).
2) Due to HTML5's error-handling rules, something other than "</code>" may end up closing the initial <code ...>, so a sanitizer would have to implement the error-handling rules and play really smart to handle those cases. I don't know the rules down to the detail, but IIRC something like this:
<div> <code> </div>
would have the <code> element implicitly closed just before the </div>.
Definitely, a simple string replace is far easier to implement than these complex rules.

> In fact, I think ]]>
> is probably more susceptible to injection vulnerabilities in the wild than
> </code> for the simple reason that some naive HTML encoders only encode
> ampersand and <, not >, and so while the latter would be defeated by extant
> software (turned into &lt;/code>),

If you are using that kind of sanitizer, the feature is not needed at all: the "dangerous" characters in the content will be escaped.
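As a rough illustration of the single string replacement discussed in this message, here is a minimal Python sketch. The names wrap_cdata and round_trip are made up for the example. It deliberately uses the common variant "]]>" => "]]]]><![CDATA[>", which splits the sequence across two CDATA sections, rather than the replacement string quoted earlier: that string would leave a bare "]]>" between two sections, which XML does not allow in character data. The check at the end relies on Python's standard xml.etree.ElementTree, so it exercises XML parsing rules only and says nothing about how an HTML parser would behave.

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]" ends up at the tail of one section and ">" at the head of the
    # next, so the closing sequence never appears intact inside a section.
    safe = text.replace(CDATA_CLOSE, "]]" + CDATA_CLOSE + CDATA_OPEN + ">")
    return CDATA_OPEN + safe + CDATA_CLOSE


def round_trip(text: str) -> str:
    """Parse the wrapped text as XML and return what a reader would see."""
    element = ET.fromstring("<code>" + wrap_cdata(text) + "</code>")
    # Adjacent CDATA sections are reported as one run of character data.
    return element.text or ""


# Hypothetical user input: contains "]]>" by accident, plus markup.
sample = 'if (a[b[0]]>1) { print("</code>"); }'
assert round_trip(sample) == sample
```

Nothing in the input other than "]]>" needs any attention: the "</code>" inside the sample comes back as plain text.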
What we are discussing here is a mechanism to avoid the need for escaping in cases where it would become too error-prone and kill the HTML's readability, due to the huge number of characters to escape. The main example/use-case is when attempting to render HTML samples as part of an HTML document: for snippets that go deeper or longer than a few elements total, the number of escapes needed goes nuts. In fact, if you run a simple HTML sanitizer and use its output as CDATA (regardless of how such CDATA is included into the document), you would see something like "&lt;code&gt;...&lt;/code&gt;" showing up on your screen when the expectation was to have "<code>...</code>".

To make things tougher, we still have to remember that:
1) A snippet within <code> </code> may be a code fragment, rather than stand-alone stuff. For example, it could make sense to write something like this:
<p>To embed code within your document, you wrap it within <code>&lt;code&gt;</code> and <code>&lt;/code&gt;</code> tags.</p>
In the case above things are simple enough to use escaping; but you can see how insane things would get if the code in there was un-escaped: there is no sane way a browser could figure out whether these are fragments, errors, attack attempts, or something else.
2) The content may include naive mistakes which are *not* intended to be attacks but may trigger an injection anyway.
3) A smart attacker might attempt to make the attack pass as errors or fragments to try to dodge sanitizers.

Given any input, after a single str_replace(...) call you can put it inside <![CDATA[ ... ]]> and rest assured that the result will look in HTML exactly as the input would have looked in plain text (plus styling, of course).

> the former would not be.
> But as a software engineer, I'm not excited by <code> having special
> encoding characteristics. In addition to treating < as already-escaped,
> would & be treated as already-escaped as well? What does that mean for
> entities?
> This makes life very difficult for people trying to pre-process
> user-generated content for display within HTML, the rules become markedly
> more complex.

That's why I'm suggesting <![CDATA[ ... ]]> for the job. It has several advantages:
1) It's already in XML, and hence in XHTML. IIRC, in theory it's also in HTML4 and earlier, because it's part of SGML (but most probably unsupported, since none of the major browsers has ever been implemented with a real SGML parser).
2) It's simple to sanitize (a single string replacement operation) and to foolproof.
3) It works independently of the document structure: it allows a fragment of an element's content to be wrapped, while leaving other parts of the same element out. So a single element can contain both CDATA and child elements.
4) The closing string can be properly "escaped" (more exactly, dodged) if there is a need to include it within the CDATA content.
I'll admit that points 2 and 4 might clash in some corner cases, but even then a bit of regular expression matching can get the job done properly.

Using a special element for this task (like <cdata>) or an attribute to mark special treatment for arbitrary elements (like <code type="...">) would kill all the above benefits: would it yield any advantage to justify that?

Regards,
Eduard Pascual
Received on Wednesday, 7 April 2010 15:20:11 UTC