- From: T.J. Crowder <tj@crowdersoftware.com>
- Date: Wed, 7 Apr 2010 16:49:32 +0100
- To: Eduard Pascual <herenvardo@gmail.com>
- Cc: gesteehr@googlemail.com, public-html-comments@w3.org
- Message-ID: <w2nc95470a1004070849w207fac15l378dfff187e61e2d@mail.gmail.com>
> > What makes ]]> easier to defend against than </code>?
>
> As I said, with <![CDATA[ ... ]]> you only need to care about the
> exact sequence "]]>": if it's found within an input, get rid of it or
> somehow fix it (string replacement "]]>" => "]]>]]><![CDATA[" gets
> the job done safely). With </code> (or even with Arthur's <cdata>
> suggestion, to some degree), things are quite more complex:

I don't understand. Any sanitizer *has* to escape < and &. Anything that
only escapes ]]> is worthless. So I don't think the argument that ]]> is
easier for sanitizers to escape holds up. Sanitizers have to do a full job
regardless.

Agreed that for *human beings*, it's easier to just worry about ]]> when
typing code within HTML rather than worrying about all < and & instances.

If anything were done, I'd prefer the CDATA approach to the <code>...</code>
(or any other tag) approach for the reasons I outlined earlier. I'm not
persuaded that anything needs to be done, although the truly massive number
of unescaped &'s out there suggests that people aren't using tools enough or
our tools are inadequate. (I fall into the former category -- for shame! --
perhaps because of the latter issue.) But I'm not the one who needs
persuading.

--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com

On 7 April 2010 16:19, Eduard Pascual <herenvardo@gmail.com> wrote:

> (Note: I made a recurrent typo on my previous e-mails: XML's CDATA tag
> is spelled <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The "<!"
> sequence is a legacy from SGML's obscure features. My apologies if
> those mistakes caused any issue; although I hope the idea behind my
> posts was clear enough.)
>
> On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
> >> <[CDATA[ ... ]]>. This is far easier to
> >> sanitize (you just need to ensure that the input doesn't include the
> >> "]]>" sequence), thus being more usable on user-provided content.
> >
> > What makes ]]> easier to defend against than </code>?
> As I said, with <![CDATA[ ... ]]> you only need to care about the
> exact sequence "]]>": if it's found within an input, get rid of it or
> somehow fix it (string replacement "]]>" => "]]>]]><![CDATA[" gets
> the job done safely). With </code> (or even with Arthur's <cdata>
> suggestion, to some degree), things are quite more complex:
> 1) an instance of the "</code>" string may be legitimate within the
> content (if it closes a matching <code ...> within the content).
> 2) due to HTML5's error-handling rules, something other than "</code>"
> may end up closing the initial <code ...>, so a sanitizer would have
> to implement the error-handling rules and play really smart to handle
> those cases. I don't know the rules down to the detail, but IIRC
> something like this: <div> <code> </div> would have the <code> element
> implicitly closed just before the </div>.
>
> Definitely, a simple string replace is far easier to implement than
> these complex rules.
>
> > In fact, I think ]]>
> > is probably more susceptible to injection vulnerabilities in the wild than
> > </code> for the simple reason that some naive HTML encoders only encode
> > ampersand and <, not >, and so while the latter would be defeated by extant
> > software (turned into &lt;/code>),
> If you are using that kind of sanitizer, the feature is not needed at
> all: the "dangerous" characters in the content will be escaped. What
> we are discussing here is a mechanism to save the need for escaping on
> cases where it would become too error-prone and kill the HTML's
> readability, due to the huge amount of characters to escape. The main
> example/use-case is when attempting to render HTML samples as part of
> an HTML document: for snippets that go deeper or longer than a few
> elements total, the amount of escapes needed goes nuts.
>
> In fact, if you run a simple HTML sanitizer and use its output as
> CDATA (regardless of how such CDATA is included into the document),
> you would see something like "&lt;code&gt;...&lt;/code&gt;" showing up on
> your screen when the expectation was to have "<code>...</code>".
>
> To make things tougher, we still have to remember that:
> 1) A snippet within <code> </code> may be a code fragment, rather than
> stand-alone stuff. For example, it could make sense to write something
> like this:
> <p>To embed code within your document, you wrap it within
> <code>&lt;code&gt;</code> and <code>&lt;/code&gt;</code> tags.</p>
> In the case above things are simple enough to use escaping; but you
> can see how insane things would go if the code in there was
> un-escaped: there is no sane way a browser could figure out whether
> these are fragments, errors, attack attempts, or something else.
> 2) The content may include naive mistakes which are *not* intended to
> be attacks but may trigger an injection anyway.
> 3) A smart attacker might attempt to make the attack pass as errors or
> fragments to try to dodge sanitizers.
>
> Given any input, after a single str_replace(...) call you can put it
> inside <![CDATA[ ... ]]> and rest assured that the result will look in
> HTML exactly as the input would have looked in plain text (plus
> styling, of course).
>
> > the former would not be.
> > But as a software engineer, I'm not excited by <code> having special
> > encoding characteristics. In addition to treating < as already-escaped,
> > would & be treated as already-escaped as well? What does that mean for
> > entities? This makes life very difficult for people trying to pre-process
> > user-generated content for display within HTML, the rules become markedly
> > more complex.
> That's why I'm suggesting <![CDATA[ ... ]]> for the job. It has
> several advantages:
> 1) It's already in XML, and hence in XHTML.
> IIRC, in theory it's also
> in HTML4 and earlier, because it's part of SGML (but most probably
> unsupported, since none of the major browsers has ever been
> implemented with a real SGML parser).
> 2) It's simple to sanitize (a single string replacement operation) and
> to foolproof.
> 3) It works independently of the document structure: it allows for a
> fragment of an element's content to be wrapped, while leaving other
> parts of the same element out. So a single element can contain CDATA
> and child elements.
> 4) The closer string can be properly "escaped" (more exactly, dodged)
> if there is a need to include it within the CDATA content.
>
> I'll admit that points 2 and 4 might clash on some corner cases, but
> even then a bit of regular expression matching can get the job
> properly done.
>
> Using a special element for this task (like <cdata>) or an attribute
> to mark special treatment for arbitrary elements (like <code
> type="...">) would kill all the above benefits: would it yield any
> advantage to justify that?
>
>
> Regards,
> Eduard Pascual
>
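For anyone following the thread, the two points being argued can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from either poster: the names wrap_cdata, naive_encode and round_trip are hypothetical, and the split used here ("]]>" => "]]]]><![CDATA[>") is the conventional XML idiom rather than the exact replacement string quoted in the message, chosen because it keeps the ">" inside a CDATA section so the result stays well-formed XML.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any embedded "]]>" (hypothetical helper).

    "]]>" cannot appear inside a CDATA section, so each occurrence ends the
    current section after "]]" and opens a new one that starts with ">".
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def naive_encode(text: str) -> str:
    """Hypothetical naive encoder: escapes only "&" and "<", never ">"."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def round_trip(text: str) -> str:
    """Parse the wrapped text back out of a dummy <r> element (XML parser)."""
    return ET.fromstring("<r>" + wrap_cdata(text) + "</r>").text


if __name__ == "__main__":
    sample = "x = a[b[0]]> y; </code> & <div>"
    # The wrapped sample parses back to the original characters.
    print(round_trip(sample) == sample)
    # The naive encoder breaks "</code>" but leaves "]]>" untouched.
    print(naive_encode("</code>"), naive_encode("]]>"))
```

The round trip goes through an XML parser only, so it says nothing about how an HTML parser would treat the same markup; that is the open question in the thread.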
Received on Wednesday, 7 April 2010 15:50:28 UTC