
Re: HTML 5

From: T.J. Crowder <tj@crowdersoftware.com>
Date: Wed, 7 Apr 2010 16:49:32 +0100
Message-ID: <w2nc95470a1004070849w207fac15l378dfff187e61e2d@mail.gmail.com>
To: Eduard Pascual <herenvardo@gmail.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
> > What makes ]]> easier to defend against than </code>?
>
> As I said, with <![CDATA[ ... ]]> you only need to care about the
> exact sequence "]]>": if it's found within an input, get rid of it or
> somehow fix it (string replacement "]]>" => "]]>]]&gt;<![CDATA[" gets
> the job done safely). With </code> (or even with Arthur's <cdata>
> suggestion, to some degree), things are quite more complex:

I don't understand. Any sanitizer *has* to escape < and &. Anything that
only escapes ]]> is worthless. So I don't think the argument that ]]> is
easier for sanitizers to escape holds up. Sanitizers have to do a full job

Agreed that for *human beings*, it's easier to just worry about ]]> when
typing code within HTML rather than worrying about all < and & instances.
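
To illustrate the point above that a sanitizer has to escape < and & regardless, here is a minimal sketch in Python. It is not taken from the thread: the function name escape_for_html is made up for the example, and it assumes the standard library's html.escape is an acceptable stand-in for "a full job" on text that ends up in element content or attribute values.

```python
import html


def escape_for_html(text: str) -> str:
    """Escape &, <, > and both quote characters for use in HTML text."""
    # html.escape replaces & first, then <, >, and (with quote=True) " and '.
    return html.escape(text, quote=True)


if __name__ == "__main__":
    sample = "</code> & ]]>"
    print(escape_for_html(sample))
```

Since > is escaped as well, a "]]>" in the input comes out as "]]&gt;", so a full escaper of this kind already covers the CDATA terminator without a special case.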

If anything were done, I'd prefer the CDATA approach to the <code>...</code>
(or any other tag) approach for the reasons I outlined earlier. I'm not
persuaded that anything needs to be done, although the truly massive number
of unescaped &'s out there suggests that people aren't using tools enough or
our tools are inadequate. (I fall into the former category -- for shame! --
perhaps because of the latter issue.) But I'm not the one who needs
T.J. Crowder
Independent Software Consultant
tj / crowder software / com

On 7 April 2010 16:19, Eduard Pascual <herenvardo@gmail.com> wrote:

> (Note: I made a recurrent typo on my previous e-mails: XML's CDATA tag
> is spelled <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The "<!"
> sequence is a legacy from SGML's obscure features. My apologies if
> those mistakes caused any issue; although I hope the idea behind my
> posts was clear enough.)
> On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <tj@crowdersoftware.com>
> wrote:
> >> <[CDATA[ ... ]]>.  This is far easier to
> >> sanitize (you just need to ensure that the input doesn't include the
> >> "]]>" sequence), thus being more usable on user-provided content.
> >
> > What makes ]]> easier to defend against than </code>?
> As I said, with <![CDATA[ ... ]]> you only need to care about the
> exact sequence "]]>": if it's found within an input, get rid of it or
> somehow fix it (string replacement "]]>" => "]]>]]&gt;<![CDATA[" gets
> the job done safely). With </code> (or even with Arthur's <cdata>
> suggestion, to some degree), things are quite more complex:
> 1) an instance of the "</code>" string may be legitimate within the
> content (if it closes a matching <code ...> within the content).
> 2) due to HTML5's error-handling rules, something other than "</code>"
> may end up closing the initial <code ...>, so a sanitizer would have
> to implement the error-handling rules and play really smart to handle
> those cases. I don't know the rules down to the detail, but IIRC
> something like this: <div> <code> </div> would have the <code> element
> implicitly closed just before the </div>.
> Definitely, a simple string replace is far easier to implement than
> these complex rules.
> > In fact, I think ]]>
> > is probably more susceptible to injection vulnerabilities in the wild than
> > </code> for the simple reason that some naive HTML encoders only encode
> > ampersand and <, not >, and so while the latter would be defeated by extant
> > software (turned into &lt;/code>),
> If you are using that kind of sanitizer, the feature is not needed at
> all: the "dangerous" characters in the content will be escaped. What
> we are discussing here is a mechanism to save the need for escaping on
> cases where it would become too error-prone and kill the HTML's
> readability, due to the huge amount of characters to escape. The main
> example/use-case is when attempting to render HTML samples as part of
> an HTML document: for snippets that go deeper or longer than a few
> elements total, the amount of escapes needed goes nuts.
> In fact, if you run a simple HTML sanitizer and use its output as
> CDATA (regardless of how such CDATA is included into the document),
> you would see something like "&lt;code>...&lt;/code>" showing up on
> your screen when the expectation was to have "<code>...</code>".
> To make things tougher, we still have to remember that:
> 1) A snippet within <code> </code> may be a code fragment, rather than
> stand-alone stuff. For example, it could make sense to write something
> like this:
> <p>To embed code within your document, you wrap it within
> <code>&lt;code&gt;</code> and <code>&lt;/code&gt;</code> tags.</p>
> In the case above things are simple enough to use escaping; but you
> can see how insane things would go if the code in there was
> un-escaped: there is no sane way a browser could figure out whether
> these are fragments, errors, attack attempts, or something else.
> 2) The content may include naive mistakes which are *not* intended to
> be attacks but may trigger an injection anyway.
> 3) A smart attacker might attempt to make the attack pass as errors or
> fragments to try to dodge sanitizers.
> Given any input, after a single str_replace(...) call you can put it
> inside <![CDATA[ ... ]]> and rest assured that the result will look in
> HTML exactly as the input would have looked in plain text (plus
> styling, of course).
> > the former would not be.
> > But as a software engineer, I'm not excited by <code> having special
> > encoding characteristics. In addition to treating < as already-escaped,
> > would & be treated as already-escaped as well? What does that mean for
> > entities? This makes life very difficult for people trying to pre-process
> > user-generated content for display within HTML, the rules become markedly
> > more complex.
> That's why I'm suggesting <![CDATA[ ... ]]> for the job. It has
> several advantages:
> 1) It's already in XML, and hence in XHTML. IIRC, in theory it's also
> in HTML4 and earlier, because it's part of SGML (but most probably
> unsupported, since none of the major browsers has ever been
> implemented with a real SGML parser).
> 2) It's simple to sanitize (a single string replacement operation) and
> to foolproof.
> 3) It works independently of the document structure: it allows for a
> fragment of an element's content to be wrapped, while leaving other
> parts of the same element out. So a single element can contain CDATA
> and child elements.
> 4) The closer string can be properly "escaped" (more exactly, dodged)
> if there is a need to include it within the CDATA content.
> I'll admit that points 2 and 4 might clash on some corner cases, but
> even then a bit of regular expression matching can get the job
> properly done.
> Using a special element for this task (like <cdata>) or an attribute
> to mark special treatment for arbitrary elements (like <code
> type="...">) would kill all the above benefits: would it yield any
> advantage to justify that?
> Regards,
> Eduard Pascual
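
The single string replacement described in the quoted message ("]]>" => "]]>]]&gt;<![CDATA[") can be sketched as follows. This is an illustration only, not code from the thread: the names wrap_in_cdata and round_trip are invented here, Python stands in for the str_replace(...) call mentioned above, and the check uses an XML parser because CDATA sections are an XML feature. It assumes the input contains only characters that are legal in XML.

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA section, dodging any "]]>" it contains."""
    # For each "]]>" in the input: close the current section, emit the
    # sequence as ordinary escaped text ("]]&gt;"), then open a new section.
    safe = text.replace(CDATA_CLOSE, CDATA_CLOSE + "]]&gt;" + CDATA_OPEN)
    return CDATA_OPEN + safe + CDATA_CLOSE


def round_trip(text: str) -> str:
    """Parse the wrapped text as XML and return the text the parser sees."""
    # <pre> is just a hypothetical container element for the check.
    document = "<pre>" + wrap_in_cdata(text) + "</pre>"
    return ET.fromstring(document).text


if __name__ == "__main__":
    sample = "<code>x</code> & a]]>b"
    print(wrap_in_cdata(sample))
    print(round_trip(sample) == sample)
```

The round trip only shows that an XML parser recovers the original text; how an HTML parser would treat the same markup is a separate question and is not covered by this sketch.
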
Received on Wednesday, 7 April 2010 15:50:28 GMT
