Re: HTML 5 from Eduard Pascual on 2010-04-07 (public-html-comments@w3.org from April 2010)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Wed, 7 Apr 2010 17:19:18 +0200
To: "T.J. Crowder" <tj@crowdersoftware.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <u2p6ea53251004070819x1649bb7ci5d302c0bf5f933c3@mail.gmail.com>
(Note: I made a recurring typo in my previous e-mails: XML's CDATA tag
is spelled <![CDATA[ ... ]]> rather than <[CDATA[ ... ]]>. The "<!"
sequence is a legacy from SGML's obscure features. My apologies if
those mistakes caused any issue; although I hope the idea behind my
posts was clear enough.)

On Wed, Apr 7, 2010 at 7:49 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
>> <[CDATA[ ... ]]>.  This is far easier to
>> sanitize (you just need to ensure that the input doesn't include the
>> "]]>" sequence), thus being more usable on user-provided content.
>
> What makes ]]> easier to defend against than </code>?
As I said, with <![CDATA[ ... ]]> you only need to care about the
exact sequence "]]>": if it's found within an input, get rid of it or
somehow fix it (string replacement "]]>" => "]]>]]&gt;<![CDATA[" gets
the job done safely). With </code> (or even with Arthur's <cdata>
suggestion, to some degree), things are considerably more complex:
1) an instance of the "</code>" string may be legitimate within the
content (if it closes a matching <code ...> within the content).
2) due to HTML5's error-handling rules, something other than "</code>"
may end up closing the initial <code ...>, so a sanitizer would have
to implement the error-handling rules and play really smart to handle
those cases. I don't know the rules in full detail, but IIRC, in
something like this: <div> <code> </div>, the <code> element would be
implicitly closed just before the </div>.

Definitely, a simple string replace is far easier to implement than
these complex rules.
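
As a rough illustration of the string replacement described above, here is a minimal Python sketch. The function name wrap_in_cdata and the constants are made up for this example; only the replacement string itself comes from the message.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA section, handling any embedded "]]>".

    Hypothetical helper: the name and signature are assumptions made
    for this sketch, not part of any library.
    """
    # Each "]]>" in the input closes the current section, is emitted as
    # the escaped text "]]&gt;", and is followed by a newly opened section.
    safe = text.replace(CDATA_CLOSE, CDATA_CLOSE + "]]&gt;" + CDATA_OPEN)
    return CDATA_OPEN + safe + CDATA_CLOSE


print(wrap_in_cdata("<div> <code> </div>"))
print(wrap_in_cdata("a ]]> b"))
```

Input without the closing sequence is wrapped untouched; "a ]]> b" comes out as three pieces: a CDATA section, the escaped "]]&gt;", and a second CDATA section.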

> In fact, I think ]]>
> is probably more susceptible to injection vulnerabilities in the wild than
> </code> for the simple reason that some naive HTML encoders only encode
> ampersand and <, not >, and so while the latter would be defeated by extant
> software (turned into &lt;/code>),
If you are using that kind of sanitizer, the feature is not needed at
all: the "dangerous" characters in the content will be escaped. What
we are discussing here is a mechanism to avoid the need for escaping in
cases where it would become too error-prone and kill the HTML's
readability, due to the huge number of characters to escape. The main
example/use-case is when attempting to render HTML samples as part of
an HTML document: for snippets that go deeper or longer than a few
elements total, the number of escapes needed goes nuts.

In fact, if you run a simple HTML sanitizer and use its output as
CDATA (regardless of how such CDATA is included into the document),
you would see something like "&lt;code>...&lt;/code>" showing up on
your screen when the expectation was to have "<code>...</code>".
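
The double-escaping effect can be reproduced with a small Python sketch. It assumes an XML parser (the standard library's xml.etree.ElementTree) as a stand-in for any CDATA-aware parser, and naive_encode is a hypothetical encoder that escapes only "&" and "<", as described in the quoted text.

```python
import xml.etree.ElementTree as ET


def naive_encode(text: str) -> str:
    # Hypothetical "naive" encoder: escapes & and <, but leaves > alone.
    return text.replace("&", "&amp;").replace("<", "&lt;")


sample = "<code>...</code>"
encoded = naive_encode(sample)

# Put the already-escaped text inside a CDATA section. The <pre> wrapper
# is only there to give the parser a root element.
doc = "<pre><![CDATA[" + encoded + "]]></pre>"

# Inside CDATA nothing is unescaped, so the entity text is shown verbatim.
shown = ET.fromstring(doc).text
print(shown)
```

The parsed text is the escaped form, not the original sample, which is why escaping and CDATA should not be stacked.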

To make things tougher, we still have to remember that:
1) A snippet within <code> </code> may be a code fragment, rather than
stand-alone stuff. For example, it could make sense to write something
like this:
<p>To embed code within your document, you wrap it within
<code>&lt;code&gt;</code> and <code>&lt;/code&gt;</code> tags.</p>
In the case above things are simple enough to use escaping; but you
can see how insane things would get if the code in there were
un-escaped: there is no sane way a browser could figure out whether
these are fragments, errors, attack attempts, or something else.
2) The content may include naive mistakes which are *not* intended to
be attacks but may trigger an injection anyway.
3) A smart attacker might attempt to make the attack pass as errors or
fragments to try to dodge sanitizers.

Given any input, after a single str_replace(...) call you can put it
inside <![CDATA[ ... ]]> and rest assured that the result will look in
HTML exactly as the input would have looked in plain text (plus
styling, of course).
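
One way to check that claim is a round trip through a parser that understands CDATA. The sketch below uses Python's xml.etree.ElementTree for that purpose, which is an assumption of this example (the thread is about HTML, where support was the open question). The helper names are hypothetical, and the check only covers the "]]>" issue, not characters that XML forbids outright.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    # Hypothetical helper: the single replacement, then the wrapper.
    return "<![CDATA[" + text.replace("]]>", "]]>]]&gt;<![CDATA[") + "]]>"


def round_trip(text: str) -> str:
    # Embed the wrapped text in a throwaway <pre> root and read back
    # what the parser reports as the element's character data.
    doc = "<pre>" + wrap_in_cdata(text) + "</pre>"
    return ET.fromstring(doc).text or ""


samples = [
    "<div> <code> </div>",   # mismatched tags, as in the earlier example
    "a ]]> b",               # the closing sequence itself
    "]]]>",                  # closing sequence with an extra bracket
    "&lt; & </code>",        # entities and a stray end tag
    "]]>]]>",                # two closing sequences back to back
]

for s in samples:
    assert round_trip(s) == s
print("all samples survived the round trip")
```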

> the former would not be.
> But as a software engineer, I'm not excited by <code> having special
> encoding characteristics. In addition to treating < as already-escaped,
> would & be treated as already-escaped as well? What does that mean for
> entities? This makes life very difficult for people trying to pre-process
> user-generated content for display within HTML, the rules become markedly
> more complex.
That's why I'm suggesting <![CDATA[ ... ]]> for the job. It has
several advantages:
1) It's already in XML, and hence in XHTML. IIRC, in theory it's also
in HTML4 and earlier, because it's part of SGML (but most probably
unsupported, since none of the major browsers has ever been
implemented with a real SGML parser).
2) It's simple to sanitize (a single string replacement operation) and
to foolproof.
3) It works independently of the document structure: it allows for a
fragment of an element's content to be wrapped, while leaving other
parts of the same element out. So a single element can contain CDATA
and child elements.
4) The closer string can be properly "escaped" (more exactly, dodged)
if there is a need to include it within the CDATA content.
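
Point 3 can be seen in a short Python sketch, again assuming an XML parser (xml.etree.ElementTree) as the stand-in; the sample paragraph is made up for illustration.

```python
import xml.etree.ElementTree as ET

# One <p> element holding plain text, a CDATA fragment, and a child element.
doc = "<p>Wrap it within <![CDATA[<code>]]> and <em>escape nothing</em>.</p>"

p = ET.fromstring(doc)

# The CDATA fragment is merged into the surrounding text of <p>...
print(repr(p.text))

# ...while <em> is still parsed as a normal child element.
print(p[0].tag, repr(p[0].text), repr(p[0].tail))
```

Only the wrapped fragment is treated as literal text; the rest of the element keeps its normal markup.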

I'll admit that points 2 and 4 might clash in some corner cases, but
even then a bit of regular expression matching can get the job
done properly.

Using a special element for this task (like <cdata>) or an attribute
to mark special treatment for arbitrary elements (like <code
type="...">) would kill all the above benefits: would it yield any
advantage to justify that?


Regards,
Eduard Pascual
Received on Wednesday, 7 April 2010 15:20:11 UTC