Re: HTML 5 from Eduard Pascual on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Thu, 8 Apr 2010 19:22:09 +0200
To: "T.J. Crowder" <tj@crowdersoftware.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <m2p6ea53251004081022nf8df86bdu2febec4b76c1ed97@mail.gmail.com>
On Thu, Apr 8, 2010 at 11:02 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
> 1. I don't think the rules are that much simpler, replacing ]]> vs.
> replacing <, &, and (IMHO) >. (You have to deal with characters that need to
> be entities or what-have-you in the encoding of the page *regardless*; you
> can't dump out characters that are invalid in your encoding in a CDATA block
> any more than you can elsewhere).
That depends on the cases you are looking at. As I already mentioned,
the injection concern arises only in a small subset of corner cases,
which are, in addition, blatant examples of bad design. That's why I
have said many times that it's just a fool-proofing task.
The specific, nearly insane case could be described as having
"hybrid" content, where one fragment goes inside a cdata'ish block
but the rest doesn't. For example, imagine a discussion board that
attempts to implement a [code=html] bbcode as a <code type="html">
element: the part that goes inside [code...]...[/code] shouldn't
be escaped, since doing so would cause the page to render the escapes
rather than the characters they represent. This means that the poster
could add extraneous </code> tags to trigger unexpected behavior of
the page/UA. Of course, it's possible to escape the </code>'s there to
prevent this, but then an opening <code> would break everything on the
page below the [code] block, since the </code> corresponding to the
closing bbtag would be paired with the inner <code>, not with the
original opener (since the inner's closer would have been escaped). We
could keep piling further "if"s onto this, but in the end there is no
approach that prevents both injection and page breaking.
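
As a rough illustration of that dilemma, here is a toy model in Python.
It is only a sketch: the <code type="html"> element is the hypothetical
feature discussed in this thread, and render() and outer_close_index()
are names made up for the example. The model simply assumes that <code>
tags nest and that the outer block ends at its matching </code>; real
UAs may behave differently.

```python
import re

TAG = re.compile(r"<(/?)code\b[^>]*>")

def outer_close_index(html: str) -> int:
    """Index of the </code> paired with the first <code ...> opener.

    Assumes plain nesting of <code> tags; returns -1 if the outer
    block is never closed.
    """
    depth = 0
    for m in TAG.finditer(html):
        if m.group(1) == "":
            depth += 1          # an opener, the template's or the poster's
        else:
            depth -= 1          # a closer
            if depth == 0:
                return m.start()
    return -1

def render(user_code: str, escape_closers: bool) -> str:
    """Hypothetical board template for [code=html]...[/code]."""
    if escape_closers:
        user_code = user_code.replace("</code>", "&lt;/code&gt;")
    return '<code type="html">' + user_code + "</code><p>rest of page</p>"

# Option 1: no escaping. The poster's </code> ends the block early,
# so whatever follows it is parsed as ordinary markup (injection).
page = render("x</code><script>evil()</script>", escape_closers=False)
print(outer_close_index(page) < page.rindex("</code>"))

# Option 2: escape the poster's closers. Now a stray opener swallows
# the template's own </code>, and the outer block never closes.
page = render("<code>", escape_closers=True)
print(outer_close_index(page))
```

In this model the first option closes the block before the template's
own closer, and the second leaves the outer block open for the rest of
the page, which are the two failure modes described above.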
Conclusion: it would be a horrible idea to use a cdata'ish feature
with user-provided content (after all, the content is manipulated
server-side, so it can be escaped programmatically, avoiding the need
for such a feature in that context). But still, if someone follows
that horrible idea (Murphy's Law dictates that someone will, sooner or
later, and the vast size of the WWW hints that it would happen sooner
rather than later), the <code type="html"> approach is guaranteed to
raise issues, while the <![CDATA[ approach is not (a mechanism to
sanitize it, while awfully ugly, does exist).

That's the whole thing about the sanitizing issue. It's just a corner
case triggered by such a degree of bad design that the page probably
deserves to miserably break. But CDATA addresses the case, while <code
type="html"> doesn't; so it's a *minor* plus for CDATA over <code...>.

> 2. I don't see most sites that incorporate user-generated content using
> CDATA blocks to render that content. How many sites these days incorporate
> user-generated content without allowing *any* form of markup? Markdown,
> bbcode, etc.? Not in 2010. Instead, sites incorporate the content by
> sanitizing it and then processing the markdown/bbcode/whatever to result in
> HTML. So the CDATA block is no help there. (This doesn't change its
> applicability to Georg's original use case. Again, all I'm addressing here
> is the claim it makes sanitizing easier.)
My apologies, I didn't mean to claim that CDATA makes sanitizing easier
in general. The actual claim intended (despite my bad wording) was
that it makes sanitizing possible in some corner cases where <code
type="html"> makes it impossible.

> So since my position is that I (and others) am not going to use CDATA blocks
> for user-generated content, I still have to sanitize what they provide.
That's what you should do, so go ahead. CDATA is only appropriate as a
shorthand for manually-authored content.
> My concern was that I didn't want new rules to worry about (if the user
> includes CDATA blocks, perhaps as an attack vector), because of the
> thousands of hand-crafted sanitizers out there.
> As I mentioned in my follow-up post (you seem not to have seen it), on
> reflection I'm not sure how much of a problem it is. A sanitizer that does a
> thorough job will (in my view) escape <, &, and > at a minimum, and so a
> user-contributed CDATA block won't be a CDATA block by the time we render
> it, because the < will be an entity. That leaves the ]]> at the end. With a
> good sanitizer, that will become ]]&gt;, but let's assume for the moment
> that we're dealing with a site using an "only okay" sanitizer that only does
> < and &, but not > (there are *lots* of these out there) and so the ]]> is
> left as-is. Unless that site is *also* putting that user-generated content
> in a CDATA block, I *think* that's harmless.
In the cases you describe, CDATA by itself is harmless. Some issues
might arise with weird markup getting through those "partial"
sanitizers, but that's independent of CDATA (it would be the case for
any code that abuses an obscure SGMLish feature that happens to be
implemented by one or more browsers; so it is rather rare, but possible
in theory).

> If the site with the "only
> okay" sanitizer *is* putting it in a CDATA block, well, then they should be
> aware of the issue and deal with it.
The nice thing here is that even in such an ugly case CDATA allows for
a way to have the extraneous "]]>" rendered (without it being taken as
the closer), so even when the code sample contains <![CDATA[ ... ]]>
blocks, things can work smoothly.
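
A minimal sketch of that mechanism in Python, assuming an XML parser
(the standard library's ElementTree) stands in for the UA; wrap_cdata()
is a name made up for this example, not an existing API. Wherever the
sample itself contains "]]>", the function closes the section, writes
"]]&gt;" as ordinary content, and opens a new section.

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"

def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, stepping out of CDATA for every "]]>" in it."""
    pieces = text.split(CDATA_CLOSE)
    # Leave CDATA mode, emit the escaped closer, enter CDATA mode again.
    joiner = CDATA_CLOSE + "]]&gt;" + CDATA_OPEN
    return CDATA_OPEN + joiner.join(pieces) + CDATA_CLOSE

# A code sample that itself contains a CDATA block.
sample = "<![CDATA[ if (a[b[0]]>1 && c<d) ]]>"
fragment = wrap_cdata(sample)

# Round trip: the parser should hand back the sample unchanged.
root = ET.fromstring("<pre>" + fragment + "</pre>")
assert root.text == sample
```

Nothing else in the sample needs escaping: "<" and "&" stay literal
inside the sections, and only the closer gets special treatment.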
> So I don't think it makes sanitizing any easier, but having worked it
> through, I don't *think* it makes it any harder, either, unless you start
> using CDATAs to render user-contributed content, in which case you're aware
> of the new feature and should be dealing with it appropriately.

> On a completely separate point, going back to Georg's use case: How would he
> include characters that can't be expressed except via entities in the
> encoding he's using for his page?
The preferred way would be, by definition, to use an encoding that
supports the characters needed by the document. The common modern
defaults of UTF-8 and UTF-16 support all Unicode characters, so they
are a safe bet.
If that's not possible (for example, code that's part of a template
system where the content author has no control over the template), the
author might fall back to something like this: "]]>&aacute;<![CDATA[".
That's the same trick as when a "]]>" needs to be included.
> If I'm reading this section[1] of the XML
> spec correctly (and that's by no means certain!), you can't use HTML
> entities like &aacute; (not the best example, but you get the idea) in CDATA
> sections (this based on the statement that <<...left angle brackets and
> ampersands may occur in their literal form; they need not (and cannot) be
> escaped using "&lt;" and "&amp;">> the key part of that being "and cannot").
There are two key points here:
1) CDATA is *not* an element. It's more like a "command" that tells
the UA's parser to switch between "normal" and "no-parse" modes or
states. This means that we can switch between modes, using these
commands, whenever we feel like it, without having to bother about the
DOM or the document structure being affected.
2) In order to allow unescaped content, escaping needs to be
explicitly disallowed. This happens for quite a simple reason: if
escapes were allowed (despite not being needed), then the '&'
character would need to be escaped as '&amp;' when it's not introducing
an escape. This means that some escaping would be required within
CDATA, which would hence kill the whole purpose of CDATA.
So, how to address the issue? We can't have an escape within CDATA
mode. No problem: just leave CDATA mode, add the escape, and enter
CDATA mode again. Rinse and repeat as needed.
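
The "leave, escape, re-enter" loop can be sketched in Python as follows.
This is only an illustration under a few assumptions: the helper name
wrap_cdata_for_encoding() is made up, the text is assumed not to contain
"]]>" itself, and a numeric character reference is used instead of a
named entity like &aacute; because the XML parser used for the check
only knows the five predefined XML entities.

```python
import xml.etree.ElementTree as ET

def wrap_cdata_for_encoding(text: str, encoding: str) -> str:
    """Emit text as CDATA, leaving CDATA mode for each character that
    the target encoding cannot represent and writing that character
    as a numeric character reference."""
    out = ["<![CDATA["]
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)                      # representable: keep literal
        except UnicodeEncodeError:
            # Leave CDATA mode, add the escape, enter CDATA mode again.
            out.append("]]>&#%d;<![CDATA[" % ord(ch))
    out.append("]]>")
    return "".join(out)

text = "if (x < 1 && y) print('\u00e1')"       # contains an a-acute
fragment = wrap_cdata_for_encoding(text, "ascii")

fragment.encode("ascii")                        # now fits the page encoding
root = ET.fromstring("<pre>" + fragment + "</pre>")
assert root.text == text
```

Each unrepresentable character costs one close/reopen pair, which is
why a document full of them gains little from CDATA.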

This may seem too cumbersome, but let's keep in mind that the whole
purpose of CDATA is to avoid escaping. If someone is going to need
escaping nevertheless (due to document encoding limitations), then
that person should review whether CDATA is worth the effort for that
particular document.

> But I'm having trouble reconciling that with this section[2] which says
> <<The right angle bracket (>) may be represented using the string "&gt;",
> and must, for compatibility, be escaped using either "&gt;" or a character
> reference when it appears in the string "]]>" in content, when that string
> is not marking the end of a CDATA section.>> You know more about CDATA
> sections than I do, can you clarify that for me?
Don't quote me on this (I have some experience working with CDATA and
other unusual stuff, but I'm not an expert), but I'm quite convinced
that the spec tries to say that:
- When "]]>" is found while in CDATA mode, it causes the browser to
leave CDATA mode.
- When "]]>" is found while *not* in CDATA mode (a.k.a. while in
content), the browser goes nuts.
- When "]]&gt;" is found while in CDATA mode, "]]&gt;" is rendered.
- When "]]&gt;" is found while *not* in CDATA mode, "]]>" is rendered.
So, if you want to render "]]>" literally, you should use "]]&gt;"
outside of CDATA mode (if you are in CDATA mode, you need to leave
that mode, add that string, and then enter back into CDATA mode).
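
These cases can be checked against an XML parser. The sketch below uses
Python's standard ElementTree as a stand-in; a browser parsing HTML may
well behave differently, and parsed_text() is just a helper name made up
for this example.

```python
import xml.etree.ElementTree as ET

def parsed_text(body: str):
    """Text content of <r>body</r>, or None if the parser rejects it."""
    try:
        return ET.fromstring("<r>" + body + "</r>").text
    except ET.ParseError:
        return None

# "]]>" in CDATA mode: leaves CDATA mode.
print(repr(parsed_text("<![CDATA[a]]>b")))        # 'ab'

# "]]&gt;" in CDATA mode: rendered as written, escape and all.
print(repr(parsed_text("<![CDATA[a]]&gt;b]]>")))  # 'a]]&gt;b'

# "]]&gt;" outside CDATA mode: rendered as "]]>".
print(repr(parsed_text("a]]&gt;b")))              # 'a]]>b'

# "]]>" outside CDATA mode: the XML spec does not allow it in content,
# so a strict parser is expected to reject the document.
print(repr(parsed_text("a]]>b")))
```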

Regards,
Eduard Pascual
Received on Thursday, 8 April 2010 17:23:05 UTC