Re: HTML 5 from T.J. Crowder on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: T.J. Crowder <tj@crowdersoftware.com>
Date: Thu, 8 Apr 2010 10:02:48 +0100
To: Eduard Pascual <herenvardo@gmail.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <g2pc95470a1004080202i2f0efdaeo3b054c25af87ffaf@mail.gmail.com>
Hi,

> Honestly, I think you are missing a key point on the CDATA concept:
> when dealing with untrusted content (like user input), CDATA and
> typical sanitizing are *exclusive* alternatives:

No, I'm not missing that point at all. The sole point I have been making is
in relation to your claim that it makes sanitizing easier. I don't think it
does at all; I think it's at best about the same as things are currently,
and possibly worse: a new thing that sanitizers need to add to their
arsenal. Since there are probably thousands of hand-crafted sanitizers
out there (sadly), making new things for them to deal with is something to
approach with caution.

I completely understand your point that *if* you put user-generated content
within a CDATA block, the sanitizing rules change. But

1. I don't think the rules are that much simpler: replacing ]]> vs.
replacing <, &, and (IMHO) >. (You have to deal with characters that need to
be entities or what-have-you in the encoding of the page *regardless*; you
can't dump out characters that are invalid in your encoding in a CDATA block
any more than you can elsewhere.)

2. I don't see most sites that incorporate user-generated content using
CDATA blocks to render that content. How many sites these days incorporate
user-generated content without allowing *any* form of markup? Markdown,
bbcode, etc.? Not in 2010. Instead, sites incorporate the content by
sanitizing it and then processing the markdown/bbcode/whatever to result in
HTML. So the CDATA block is no help there. (This doesn't change its
applicability to Georg's original use case. Again, all I'm addressing here
is the claim it makes sanitizing easier.)
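
To make the comparison in point 1 concrete, here is a minimal Python sketch
of the two approaches side by side. The function names are made up for
illustration, the CDATA replacement is the one Eduard suggested in the
message quoted below, and the round-trip check assumes an XML parser (the
XHTML case), not a non-X HTML parser.

```python
import xml.etree.ElementTree as ET


def escape_markup(text: str) -> str:
    """Conventional sanitizing: escape &, < and > (hypothetical helper)."""
    # '&' must go first, or the '&' in '&lt;' / '&gt;' would be escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def wrap_in_cdata(text: str) -> str:
    """CDATA route: only ']]>' needs handling (hypothetical helper)."""
    # Close the section, emit ']]&gt;' outside it, then reopen a new section.
    return "<![CDATA[" + text.replace("]]>", "]]>]]&gt;<![CDATA[") + "]]>"


sample = "if (a[b[0]]>c && d<e) { ... }"

escaped = escape_markup(sample)
wrapped = wrap_in_cdata(sample)

# Both forms should parse back to the original text in an XML parser.
assert ET.fromstring("<p>" + escaped + "</p>").text == sample
assert ET.fromstring("<p>" + wrapped + "</p>").text == sample
```

Either way it's a handful of string replacements, which is why I don't see
one as meaningfully simpler than the other. Neither function does anything
about characters that can't be represented in the page's encoding.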

So since I (like others) am not going to use CDATA blocks for
user-generated content, I still have to sanitize what users provide. My
concern was that I didn't want new rules to worry about (if the user
includes CDATA blocks, perhaps as an attack vector), because of the
thousands of hand-crafted sanitizers out there.

As I mentioned in my follow-up post (you seem not to have seen it), on
reflection I'm not sure how much of a problem it is. A sanitizer that does a
thorough job will (in my view) escape <, &, and > at a minimum, and so a
user-contributed CDATA block won't be a CDATA block by the time we render
it, because the < will be an entity. That leaves the ]]> at the end. With a
good sanitizer, that will become ]]&gt;, but let's assume for the moment
that we're dealing with a site using an "only okay" sanitizer that only does
< and &, but not > (there are *lots* of these out there) and so the ]]> is
left as-is. Unless that site is *also* putting that user-generated content
in a CDATA block, I *think* that's harmless. If the site with the "only
okay" sanitizer *is* putting it in a CDATA block, well, then they should be
aware of the issue and deal with it.
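
As a rough illustration of that reasoning (a sketch only; both sanitizers
here are hypothetical stand-ins, not anyone's real code), this is what
happens to a user-contributed CDATA block under a thorough sanitizer and an
"only okay" one:

```python
def thorough_sanitize(text: str) -> str:
    """Hypothetical thorough sanitizer: escapes &, < and >."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def only_okay_sanitize(text: str) -> str:
    """Hypothetical "only okay" sanitizer: escapes & and < but not >."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


# A user tries to smuggle in a CDATA block.
user_input = "<![CDATA[<script>alert(1)</script>]]>"

thorough = thorough_sanitize(user_input)
only_okay = only_okay_sanitize(user_input)

# In both cases the opener's '<' is now an entity, so no CDATA block starts.
assert "<![CDATA[" not in thorough
assert "<![CDATA[" not in only_okay

# The thorough sanitizer also neutralizes the closing sequence...
assert thorough.endswith("]]&gt;")
# ...while the "only okay" one leaves a bare ']]>' behind.
assert only_okay.endswith("]]>")
```

The leftover ]]> in the second case only matters if the site then places
that output inside a CDATA block of its own.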

So I don't think it makes sanitizing any easier, but having worked it
through, I don't *think* it makes it any harder, either, unless you start
using CDATAs to render user-contributed content, in which case you're aware
of the new feature and should be dealing with it appropriately.

On a completely separate point, going back to Georg's use case: How would he
include characters that can't be expressed except via entities in the
encoding he's using for his page? If I'm reading this section[1] of the XML
spec correctly (and that's by no means certain!), you can't use HTML
entities like &aacute; (not the best example, but you get the idea) in CDATA
sections (this based on the statement that <<*...left angle brackets and
ampersands may occur in their literal form; they need not (and cannot) be
escaped using "&lt;" and "&amp;"*>> the key part of that being "and
cannot"). But I'm having trouble reconciling that with this section[2] which
says <<*The right angle bracket (>) may be represented using the string
"&gt;", and must, for compatibility, be escaped using either "&gt;" or a
character reference when it appears in the string "]]>" in content, when
that string is not marking the end of a CDATA section.*>> You know more
about CDATA sections than I do; can you clarify that for me?

In any case, as I've said before, I can certainly see the case for CDATA
over a new tag with special rules. *If* something needs to be done at all.

[1] http://www.w3.org/TR/REC-xml/#sec-cdata-sect
[2] http://www.w3.org/TR/REC-xml/#dt-chardata
--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com


On 8 April 2010 01:13, Eduard Pascual <herenvardo@gmail.com> wrote:

> Honestly, I think you are missing a key point on the CDATA concept:
> when dealing with untrusted content (like user input), CDATA and
> typical sanitizing are *exclusive* alternatives: you shouldn't
> sanitize content that will go inside a <![CDATA[ ... ]]> block, other
> than replacing any "]]>" as I already described; and you shouldn't
> sanitize the resulting CDATA block either.
> The whole beautiful thing about CDATA is that nothing, absolutely
> nothing, after the <![CDATA[ opener is parsed at all, until the
> closing ]]> is found. That's exactly what CDATA does; nothing more,
> and nothing less. And that's why it doesn't need sanitizing: the
> content is taken as plain text, not as markup, regardless of how
> markup'ish it may look. The replacement "]]>" => "]]>]]&gt;<![CDATA["
> I suggested previously isn't really an "escape", in the pure sense of
> the term: it just closes the CDATA block (with the "]]>" part),
> inserts the "]]&gt;" to get it rendered as "]]>" (since it's outside
> the CDATAs, the entity reference is expanded as usual), and then opens
> a new CDATA block (with the remaining "<![CDATA[" part of the
> replacement) to enclose the remaining content. That's why I said that
> it is "dodged" rather than "escaped".
>
> On Wed, Apr 7, 2010 at 10:45 PM, T.J. Crowder <tj@crowdersoftware.com>
> wrote:
> >> For content generated programmatically, it's quite indifferent to use
> >> CDATA or to escape stuff.
> >
> > No, there's a very large difference. Currently, if I have user-generated
> > content,
> It seems I owe you a bit of clarification here: I tried to
> differentiate between purely programmatic content and user-provided
> content. The former would include stuff such as a date string
> generated from a timestamp, an include that chooses one of a reduced
> set of static files, or geolocation info from the client's IP address,
> to give some examples; this is stuff generated by scripts without any
> direct contribution from the user. In these cases, the program/script
> will already know what kind of content it's generating, and what
> it should escape; and most server-side scripting technologies provide
> a variety of facilities to handle escaping when it is needed; so CDATA
> should be irrelevant for that kind of content.
>
> For the case of user-provided content, CDATA isn't that much better
> than a good sanitizer, but it's still as good as a sanitizer: putting
> the user stuff inside the CDATA block ensures that nothing within it
> will have any special meaning for the browser. You only need to
> prevent the user from closing the CDATA block with a "]]>" (which
> would allow adding actual code afterwards), which is achieved with a
> single string replacement operation. However, the goal of the CDATA
> proposal is *not* to sanitize content.
>
> Looking at the original post:
> On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> > If you want to present HTML-code in a browser, you have to write &lt;
> > instead of <.
> You see, I was not making my assumption blindly. I was just
> trying to address the need expressed by Georg. Since escaping wouldn't
> be a real problem if the content was script-generated, I *did* assume
> that the focus was on hand-authored content.
> I am suggesting CDATA because it was created exactly to address this
> kind of need (saving the pain of manually escaping everything when
> lots of special chars need to be rendered as content).
>
> There is no mention of user-provided content in Georg's post. I was
> the one who mentioned it. And I did that to highlight that Georg's
> proposed solution (ignoring special chars within a <code> element with
> a specific attribute) would open up injection risks if used for
> user-provided content (and Murphy's Law requires us to assume that,
> sooner or later, someone would do that if it's allowed). However, this
> is only a side benefit of CDATA over Georg's proposal.
>
> Other benefits of CDATA are:
> - Already works in XHTML (only for documents served with an XHTML
> media type, such as application/xhtml+xml, and hence not for IE, which
> doesn't support XHTML media types).
> - Is valid (i.e., allowed as per the specs) for all versions of X/HTML,
> with the sole exception of "non-X" HTML5, which disallows it quite
> explicitly. All versions of XHTML support it because it's defined in
> XML itself; and pre-HTML5 versions support it because they are spec'ed
> as SGML-based, and SGML also defines <![CDATA[...]]>.
> - It allows for greater flexibility: CDATA blocks aren't bound to any
> specific element. Actually, an element may contain both CDATA blocks
> and structured children. If the content that needs to be CDATA'ed
> actually matches an element, it's enough to wrap the block with such
> element.
> - In the event a CDATA block needs to contain the closing "]]>"
> sequence, this can be achieved (even if the syntax to do that is a bit
> verbose, this is a corner case and is solvable). On the other hand,
> for the <code type="html"> suggestion, it's impossible to define it in
> a way that allows all potential uses of "</code>" within the element
> (note that it would prevent escaping the "<" on that tag, since entity
> references would be taken literally instead of as actual references).
>
> The main drawback of CDATA in non-X HTML is the lack of browser
> support (even though it has been part of the HTML standards since HTML 2
> or even earlier). However, any possible solution to this use-case will
> suffer from the same issue, so we will have to wait for wide browser
> support in any case.
>
> Note: I know that HTML5 is *not* SGML, and I understand the reasons
> for that choice. I'm not asking for HTML5 to be SGML; but only
> proposing to scavenge one specific feature from SGML because it is a
> good solution to the use-case described, and it has the benefits
> listed above.
>
> Regards,
> Eduard Pascual
>
Received on Thursday, 8 April 2010 09:11:44 UTC