Re: HTML 5 from T.J. Crowder on 2010-04-09 (public-html-comments@w3.org from April 2010)

From: T.J. Crowder <tj@crowdersoftware.com>
Date: Fri, 9 Apr 2010 08:02:43 +0100
To: Eduard Pascual <herenvardo@gmail.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <j2tc95470a1004090002s2728cad4gaed81dbbfd6a99b@mail.gmail.com>
> That depends on the cases you are looking at. As I already mentioned,
> the injection concern arises only for a corner subset of cases; which
> are in addition blatant examples of bad design.


Not at all, I've given a perfectly acceptable example (wrapping user content
in a CDATA -- which actually I thought was *your* suggestion -- to avoid
having to deal with it at all other than ]]>). It's not something I would do
(because I want markup on contributed content), but I wouldn't call it bad
design.

> That's the whole thing about the sanitizing issue. It's just a corner
> case triggered by such a degree of bad design that the page probably
> deserves to miserably break. But CDATA addresses the case, while <code
> type="html"> doesn't; so it's a *minor* plus for CDATA over <code...>.


Agreed we've put the issue to rest. Disagree that escaping the content in
some element (which I don't like for other reasons) would be impossible. But
we can just disagree.

I think you're right about what the spec is trying to say about > and CDATA
sections. Could use some wordsmithing. :-) But basically: There is no way
*at all* to render ]]> within a CDATA section, because it ends the section
if it's unescaped, and there is no escaping of any kind within a CDATA
section. The only thing you can do is end the section, emit the string, and
start a new section (your "]]>]]&gt;<![CDATA["), which has implications for
anything working with the document tree, which will not see the text as
continuous. (That's not necessarily a problem.)
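
For illustration only, here is a minimal Python sketch of that
end/emit/restart trick. The function name to_cdata is made up for this
example, and xml.dom.minidom is used merely to show how one XML parser
exposes the result; other parsers and HTML user agents may behave differently.

```python
import xml.dom.minidom


def to_cdata(text: str) -> str:
    """Wrap text in a CDATA section (hypothetical helper).

    Each literal "]]>" is handled by ending the section, emitting the
    escaped string "]]&gt;" as ordinary content, and opening a new section.
    """
    return "<![CDATA[" + text.replace("]]>", "]]>]]&gt;<![CDATA[") + "]]>"


sample = "a ]]> b"
markup = "<code>" + to_cdata(sample) + "</code>"

# Parse the fragment and look at what ends up in the document tree.
doc = xml.dom.minidom.parseString(markup)
nodes = doc.documentElement.childNodes

# The text survives intact once the pieces are joined back together...
recovered = "".join(node.data for node in nodes)

# ...but the tree holds it as several sibling nodes, not one continuous run.
for node in nodes:
    print(type(node).__name__, repr(node.data))
```

Code that reads only the first child node would see just part of the text,
which is the "not continuous" implication mentioned above.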

The lack of character entities has implications for Georg's authoring tool (a
simple text editor, I guess, since otherwise it would be doing all of the
escaping for him anyway).

I recommend we three (you, Arthur, and I) leave it and encourage someone
else to comment. For my part, I see very limited utility (but yes, a little)
in a mechanism along these lines, and if a mechanism is to be defined at
all, I think it makes sense to use one that's already defined (CDATA) as
opposed to creating something new.
--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com



On 8 April 2010 18:22, Eduard Pascual <herenvardo@gmail.com> wrote:

> On Thu, Apr 8, 2010 at 11:02 AM, T.J. Crowder <tj@crowdersoftware.com>
> wrote:
> > 1. I don't think the rules are that much simpler, replacing ]]> vs.
> > replacing <, &, and (IMHO) >. (You have to deal with characters that need to
> > be entities or what-have-you in the encoding of the page *regardless*; you
> > can't dump out characters that are invalid in your encoding in a CDATA block
> > any more than you can elsewhere).
> That depends on the cases you are looking at. As I already mentioned,
> the injection concern arises only for a corner subset of cases; which
> are in addition blatant examples of bad design. That's why I have said
> many times that it's just a fool-proofing task.
> The specific, nearly insane case, could be described as having
> "hybrid" content where a fragment would go inside a cdata'ish block
> but the rest wouldn't. For example, imagine a discussion board that
> attempts to implement a [code=html] bbcode as a <code type="html">
> element: the part that would go inside [code...]...[/code] shouldn't
> be escaped, since doing so would cause the page to render the escapes
> rather than the entities they represent. This means that the poster
> could add extraneous </code> tags to trigger unexpected behavior of
> the page/UA. Of course, it's possible to escape the </code>'s there to
> prevent this, but then an opening <code> would break everything on the
> page below the [code] block, since the </code> corresponding to the
> closing bbtag would be paired with the inner <code>, not with the
> original opener (since the inner's closer would have been escaped). We
> could nest further "if"s to it, but at the end there is no approach
> that prevents both injection and page breaking.
> Conclusion: it would be a horrible idea to use a cdata'ish feature
> with user-provided content (after all, the content is manipulated
> server-side, so stuff can be programmatically escaped avoiding the need
> for such a feature on that context). But still, if someone follows
> that horrible idea (Murphy's Law dictates that someone will, sooner or
> later; and the vast size of the WWW hints that it would happen sooner
> rather than later); the <code type="html"> approach is guaranteed to
> raise issues; while the <![CDATA[ approach is not guaranteed (a
> mechanism to sanitize it, while awfully ugly, does exist).
>
> That's the whole thing about the sanitizing issue. It's just a corner
> case triggered by such a degree of bad design that the page probably
> deserves to miserably break. But CDATA addresses the case, while <code
> type="html"> doesn't; so it's a *minor* plus for CDATA over <code...>.
>
> > 2. I don't see most sites that incorporate user-generated content using
> > CDATA blocks to render that content. How many sites these days incorporate
> > user-generated content without allowing *any* form of markup? Markdown,
> > bbcode, etc.? Not in 2010. Instead, sites incorporate the content by
> > sanitizing it and then processing the markdown/bbcode/whatever to result in
> > HTML. So the CDATA block is no help there. (This doesn't change its
> > applicability to Georg's original use case. Again, all I'm addressing here
> > is the claim it makes sanitizing easier.)
> My apologies, I didn't mean to claim that CDATA makes sanitizing easier
> in general. The actual claim intended (despite my bad wording) was
> that it makes sanitizing possible on some corner cases where <code
> type="html"> makes it impossible.
>
> > So since my position is that I (and others) am not going to use CDATA blocks
> > for user-generated content, I still have to sanitize what they provide.
> That's what you should do, so go ahead. CDATA is only appropriate as a
> shorthand for manually-authored content.
> > My concern was that I didn't want new rules to worry about (if the user
> > includes CDATA blocks, perhaps as an attack vector), because of the
> > thousands of hand-crafted sanitizers out there.
> > As I mentioned in my follow-up post (you seem not to have seen it), on
> > reflection I'm not sure how much of a problem it is. A sanitizer that does a
> > thorough job will (in my view) escape <, &, and > at a minimum, and so a
> > user-contributed CDATA block won't be a CDATA block by the time we render
> > it, because the < will be an entity. That leaves the ]]> at the end. With a
> > good sanitizer, that will become ]]&gt;, but let's assume for the moment
> > that we're dealing with a site using an "only okay" sanitizer that only does
> > < and &, but not > (there are *lots* of these out there) and so the ]]> is
> > left as-is. Unless that site is *also* putting that user-generated content
> > in a CDATA block, I *think* that's harmless.
> On the cases you describe, CDATA on itself is harmless. Some issues
> might arise with some weird markup through those "partial" sanitizers,
> but that's independent from CDATA (that'd be the case for code that
> abuses any obscure SGMLish feature that happens to be implemented by
> one or more browsers; so it is rather rare, but possible in theory).
>
> > If the site with the "only
> > okay" sanitizer *is* putting it in a CDATA block, well, then they should be
> > aware of the issue and deal with it.
> The nice thing here is that even on such an ugly case CDATA allows for
> a way to have the extraneous "]]>" rendered (and without it being
> taken as the closer), so even when the code sample contains <![CDATA[
> ... ]]> blocks things can work smoothly.
> > So I don't think it makes sanitizing any easier, but having worked it
> > through, I don't *think* it makes it any harder, either, unless you start
> > using CDATAs to render user-contributed content, in which case you're aware
> > of the new feature and should be dealing with it appropriately.
>
> > On a completely separate point, going back to Georg's use case: How would he
> > include characters that can't be expressed except via entities in the
> > encoding he's using for his page?
> The preferred way would be, by definition, to use an encoding that
> supports the characters needed by the document. The common modern
> defaults of utf-8 and utf-16 support all UNICODE characters, so they
> are a safe bet.
> If that's not possible, (for example, code that's part of a template
> system where the content author has no control over the template), the
> author might fall back to something like this: "]]>&aacute;<![CDATA[".
> That's the same trick as when a "]]>" needs to be included.
> > If I'm reading this section[1] of the XML
> > spec correctly (and that's by no means certain!), you can't use HTML
> > entities like &aacute; (not the best example, but you get the idea) in CDATA
> > sections (this based on the statement that <<...left angle brackets and
> > ampersands may occur in their literal form; they need not (and cannot) be
> > escaped using "&lt;" and "&amp;">> the key part of that being "and cannot").
> There are two key points here:
> 1) CDATA is *not* an element. It's more like a "command" that tells
> the UA's parser to switch between "normal" and "no-parse" modes or
> states. This means that we can switch between modes, using these
> commands, whenever we feel like it, without having to bother about the
> DOM or the document structure being affected.
> 2) In order to allow unescaped content, escaping needs to be
> explicitly disallowed. This happens for a quite simple reason: if
> escapes were allowed (despite not being needed), then the '&'
> character would need to be escaped as '&amp;' when it's not introducing
> an escape. This means that some escaping would be required within
> CDATA, which would hence kill the whole purpose of CDATA.
> So, how to address the issue? We can't have an escape within CDATA
> mode. No problem: just leave CDATA mode, add the escape, and enter
> CDATA mode again. Rinse and repeat as needed.
>
> This may seem too cumbersome, but let's keep in mind that the whole
> purpose of CDATA is to avoid escaping. If someone is going to need
> escaping nevertheless (due to document encoding limitations), then
> that person should review whether CDATA is worth the effort for that
> particular document.
>
> > But I'm having trouble reconciling that with this section[2] which says
> > <<The right angle bracket (>) may be represented using the string "&gt;",
> > and must, for compatibility, be escaped using either "&gt;" or a character
> > reference when it appears in the string "]]>" in content, when that string
> > is not marking the end of a CDATA section.>> You know more about CDATA
> > sections than I do, can you clarify that for me?
> Don't quote me on this (I have some experience working with CDATA and
> other unusual stuff; but I'm not an expert), but I'm quite convinced
> that the spec tries to say that:
> - When "]]>" is found while in CDATA mode, it causes the browser to
> leave CDATA mode.
> - When "]]>" is found while *not* in CDATA mode (a.k.a. while in
> content), the browser goes nuts.
> - When "]]&gt;" is found while in CDATA mode, "]]&gt;" is rendered.
> - When "]]&gt;" is found while *not* in CDATA mode, "]]>" is rendered.
> So, if you want to render "]]>" literally, you should use "]]&gt;"
> outside of CDATA mode (if you are in CDATA mode, you need to leave
> that mode, add that string, and then enter back into CDATA mode).
>
> Regards,
> Eduard Pascual
>
Received on Friday, 9 April 2010 07:03:38 UTC