- From: T.J. Crowder <tj@crowdersoftware.com>
- Date: Fri, 9 Apr 2010 08:02:43 +0100
- To: Eduard Pascual <herenvardo@gmail.com>
- Cc: gesteehr@googlemail.com, public-html-comments@w3.org
- Message-ID: <j2tc95470a1004090002s2728cad4gaed81dbbfd6a99b@mail.gmail.com>
> That depends on the cases you are looking at. As I already mentioned, the injection concern arises only for a corner subset of cases; which are in addition blatant examples of bad design.

Not at all, I've given a perfectly acceptable example (wrapping user content in a CDATA -- which actually I thought was *your* suggestion -- to avoid having to deal with it at all other than ]]>). It's not something I would do (because I want markup on contributed content), but I wouldn't call it bad design.

> That's the whole thing about the sanitizing issue. It's just a corner case triggered by such a degree of bad design that the page probably deserves to miserably break. But CDATA addresses the case, while <code type="html"> doesn't; so it's a *minor* plus for CDATA over <code...>.

Agreed we've put the issue to rest. Disagree that escaping the content in some element (which I don't like for other reasons) would be impossible. But we can just disagree.

I think you're right about what the spec is trying to say about > and CDATA sections. Could use some wordsmithing. :-) But basically: There is no way *at all* to render ]]> within a CDATA section, because it ends the section if it's unescaped, and there is no escaping of any kind within a CDATA section. The only thing you can do is end the section, emit the string, and start a new section (your "]]>]]&gt;<![CDATA["), which has implications for anything working with the document tree, which will not see the text as continuous. (That's not necessarily a problem.)

The lack of character entities has implications for Georg's authoring tool (a simple text editor, I guess, since otherwise it would be doing all of the escaping for him anyway).

I recommend we three (you, Arthur, and I) leave it and encourage someone else to comment.
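To make the "end the section, emit the string, start a new section" point concrete, here is a minimal sketch. It assumes Python's bundled XML DOM parser as a stand-in for a CDATA-aware parser (HTML parsers may behave differently), and the helper name wrap_cdata and the sample string are made up for illustration:

```python
from xml.dom import minidom


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting the section around every literal "]]>".

    Hypothetical helper: each "]]>" closes the current section, is emitted
    as "]]&gt;" in ordinary content, and a new section is then opened.
    """
    return "<![CDATA[" + text.replace("]]>", "]]>]]&gt;<![CDATA[") + "]]>"


# Made-up sample that happens to contain the "]]>" sequence.
sample = "if (a[b[0]]>1) { run(); }"
wrapped = wrap_cdata(sample)

# <code> is only a wrapper element so the fragment parses as a document.
doc = minidom.parseString("<code>" + wrapped + "</code>")
nodes = doc.documentElement.childNodes

# The tree holds several separate nodes rather than one continuous text.
pieces = [node.data for node in nodes]
round_trip = "".join(pieces)

print(len(pieces), "nodes:", pieces)
print(round_trip == sample)
```

Anything that walks the tree has to concatenate the pieces itself to get the original string back, which is the "will not see the text as continuous" implication.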
For my part, I see very limited utility (but yes, a little) in a mechanism along these lines, and if a mechanism is to be defined at all, I think it makes sense to use one that's already defined (CDATA) as opposed to creating something new.

--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com

On 8 April 2010 18:22, Eduard Pascual <herenvardo@gmail.com> wrote:

> On Thu, Apr 8, 2010 at 11:02 AM, T.J. Crowder <tj@crowdersoftware.com> wrote:
> > 1. I don't think the rules are that much simpler, replacing ]]> vs. replacing <, &, and (IMHO) >. (You have to deal with characters that need to be entities or what-have-you in the encoding of the page *regardless*; you can't dump out characters that are invalid in your encoding in a CDATA block any more than you can elsewhere).
>
> That depends on the cases you are looking at. As I already mentioned, the injection concern arises only for a corner subset of cases; which are in addition blatant examples of bad design. That's why I have said many times that it's just a fool-proofing task.
>
> The specific, nearly insane case, could be described as having "hybrid" content where a fragment would go inside a cdata'ish block but the rest wouldn't. For example, imagine a discussion board that attempts to implement a [code=html] bbcode as a <code type="html"> element: the part that would go inside [code...]...[/code] shouldn't be escaped, since doing so would cause the page to render the escapes rather than the entities they represent. This means that the poster could add extraneous </code> tags to trigger unexpected behavior of the page/UA.
> Of course, it's possible to escape the </code>'s there to prevent this, but then an opening <code> would break everything on the page below the [code] block, since the </code> corresponding to the closing bbtag would be paired with the inner <code>, not with the original opener (since the inner's closer would have been escaped). We could nest further "if"s to it, but at the end there is no approach that prevents both injection and page breaking.
>
> Conclusion: it would be a horrible idea to use a cdata'ish feature with user-provided content (after all, the content is manipulated server-side, so stuff can be programmatically escaped, avoiding the need for such a feature in that context). But still, if someone follows that horrible idea (Murphy's Law dictates that someone will, sooner or later; and the vast size of the WWW hints that it would happen sooner rather than later); the <code type="html"> approach is guaranteed to raise issues; while the <![CDATA[ approach is not guaranteed (a mechanism to sanitize it, while awfully ugly, does exist).
>
> That's the whole thing about the sanitizing issue. It's just a corner case triggered by such a degree of bad design that the page probably deserves to miserably break. But CDATA addresses the case, while <code type="html"> doesn't; so it's a *minor* plus for CDATA over <code...>.
>
> > 2. I don't see most sites that incorporate user-generated content using CDATA blocks to render that content. How many sites these days incorporate user-generated content without allowing *any* form of markup? Markdown, bbcode, etc.? Not in 2010. Instead, sites incorporate the content by sanitizing it and then processing the markdown/bbcode/whatever to result in HTML. So the CDATA block is no help there. (This doesn't change its applicability to Georg's original use case. Again, all I'm addressing here is the claim it makes sanitizing easier.)
> My apologies, I didn't mean to claim that CDATA makes sanitizing easier in general. The actual claim intended (despite my bad wording) was that it makes sanitizing possible in some corner cases where <code type="html"> makes it impossible.
>
> > So since my position is that I (and others) am not going to use CDATA blocks for user-generated content, I still have to sanitize what they provide.
>
> That's what you should do, so go ahead. CDATA is only appropriate as a shorthand for manually-authored content.
>
> > My concern was that I didn't want new rules to worry about (if the user includes CDATA blocks, perhaps as an attack vector), because of the thousands of hand-crafted sanitizers out there.
> >
> > As I mentioned in my follow-up post (you seem not to have seen it), on reflection I'm not sure how much of a problem it is. A sanitizer that does a thorough job will (in my view) escape <, &, and > at a minimum, and so a user-contributed CDATA block won't be a CDATA block by the time we render it, because the < will be an entity. That leaves the ]]> at the end. With a good sanitizer, that will become ]]&gt;, but let's assume for the moment that we're dealing with a site using an "only okay" sanitizer that only does < and &, but not > (there are *lots* of these out there) and so the ]]> is left as-is. Unless that site is *also* putting that user-generated content in a CDATA block, I *think* that's harmless.
>
> On the cases you describe, CDATA by itself is harmless. Some issues might arise with some weird markup through those "partial" sanitizers, but that's independent from CDATA (that'd be the case for code that abuses any obscure SGMLish feature that happens to be implemented by one or more browsers; so it is rather rare, but possible in theory).
>
> > If the site with the "only okay" sanitizer *is* putting it in a CDATA block, well, then they should be aware of the issue and deal with it.
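The difference between the "thorough" and the "only okay" sanitizer can be sketched as follows. This is an illustration only: both function names and the sample input are invented, and a real sanitizer would do far more than replace three characters.

```python
import html


def thorough_sanitize(text: str) -> str:
    """Escape &, < and > (html.escape with quote=False does exactly these)."""
    return html.escape(text, quote=False)


def only_okay_sanitize(text: str) -> str:
    """Escape & and < but leave > alone, like the partial sanitizers described."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


# Made-up user submission that tries to smuggle in a CDATA block.
user_input = "<![CDATA[<b>hello</b>]]>"

thorough = thorough_sanitize(user_input)
partial = only_okay_sanitize(user_input)

print(thorough)
print(partial)
```

In both outputs the opening `<![CDATA[` no longer starts with a literal `<`, so it is not a CDATA opener any more; only the partial sanitizer leaves a bare `]]>` behind, which matters only if the page then places that output inside its own CDATA section.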
> The nice thing here is that even in such an ugly case CDATA allows for a way to have the extraneous "]]>" rendered (and without it being taken as the closer), so even when the code sample contains <![CDATA[ ... ]]> blocks things can work smoothly.
>
> > So I don't think it makes sanitizing any easier, but having worked it through, I don't *think* it makes it any harder, either, unless you start using CDATAs to render user-contributed content, in which case you're aware of the new feature and should be dealing with it appropriately.
> >
> > On a completely separate point, going back to Georg's use case: How would he include characters that can't be expressed except via entities in the encoding he's using for his page?
>
> The preferred way would be, by definition, to use an encoding that supports the characters needed by the document. The common modern defaults of utf-8 and utf-16 support all UNICODE characters, so they are a safe bet.
>
> If that's not possible (for example, code that's part of a template system where the content author has no control over the template), the author might fall back to something like this: "]]>&aacute;<![CDATA[". That's the same trick as when a "]]>" needs to be included.
>
> > If I'm reading this section[1] of the XML spec correctly (and that's by no means certain!), you can't use HTML entities like &aacute; (not the best example, but you get the idea) in CDATA sections (this based on the statement that <<...left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using "&lt;" and "&amp;">> the key part of that being "and cannot").
>
> There are two key points here:
>
> 1) CDATA is *not* an element. It's more like a "command" that tells the UA's parser to switch between "normal" and "no-parse" modes or states.
> This means that we can switch between modes, using these commands, whenever we feel like it, without having to bother about the DOM or the document structure being affected.
>
> 2) In order to allow unescaped content, escaping needs to be explicitly disallowed. This happens for a quite simple reason: if escapes were allowed (despite not being needed), then the '&' character would need to be escaped as '&amp;' when it's not introducing an escape. This means that some escaping would be required within CDATA, which would hence kill the whole purpose of CDATA.
>
> So, how to address the issue? We can't have an escape within CDATA mode. No problem: just leave CDATA mode, add the escape, and enter CDATA mode again. Rinse and repeat as needed.
>
> This may seem too cumbersome, but let's keep in mind that the whole purpose of CDATA is to avoid escaping. If someone is going to need escaping nevertheless (due to document encoding limitations), then that person should review whether CDATA is worth the effort for that particular document.
>
> > But I'm having trouble reconciling that with this section[2] which says <<The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.>> You know more about CDATA sections than I do, can you clarify that for me?
>
> Don't quote me on this (I have some experience working with CDATA and other unusual stuff; but I'm not an expert), but I'm quite convinced that the spec tries to say that:
> - When "]]>" is found while in CDATA mode, it causes the browser to leave CDATA mode.
> - When "]]>" is found while *not* in CDATA mode (a.k.a. while in content), the browser goes nuts.
> - When "]]&gt;" is found while in CDATA mode, "]]&gt;" is rendered.
> - When "]]&gt;" is found while *not* in CDATA mode, "]]>" is rendered.
>
> So, if you want to render "]]>" literally, you should use "]]&gt;" outside of CDATA mode (if you are in CDATA mode, you need to leave that mode, add that string, and then enter back into CDATA mode).
>
> Regards,
> Eduard Pascual
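The four cases in the list above can be checked against an XML parser. What follows is a rough sketch using Python's standard ElementTree module; it says nothing definite about how HTML user agents treat the same input. The wrapper element <r> and the helper name rendered_text are made up for the example.

```python
import xml.etree.ElementTree as ET


def rendered_text(fragment: str) -> str:
    """Parse the fragment inside a made-up <r> wrapper and return its text."""
    return ET.fromstring("<r>" + fragment + "</r>").text


# Case 1: "]]>" while in CDATA mode ends the section and is not rendered.
case1 = rendered_text("<![CDATA[abc]]>def")

# Case 2: a bare "]]>" in ordinary content is rejected by this XML parser.
try:
    rendered_text("abc]]>def")
    case2_rejected = False
except ET.ParseError:
    case2_rejected = True

# Case 3: "]]&gt;" while in CDATA mode is kept literally, ampersand and all.
case3 = rendered_text("<![CDATA[abc]]&gt;def]]>")

# Case 4: "]]&gt;" in ordinary content is decoded to "]]>".
case4 = rendered_text("abc]]&gt;def")

print(repr(case1), case2_rejected, repr(case3), repr(case4))
```

In this parser "goes nuts" takes the form of a parse error, so the practical rule is the same: write "]]&gt;" outside the section whenever a literal "]]>" has to appear.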
Received on Friday, 9 April 2010 07:03:38 UTC