- From: Eduard Pascual <herenvardo@gmail.com>
- Date: Thu, 8 Apr 2010 02:13:51 +0200
- To: "T.J. Crowder" <tj@crowdersoftware.com>
- Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Honestly, I think you are missing a key point about the CDATA concept: when dealing with untrusted content (like user input), CDATA and typical sanitizing are *exclusive* alternatives. You shouldn't sanitize content that will go inside a <![CDATA[ ... ]]> block, other than replacing any "]]>" as I already described; and you shouldn't sanitize the resulting CDATA block either.

The whole beautiful thing about CDATA is that nothing, absolutely nothing, after the <![CDATA[ opener is parsed at all until the closing ]]> is found. That's exactly what CDATA does; nothing more, and nothing less. And that's why it doesn't need sanitizing: the content is taken as plain text, not as markup, regardless of how markup'ish it may look.

The replacement "]]>" => "]]>]]&gt;<![CDATA[" I suggested previously isn't really an "escape" in the pure sense of the term: it just closes the CDATA block (with the "]]>" part), inserts "]]&gt;" so that it gets rendered as "]]>" (since it's outside the CDATA blocks, the entity reference is expanded as usual), and then opens a new CDATA block (with the remaining "<![CDATA[" part of the replacement) to enclose the rest of the content. That's why I said that it is "dodged" rather than "escaped".

On Wed, Apr 7, 2010 at 10:45 PM, T.J. Crowder <tj@crowdersoftware.com> wrote:
>> For content generated programatically, it's quite indifferent to use
>> CDATA or to escape stuff.
>
> No, there's a very large difference. Currently, if I have user-generated
> content,

It seems I owe you a bit of clarification here: I tried to differentiate between purely programmatic content and user-provided content. The former would include stuff such as a date string generated from a timestamp, an include that chooses one of a small set of static files, or geolocation info derived from the client's IP address, to give some examples; this is stuff generated by scripts without any direct contribution from the user.
In these cases, the program/script will already know what kind of content it's generating and what it should escape, and most server-side scripting technologies provide a variety of facilities to handle escaping when it is needed; so CDATA should be irrelevant for that kind of content.

For the case of user-provided content, CDATA isn't that much better than a good sanitizer, but it's still as good as a sanitizer: putting the user stuff inside the CDATA block ensures that nothing within it will have any special meaning for the browser. You only need to prevent the user from closing the CDATA block with a "]]>" (which would allow adding actual code afterwards), which is achieved with a single string replacement operation.

However, the goal of the CDATA proposal is *not* to sanitize content. Looking at the original post:

On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> If you want to present HTML-code in a browser, you have to write &lt;
> instead of <.

You see, I was not making my assumption blindly. I was just trying to address the need described by Georg. Since escaping wouldn't be a real problem if the content was script-generated, I *did* assume that the focus was on hand-authored content. I am suggesting CDATA because it was created exactly to address this kind of need (sparing the pain of manually escaping everything when lots of special chars need to be rendered as content).

There is no mention of user-provided content in Georg's post. I was the one who mentioned it. And I did that to highlight that Georg's proposed solution (ignoring special chars within a <code> element with a specific attribute) would open up injection risks if used for user-provided content (and Murphy's Law requires us to assume that, sooner or later, someone would do that if it's allowed). However, this is only a side benefit of CDATA over Georg's proposal.
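
For illustration, the "]]>" replacement discussed in this message can be sketched in a few lines of Python. This is only a minimal sketch, not code from the thread: the function names wrap_in_cdata and parse_back are hypothetical, and the round-trip check uses Python's standard XML parser as a stand-in for a browser handling a document served as XHTML.

```python
import xml.etree.ElementTree as ET


def wrap_in_cdata(text: str) -> str:
    """Wrap untrusted text in a CDATA block (hypothetical helper).

    Every "]]>" in the input is "dodged": the current block is closed,
    "]]&gt;" is emitted outside of any CDATA block (where the entity
    reference is expanded as usual), and a new block is opened for the
    rest of the content.
    """
    dodged = text.replace("]]>", "]]>]]&gt;<![CDATA[")
    return "<![CDATA[" + dodged + "]]>"


def parse_back(fragment: str) -> str:
    """Parse the fragment inside a dummy root element and return its text.

    Assumption: an XML parser is used here only to check the round trip.
    """
    root = ET.fromstring("<r>" + fragment + "</r>")
    return root.text or ""


user_input = 'a]]>b <script>alert("x")</script> &amp;'
wrapped = wrap_in_cdata(user_input)

# The parser sees no child elements, only the original text.
assert parse_back(wrapped) == user_input

print(wrap_in_cdata("a]]>b"))  # <![CDATA[a]]>]]&gt;<![CDATA[b]]>
```

Note that no other sanitizing is applied to the input: the single replacement is the only transformation, which is the point made above about CDATA and sanitizing being exclusive alternatives.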

Other benefits of CDATA are:

- It already works in XHTML (only for documents served with an XHTML media type, such as application/xhtml+xml, and hence not for IE, which doesn't support XHTML media types).
- It is valid (i.e., allowed by the specs) for all versions of X/HTML, with the sole exception of "non-X" HTML5, which disallows it quite explicitly. All versions of XHTML support it because it's defined in XML itself; and pre-HTML5 versions support it because they are spec'ed as SGML-based, and SGML also defines <![CDATA[...]]>.
- It allows for greater flexibility: CDATA blocks aren't bound to any specific element. Actually, an element may contain both CDATA blocks and structured children. If the content that needs to be CDATA'ed actually matches an element, it's enough to wrap the block with that element.
- In the event a CDATA block needs to contain the closing "]]>" sequence, this can be achieved (the syntax to do that is a bit verbose, but this is a corner case and it is solvable). On the other hand, for the <code type="html"> suggestion, it's impossible to define it in a way that allows all potential uses of "</code>" within the element (note that it would prevent escaping the "<" of that tag, since entity references would be taken literally instead of as actual references).

The main drawback of CDATA in non-X HTML is the lack of browser support (even though it has been part of the HTML standards since HTML 2 or even earlier). However, any possible solution to this use-case will suffer from the same issue, so we will have to wait for wide browser support in any case.

Note: I know that HTML5 is *not* SGML, and I understand the reasons for that choice. I'm not asking for HTML5 to be SGML; I'm only proposing to scavenge one specific feature from SGML because it is a good solution to the use-case described, and it has the benefits listed above.

Regards,
Eduard Pascual
Received on Thursday, 8 April 2010 00:14:38 UTC