Re: HTML 5 from Eduard Pascual on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Thu, 8 Apr 2010 02:13:51 +0200
To: "T.J. Crowder" <tj@crowdersoftware.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <m2t6ea53251004071713r48cadf95h6426e1bd96ba81a9@mail.gmail.com>

Honestly, I think you are missing a key point about the CDATA concept:
when dealing with untrusted content (like user input), CDATA and
typical sanitizing are *mutually exclusive* alternatives: you shouldn't
sanitize content that will go inside a <![CDATA[ ... ]]> block, other
than replacing any "]]>" as I already described; and you shouldn't
sanitize the resulting CDATA block either.

The beauty of CDATA is that nothing, absolutely nothing, after the
<![CDATA[ opener is parsed as markup until the closing ]]> is found.
That's exactly what CDATA does; nothing more, and nothing less. And
that's why it doesn't need sanitizing: the content is taken as plain
text, not as markup, regardless of how markup'ish it may look. The
replacement "]]>" => "]]>]]&gt;<![CDATA[" I suggested previously isn't
really an "escape" in the strict sense of the term: it just closes the
CDATA block (with the "]]>" part), inserts the "]]&gt;" to get it
rendered as "]]>" (since it's outside any CDATA block, the entity
reference is expanded as usual), and then opens a new CDATA block
(with the remaining "<![CDATA[" part of the replacement) to enclose
the remaining content. That's why I said that it is "dodged" rather
than "escaped".
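
To make the "dodge" concrete, here is a minimal Python sketch of that single replacement. The helper names (wrap_in_cdata, round_trip) are made up for illustration, and the check uses Python's XML parser on a <pre> element, so it only says something about XML/XHTML parsing, not about how a text/html parser would treat the same input.

```python
import xml.etree.ElementTree as ET

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_in_cdata(text: str) -> str:
    """Wrap text in a CDATA block (hypothetical helper, not a library API)."""
    # Each embedded "]]>" closes the current block, is emitted as "]]&gt;"
    # outside of any CDATA block, and is followed by a freshly opened block.
    dodged = text.replace(CDATA_CLOSE, CDATA_CLOSE + "]]&gt;" + CDATA_OPEN)
    return CDATA_OPEN + dodged + CDATA_CLOSE


def round_trip(text: str) -> str:
    """Parse <pre>...</pre> as XML and return the character data it holds."""
    root = ET.fromstring("<pre>" + wrap_in_cdata(text) + "</pre>")
    return root.text or ""


# Markup-looking input that also happens to contain "]]>".
sample = 'if (a[b[0]]>1) { document.write("<script>evil()</script>"); }'
wrapped = wrap_in_cdata(sample)
recovered = round_trip(sample)
```

If the reasoning above holds, recovered is identical to sample: the parser sees only character data, and no <script> element is ever created.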

On Wed, Apr 7, 2010 at 10:45 PM, T.J. Crowder <tj@crowdersoftware.com> wrote:
>> For content generated programatically, it's quite indifferent to use
>>
>> CDATA or to escape stuff.
>
> No, there's a very large difference. Currently, if I have user-generated
> content,
It seems I owe you a bit of clarification here: I tried to
differentiate between purely programmatic content and user-provided
content. The former would include stuff such as a date string
generated from a timestamp, an include that chooses one of a reduced
set of static files, or geolocation info from the client's IP address,
to give some examples; this is stuff generated by scripts without any
direct contribution from the user. In these cases, the program/script
will already know what kind of content it's generating, and what
it should escape; and most server-side scripting technologies provide
a variety of facilities to handle escaping when it is needed; so CDATA
should be irrelevant for that kind of content.
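
As a rough illustration of the programmatic case, a script that builds a date string from a timestamp can simply call its platform's escaping facility. The sketch below uses Python's standard html.escape; the function name render_timestamp and the label format (with a deliberately awkward "<UTC>" suffix) are assumptions made up for the example.

```python
from datetime import datetime, timezone
from html import escape


def render_timestamp(ts: int) -> str:
    """Build a paragraph from a Unix timestamp (hypothetical helper)."""
    # The format string is an assumption; "<UTC>" is there only so that
    # the generated text contains characters that need escaping.
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    label = moment.strftime("%Y-%m-%d %H:%M <UTC>")
    # The script knows it is emitting text content, so it escapes it.
    return "<p>Generated: " + escape(label) + "</p>"


html_fragment = render_timestamp(0)
```

The script knows exactly where text ends and markup begins, which is why CDATA adds nothing here.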

For the case of user-provided content, CDATA isn't much better than
a good sanitizer, but it's just as good as one: putting
the user stuff inside the CDATA block ensures that nothing within it
will have any special meaning for the browser. You only need to
prevent the user from closing the CDATA block with a "]]>" (which
would allow adding actual code afterwards), which is achieved with a
single string replacement operation. However, the goal of the CDATA
proposal is *not* to sanitize content.

Looking at the original post:
On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> If you want to present HTML-code in a browser, you have to write &lt;
> instead of <.
You see, I was not making my assumption blindly. I was just
trying to address the need expressed by Georg. Since escaping wouldn't
be a real problem if the content were script-generated, I *did* assume
that the focus was on hand-authored content.
I am suggesting CDATA because it was created exactly to address this
kind of need (saving the pain of manually escaping everything when
lots of special chars need to be rendered as content).

There is no mention of user-provided content in Georg's post. I was
the one who mentioned it. And I did that to highlight that Georg's
proposed solution (ignoring special chars within a <code> element with
a specific attribute) would open up injection risks if used for
user-provided content (and Murphy's Law requires us to assume that,
sooner or later, someone would do that if it's allowed). However, this
is only a side benefit of CDATA over Georg's proposal.

Other benefits of CDATA are:
- It already works in XHTML (only for documents served with an XHTML
media type, such as application/xhtml+xml, and hence not for IE, which
doesn't support XHTML media types).
- It is valid (i.e. allowed as per the specs) for all versions of X/HTML,
with the sole exception of "non-X" HTML5, which disallows it quite
explicitly. All versions of XHTML support it because it's defined in
XML itself; and pre-HTML5 versions support it because they are spec'ed
as SGML-based, and SGML also defines <![CDATA[...]]>.
- It allows for greater flexibility: CDATA blocks aren't bound to any
specific element. Actually, an element may contain both CDATA blocks
and structured children. If the content that needs to be CDATA'ed
actually corresponds to an element, it's enough to wrap the block in
that element.
- In the event a CDATA block needs to contain the closing "]]>"
sequence, this can be achieved (even if the syntax to do that is a bit
verbose, this is a corner case and is solvable). On the other hand,
for the <code type="html"> suggestion, it's impossible to define it in
a way that allows all potential uses of "</code>" within the element
(note that it would prevent escaping the "<" on that tag, since entity
references would be taken literally instead of as actual references).
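
The flexibility point in the list above (CDATA blocks mixed with structured children inside one element) can be checked with a small Python sketch. It relies on Python's XML parser, so it reflects XML/XHTML behaviour only; the fragment itself is an invented example.

```python
import xml.etree.ElementTree as ET

# One <p> element holding a CDATA block, a real <em> child,
# and a second CDATA block (the content is a made-up example).
fragment = (
    "<p>"
    "<![CDATA[Write <em> & </em> around a word to get ]]>"
    "<em>emphasis</em>"
    "<![CDATA[ (no &lt; needed here).]]>"
    "</p>"
)

p = ET.fromstring(fragment)
literal_before = p.text    # character data before the <em> child
child = p[0]               # the only real element child
literal_after = child.tail  # character data after the <em> child
```

Under XML rules only the <em> written outside the CDATA blocks becomes an element; everything inside them, including "&lt;", stays literal text.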

The main drawback of CDATA in non-X HTML is the lack of browser
support (despite it having been part of the HTML standards since HTML 2
or even earlier). However, any possible solution to this use case will
suffer from the same issue, so we will have to wait for wide browser
support in any case.

Note: I know that HTML5 is *not* SGML, and I understand the reasons
for that choice. I'm not asking for HTML5 to be SGML; but only
proposing to scavenge one specific feature from SGML because it is a
good solution to the use-case described, and it has the benefits
listed above.

Regards,
Eduard Pascual
Received on Thursday, 8 April 2010 00:14:38 UTC