Re: HTML 5 from T.J. Crowder on 2010-04-07 (public-html-comments@w3.org from April 2010)

From: T.J. Crowder <tj@crowdersoftware.com>
Date: Wed, 7 Apr 2010 06:49:12 +0100
To: Eduard Pascual <herenvardo@gmail.com>
Cc: gesteehr@googlemail.com, public-html-comments@w3.org
Message-ID: <p2rc95470a1004062249zbc35206bh3d0c3f6cfc2341c2@mail.gmail.com>
> <[CDATA[ ... ]]>. This is far easier to
> sanitize (you just need to ensure that the input doesn't include the
> "]]>" sequence), thus being more usable on user-provided content.

What makes ]]> easier to defend against than </code>? In fact, I think ]]>
is probably more susceptible to injection vulnerabilities in the wild than
</code> for the simple reason that some naive HTML encoders only encode
ampersand and <, not >, and so while the latter would be defeated by extant
software (turned into &lt;/code>), the former would not be.
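
As a rough illustration of that point, here is a minimal sketch of such a naive encoder. The function name naive_encode and the sample strings are hypothetical, made up for this example; it is not taken from any particular library. It only shows which terminator survives the encoding, not a full attack.

```python
def naive_encode(text: str) -> str:
    """Hypothetical naive encoder: escapes only '&' and '<', never '>'."""
    # '&' is replaced first so the '&' in the generated '&lt;' is not re-escaped.
    return text.replace("&", "&amp;").replace("<", "&lt;")


code_sample = "some text </code> more text"
cdata_sample = "some text ]]> more text"

encoded_code = naive_encode(code_sample)
encoded_cdata = naive_encode(cdata_sample)

# "</code>" no longer appears: its leading '<' was turned into '&lt;'.
print("</code>" in encoded_code)

# "]]>" contains neither '&' nor '<', so it passes through unchanged.
print("]]>" in encoded_cdata)
```

An encoder that also escaped > would break up both sequences; the sketch only assumes the weaker kind described above.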

But as a software engineer, I'm not excited by <code> having special
encoding characteristics. In addition to treating < as already-escaped,
would & be treated as already-escaped as well? What does that mean for
entities? This makes life very difficult for people trying to pre-process
user-generated content for display within HTML, because the rules become
markedly more complex.
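
For comparison, the rule a pre-processor can rely on today for ordinary text content is a single one. The sketch below uses Python's standard html module; the function name encode_for_display and the sample strings are assumptions made for illustration. It checks that decoding the character references, as a browser does for text content, gives back the original input, including input that itself contains an entity.

```python
import html


def encode_for_display(user_text: str) -> str:
    """Escape user text for ordinary HTML text content (not attribute values)."""
    # quote=False escapes '&', '<' and '>' only, which is enough for text content.
    return html.escape(user_text, quote=False)


samples = [
    "a < b && c > d",
    "&lt; is the entity for <",
    "</code>",
]

for sample in samples:
    encoded = encode_for_display(sample)
    # Decoding the character references restores the input exactly.
    assert html.unescape(encoded) == sample
    print(encoded)
```

If <code> had its own encoding characteristics, a pre-processor would need a second rule set for that context, and the second sample (user text that already contains "&lt;") is exactly the case whose treatment would have to be defined.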
--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com


On 7 April 2010 00:02, Eduard Pascual <herenvardo@gmail.com> wrote:

> On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> > Hello,
> >
> > I’ve a new idea for the HTML 5 specification.
> >
> > If you want to present HTML-code in a browser, you have to write &lt;
> > instead of <.
> That's not entirely true, see below.
>
> > My Idea is, to include a type-attribute in the code-tag.
> >
> > If the type is “html” or “xml” webmaster don’t have to write &lt;. They
> > can simply write <.
> >
> > Then the browser should show <.
> >
> > If you have questions for my idea, you can ask me.
>
> The idea itself is not bad; but it has important drawbacks:
>
> First, the code added there would need to be parsed, so if a <code> is
> included there it will be paired with the corresponding </code>. This
> then makes the page itself very brittle: any mistake in the code
> intended to display may have very nasty side-effects (most
> prominently, too many or too few </code>'s would miserably break the
> page down to a nonsensical mess).
>
> In addition, it may create a serious security vulnerability on sites
> that allow users to provide content (discussion boards and blogs
> accepting comments are among the best known examples, but there are
> many others; for example, the archives of this very mailing list will
> render the content I'm writing now as HTML). Imagine that, on a
> phpBB-style board, a malicious user posts a message containing
> "[code=html]</code><script>..." ([code]...[/code] is phpBB's own tag
> to introduce a code snippet on the post). It may seem natural to turn
> the BBtag into a <code type="html"> element; but it would be closed by
> the initial "</code>" and then the <script> would be executed... it
> may check for phpBB login cookies and harvest users' names and
> passwords; and that's just one possible attack off the top of my head,
> so imagine what someone actually evil might come up with.
>
> Of course, that's just a matter of input sanitizing; but is there any
> real need to open a new gate for injection-based attacks? Also, note
> that with your proposal sanitizing is far from trivial: a code
> snippet may include a legitimate </code> if it also includes matching
> <code>'s; so rather than just escaping or breaking instances of a
> specific text, you need to parse the whole input to infer the
> structure. And, btw, poor design or simple bugs on such a parser can
> open the gate to DoS attacks.
>
>
> In any case, on XHTML documents, your problem is already addressed by
> a feature from XML itself: <[CDATA[ ... ]]>. This is far easier to
> sanitize (you just need to ensure that the input doesn't include the
> "]]>" sequence), thus being more usable on user-provided content. The
> only drawback of it is that "non-X" HTML doesn't support it (except
> within MathML and SVG content; but that doesn't address the use case).
>
> So, the solution you propose is quite broken, but your use case
> (presenting HTML code samples within HTML pages in a way that is saner
> to author and maintain) is a quite good one.
> Because of that, I'd like to propose allowing <[CDATA[ ... ]]>
> generally. I'm sure there has been some reason it isn't allowed
> yet; but it may be worth reviewing the reasons for that in light
> of a use case.
>
> The main drawback I can think of is compatibility; but it may be fine
> to wait for older browsers to die off before relying on this; and I
> don't expect legacy content to break with a change like this
> (after all, anyone who ever types "<[CDATA[" most probably knows that
> the leading "<" should be escaped as "&lt;" for it to render as text).
> Also, trying to look through the list archive for discussions on the
> topic has been quite fruitless (searching for "CDATA" yields too many
> unrelated results; while searching for "<[CDATA[" to refine the search
> yields an ugly error); so if this has been discussed before I hope
> some of the "veterans" in the list can at least point to the
> discussions or summarize the issues.
>
> Regards,
> Eduard Pascual
>
>
Received on Wednesday, 7 April 2010 05:50:07 UTC