- From: T.J. Crowder <tj@crowdersoftware.com>
- Date: Wed, 7 Apr 2010 06:49:12 +0100
- To: Eduard Pascual <herenvardo@gmail.com>
- Cc: gesteehr@googlemail.com, public-html-comments@w3.org
- Message-ID: <p2rc95470a1004062249zbc35206bh3d0c3f6cfc2341c2@mail.gmail.com>
> <[CDATA[ ... ]]>. This is far easier to sanitize (you just need to
> ensure that the input doesn't include the "]]>" sequence), thus being
> more usable on user-provided content.

What makes ]]> easier to defend against than </code>? In fact, I think
]]> is probably more susceptible to injection vulnerabilities in the
wild than </code>, for the simple reason that some naive HTML encoders
only encode ampersand and <, not >. So while an injected </code> would
be defeated by extant software (turned into &lt;/code>), an injected
]]> would not be.

But as a software engineer, I'm not excited by <code> having special
encoding characteristics. In addition to treating < as already-escaped,
would & be treated as already-escaped as well? What does that mean for
entities? This makes life very difficult for people trying to
pre-process user-generated content for display within HTML: the rules
become markedly more complex.

--
T.J. Crowder
Independent Software Consultant
tj / crowder software / com
www.crowdersoftware.com

On 7 April 2010 00:02, Eduard Pascual <herenvardo@gmail.com> wrote:

> On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> > Hello,
> >
> > I’ve a new idea for the HTML 5 specification.
> >
> > If you want to present HTML-code in a browser, you have to write
> > &lt; instead of <.
> That's not entirely true, see below.
>
> > My Idea is, to include a type-attribute in the code-tag.
> >
> > If the type is “html” or “xml” webmaster don’t have to write &lt;.
> > They can simply write <.
> >
> > Than the browser should show <.
> >
> > If you have questions for my idea, you can ask me.
>
> The idea itself is not bad; but it has important drawbacks:
>
> First, the code added there would need to be parsed, so if a <code> is
> included there it will be paired with the corresponding </code>. This
> then makes the page itself very brittle: any mistake in the code
> intended to display may have very nasty side-effects (most
> prominently, too many or too few </code>'s would miserably break the
> page down to a nonsensical mess).
>
> In addition, it may create a serious security vulnerability on sites
> that allow users to provide content (discussion boards and blogs
> accepting comments are among the best known examples, but there are
> many others; for example, the archives of this very mailing list will
> render the content I'm writing now as HTML). Imagine that, on a
> phpBB-style board, a malicious user posts a message containing
> "[code=html]</code><script>..." ([code]...[/code] is phpBB's own tag
> to introduce a code snippet on the post). It may seem natural to turn
> the BBtag into a <code type="html"> element; but it would be closed by
> the initial "</code>" and then the <script> would be executed... it
> may check for phpBB login cookies and harvest user's names and
> passwords; and that's just a possible attack from the top of my head,
> so imagine what someone actually evil might come up with.
>
> Of course, that's just a matter of input sanitizing; but is there any
> real need to open a new gate for injection-based attacks? Also, note
> that with your proposal sanitizing is quite more than trivial: a code
> snippet may include a legitimate </code> if it also included matching
> <code>'s; so rather than just escaping or breaking instances of a
> specific text, you need to parse the whole input to infer the
> structure. And, btw, poor design or simple bugs on such a parser can
> open the gate to DoS attacks.
>
> In any case, on XHTML documents, your problem is already addressed by
> a feature from XML itself: <[CDATA[ ... ]]>. This is far easier to
> sanitize (you just need to ensure that the input doesn't include the
> "]]>" sequence), thus being more usable on user-provided content. The
> only drawback of it is that "non-X" HTML doesn't support it (except
> within MathML and SVG content; but that doesn't address the use case).
>
> So, the solution you propose is quite broken, but your use case
> (presenting HTML code samples within HTML pages in a way that is saner
> to author and maintain) is a quite good one.
> Because of that, I'd like to propose allowing <[CDATA[ ... ]]>
> generally. I'm sure there has been some reason it isn't that allowed
> yet; but it may be worth reviewing the reasons for that in the arise
> of a use case.
>
> The main drawback I can think of is compatibility; but it may be fine
> to wait for older browsers to die off before relying on this; and I
> don't expect for legacy content to break with a change like this
> (after all, anyone who ever types "<[CDATA[" most probably knows that
> the leading "<" should be escaped as "&lt;" for it to render as text).
> Also, trying to look through the list archive for discussions on the
> topic has been quite fruitless (searching for "CDATA" yields too many
> unrelated results; while searching for "<[CDATA[" to refine the search
> yields an ugly error); so if this has been discussed before I hope
> some of the "veterans" in the list can at least point to the
> discussions or summarize the issues.
>
> Regards,
> Eduard Pascual
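
The point above about naive encoders can be illustrated with a minimal Python sketch. The function names and the sample payloads are hypothetical, and the encoder is only an assumed model of software that escapes "&" and "<" but leaves ">" alone; it is not taken from any particular library.

```python
def naive_encode(text: str) -> str:
    """Assumed model of a naive encoder: escapes '&' and '<' only, never '>'."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def closes_code_element(encoded: str) -> bool:
    """True if a literal </code> end tag survives in the encoded text."""
    return "</code>" in encoded.lower()


def closes_cdata_section(encoded: str) -> bool:
    """True if the CDATA terminator ]]> survives in the encoded text."""
    return "]]>" in encoded


# Hypothetical user-supplied payloads, one per delimiter under discussion.
code_attack = "</code><script>alert(1)</script>"
cdata_attack = "]]><script>alert(1)</script>"

encoded_code = naive_encode(code_attack)
encoded_cdata = naive_encode(cdata_attack)

print(encoded_code)   # &lt;/code>&lt;script>alert(1)&lt;/script>
print(encoded_cdata)  # ]]>&lt;script>alert(1)&lt;/script>

# The </code> terminator is neutralised because it starts with '<' ...
print(closes_code_element(encoded_code))    # False
# ... but ]]> contains neither '&' nor '<', so it passes through untouched.
print(closes_cdata_section(encoded_cdata))  # True
```

Under this assumed encoder the "]]>" terminator is the one that survives, so a CDATA-based scheme would still need its own dedicated check for that sequence.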
Received on Wednesday, 7 April 2010 05:50:07 UTC