From: Arthur Clifford <art@artspad.net>
Date: Tue, 6 Apr 2010 17:10:30 -0700
Cc: <public-html-comments@w3.org>
Message-ID: <00f601cad5e6$bc7614d0$0e14a8c0@iMacPCVirtualMachine>
If this is meant for code that is added programmatically, what about
something as simple as a content-length attribute on the <code> or <pre>
tag that specifies how many characters of data sit between the opening and
closing tags? Then you wouldn't need to worry about special characters at
all. The user agent would read in that many characters, would know that the
next thing it should hit is </, and could throw a validation error if it
isn't. The tags could also have a MIME-type attribute so that the user
agent could syntax-color the code or preformatted text when rendering it.
With that, the W3C could work toward a code-hinting standard, and CSS for
code hinting, for HTML6.
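
A minimal sketch of the length-prefixed read described above, in Python.
The attribute name "contentlength" and the function read_raw_element are
hypothetical (no such attribute exists in HTML), and the sketch counts
characters rather than encoded bytes:

```python
import re

# Hypothetical attribute name; nothing like it exists in HTML today.
OPEN_TAG = re.compile(r'<(code|pre)\s+contentlength="(\d+)"\s*>')


def read_raw_element(source: str) -> str:
    """Return the raw body of the first length-prefixed <code>/<pre> element."""
    match = OPEN_TAG.search(source)
    if match is None:
        raise ValueError("no length-prefixed element found")
    tag, length = match.group(1), int(match.group(2))
    start = match.end()
    # Take exactly `length` characters without looking at what they contain.
    body = source[start:start + length]
    if len(body) < length:
        raise ValueError("document ends before the declared length")
    # The very next thing must be the close tag; otherwise the length is wrong.
    if not source.startswith(f"</{tag}>", start + length):
        raise ValueError("declared length does not match the content")
    return body


snippet = "<p>hi</p></code><script>"
document = f'<code contentlength="{len(snippet)}">{snippet}</code>'
raw = read_raw_element(document)
```

Because the body is skipped over by count, a stray </code> inside it is
never interpreted; a wrong count is reported as an error instead.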

I've never been fond of the CDATA section syntax and have felt that XML
elements should have a contentLength attribute.

Of course I also think there should be a BEXML (Binary Enabled XML) standard
that allows traditional XML markup with the addition of a BData tag that
includes a content length attribute. In thinking about that, it dawned on me
that you could have data with any characters you want with the combination
of content length and mime type.

But I digress. Anything read in for code or pre text should be treated as
read-and-display, with no processing. It could be read in and displayed as
text, but it WOULD be nice not to have to worry about escaping characters
when saving or returning data. Also, if someone embedded PHP or other
server-side code whose output is not the same length as the source, the
content length would be wrong and the page would not, or should not, load
correctly.


Arthur Clifford

On Tue, Apr 6, 2010 at 5:00 PM, Georg <gesteehr@googlemail.com> wrote:
> Hello,
> I've a new idea for the HTML 5 specification.
> If you want to present HTML-code in a browser, you have to write &lt;
> instead of <.
That's not entirely true, see below.

> My idea is to include a type attribute in the code tag.
> If the type is "html" or "xml", webmasters don't have to write &lt;. They
> simply write <.
> Then the browser should show <.
> If you have questions about my idea, you can ask me.

The idea itself is not bad; but it has important drawbacks:

First, the code added there would still need to be parsed, so that a
<code> included in it gets paired with the corresponding </code>. This
makes the page itself very brittle: any mistake in the code intended
for display may have very nasty side effects (most prominently, too
many or too few </code>'s would break the page down into a nonsensical
mess).

In addition, it may create a serious security vulnerability on sites
that allow users to provide content (discussion boards and blogs
accepting comments are among the best known examples, but there are
many others; for example, the archives of this very mailing list will
render the content I'm writing now as HTML). Imagine that, on a
phpBB-style board, a malicious user posts a message containing
"[code=html]</code><script>..." ([code]...[/code] is phpBB's own tag
to introduce a code snippet on the post). It may seem natural to turn
the BBtag into a <code type="html"> element; but it would be closed by
the initial "</code>" and then the <script> would be executed... it
might check for phpBB login cookies and harvest user names and
passwords; and that's just one possible attack off the top of my head,
so imagine what someone actually evil might come up with.
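
The attack above can be sketched in Python. The two render functions and
the BBCODE pattern are hypothetical illustrations, not phpBB's actual
code; only the [code=html] block is handled, and the rest of the post is
left alone:

```python
import html
import re

# Matches a phpBB-style [code=html]...[/code] block (simplified).
BBCODE = re.compile(r"\[code=html\](.*?)\[/code\]", re.DOTALL)


def render_naive(post: str) -> str:
    # Under the proposal, the snippet would be passed through unescaped.
    return BBCODE.sub(lambda m: f'<code type="html">{m.group(1)}</code>', post)


def render_escaped(post: str) -> str:
    # Current practice: escape <, > and & so the snippet stays plain text.
    return BBCODE.sub(lambda m: f"<code>{html.escape(m.group(1))}</code>", post)


post = "[code=html]</code><script>alert(1)</script>[/code]"
naive = render_naive(post)
escaped = render_escaped(post)
```

In the naive output the user-supplied </code> closes the element at once
and a live <script> follows it; in the escaped output no tag survives.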

Of course, that's just a matter of input sanitizing; but is there any
real need to open a new gate for injection-based attacks? Also, note
that with your proposal sanitizing is far from trivial: a code
snippet may include a legitimate </code> if it also includes a matching
<code>; so rather than just escaping or breaking instances of a
specific string, you need to parse the whole input to infer its
structure. And, btw, poor design or simple bugs in such a parser can
open the gate to DoS attacks.
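
To give an idea of the extra work, here is a rough Python sketch of the
structural check a sanitizer would need. The function name
code_tags_balanced is made up for illustration, and a regex scan like
this is a simplification: it ignores comments, attribute values that
contain ">", and other parsing details a real sanitizer must handle.

```python
import re

# Matches <code ...> and </code>, but not e.g. <coder>.
CODE_TAG = re.compile(r"<(/?)code\b[^>]*>", re.IGNORECASE)


def code_tags_balanced(snippet: str) -> bool:
    """Return True if every </code> in the snippet closes an earlier <code>."""
    depth = 0
    for match in CODE_TAG.finditer(snippet):
        if match.group(1):  # a closing tag
            depth -= 1
            if depth < 0:
                # This </code> would close the enclosing element.
                return False
        else:  # an opening tag
            depth += 1
    # Any <code> left open would swallow the rest of the page.
    return depth == 0
```

A snippet such as "</code><script>" is rejected, while
"<code>x</code>" passes; escaping every "<" needs no such bookkeeping.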

In any case, in XHTML documents, your problem is already addressed by
a feature of XML itself: <![CDATA[ ... ]]>. This is far easier to
sanitize (you just need to ensure that the input doesn't include the
"]]>" sequence), and thus more usable for user-provided content. Its
only drawback is that "non-X" HTML doesn't support it (except
within MathML and SVG content; but that doesn't address the use case).
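
A minimal Python sketch of that sanitizing step. Instead of rejecting
input that contains "]]>", it uses the common trick of splitting the
sequence across two adjacent CDATA sections; wrap_cdata is a
hypothetical helper name, and characters that XML forbids outright
(most control characters) are not handled here:

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting any "]]>" so it cannot end the section."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


snippet = "<p>if (a[b[0]]>1) {}</p>"
document = "<code>" + wrap_cdata(snippet) + "</code>"

# An XML parser gives back the original text, markup characters included.
parsed = ET.fromstring(document).text
```

The check is a single string replacement, with no need to understand
the structure of the snippet being wrapped.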

So, the solution you propose is quite broken, but your use case
(presenting HTML code samples within HTML pages in a way that is saner
to author and maintain) is quite a good one.
Because of that, I'd like to propose allowing <![CDATA[ ... ]]>
generally. I'm sure there has been some reason it isn't allowed
yet; but it may be worth reviewing that reason in light of this
use case.

The main drawback I can think of is compatibility; but it may be fine
to wait for older browsers to die off before relying on this; and I
don't expect legacy content to break with a change like this
(after all, anyone who ever types "<![CDATA[" most probably knows that
the leading "<" should be escaped as "&lt;" for it to render as text).
Also, trying to look through the list archive for discussions on the
topic has been quite fruitless (searching for "CDATA" yields too many
unrelated results; while searching for "<[CDATA[" to refine the search
yields an ugly error); so if this has been discussed before I hope
some of the "veterans" in the list can at least point to the
discussions or summarize the issues.

Eduard Pascual
