RE: HTML 5 from Arthur Clifford on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: Arthur Clifford <art@artspad.net>
Date: Wed, 7 Apr 2010 19:03:38 -0700
To: <public-html-comments@w3.org>
Message-ID: <016b01cad6bf$b52ccc30$0e14a8c0@iMacPCVirtualMachine>
I mentioned <code> when I probably meant <samp>, but in earlier posts I did
mention <pre>. For user-specified code that needs to be displayed as code
and not interpreted, there are (or were) tags for doing that. These tags
should be considered to have CDATA as their child node, so that the CDATA
markup itself is not necessary.

The problem is what happens when you want HTML, XML, or any other
syntax-heavy code in one of those tags, in such a way that it is not
treated as parsable/executable code. The solution to date is to use entities
and other escapes, which are interpreted as visual characters so that the
rendered result looks like the intended syntax-heavy code.
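
To make the current practice concrete, here is a minimal sketch in Python of
the entity-escaping approach. The helper name escape_for_display and the
sample string are made up for illustration; html.escape and html.unescape
are from the Python standard library.

```python
import html


def escape_for_display(sample: str) -> str:
    """Replace markup-significant characters with entities.

    After escaping, a browser renders the text literally instead of
    parsing it as markup.
    """
    return html.escape(sample, quote=True)


# Hypothetical user-submitted sample code.
sample = '<p class="note">Fish & chips</p>'
escaped = escape_for_display(sample)

print(escaped)
# The escaping is reversible, so the original text is not lost.
print(html.unescape(escaped) == sample)
```

The cost is that the stored or transmitted text no longer looks like the
code the user typed, which is what the rest of this post is about.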

In the context of blogs, the user-specified data, say HTML sample code, is
going to be sent to a server, stored, and eventually returned in a blog page
somewhere. In that process any text provided will be, or should be, run
through sanitizers. The general trend is to escape everything that would
confuse the HTML/XML parser.

The debate, then, is whether it is time to consider an alternative where
sanitizing is not necessary because the content block is known to be treated
as plain text. If so, what is the approach?

T.J. was suggesting the XML-industry-standard CDATA section approach.

I was suggesting contentLength (or maybe just length) as an attribute, thus
negating the need for escaping anything in the text content. My thinking, in
the user-submitted blog context, is that the server is going to be yanking
the content from somewhere and dynamically putting it in a page, so it would
be really easy for it to programmatically get the length when sending back
responses. The only reason to use contentLength would be that you know you
have characters that would confuse the browser, so such an attribute would
have to be optional.
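
As a rough sketch of how that could work, the Python below has the server
emit the length and a reader skip exactly that many characters instead of
scanning for a closing tag. The contentLength attribute is only the proposal
from this thread, not an existing HTML feature. Counting the length in
characters rather than bytes, and the function names, are assumptions made
for the example.

```python
import re

# Hypothetical attribute from this thread; not part of any HTML spec.
OPEN_TAG = re.compile(r'<pre\s+contentLength="(\d+)">')


def wrap_with_length(raw: str) -> str:
    """Server side: emit the raw text unescaped, prefixed by its length."""
    return f'<pre contentLength="{len(raw)}">{raw}</pre>'


def read_length_delimited(page: str) -> str:
    """Reader side: take exactly contentLength characters as plain text."""
    match = OPEN_TAG.search(page)
    if match is None:
        raise ValueError("no <pre contentLength=...> element found")
    length = int(match.group(1))
    start = match.end()
    end = start + length
    # The declared length must land exactly on the real closing tag.
    if end > len(page) or not page.startswith("</pre>", end):
        raise ValueError("declared length does not line up with </pre>")
    return page[start:end]


# The sample deliberately contains markup, including a stray </pre>.
raw = 'if (a < b) { print("</pre>"); }'
page = "<p>intro</p>" + wrap_with_length(raw) + "<p>after</p>"
print(read_length_delimited(page) == raw)
```

Because the reader never looks inside the counted span, even a literal
</pre> in the sample does not end the element early.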

I suggested a mime attribute; I should have called it a syntax attribute.
If you have pre or samp or whatever treated as CDATA, where you know the
text is text, then there's also the possibility of allowing a savvy
user agent to make sample code even more readable by providing syntax
coloring and indenting, which can be very useful for developer blogs,
especially for longer examples.

The other suggestion was to provide an attribute for a special end sequence,
borrowing from the PHP Heredoc technique where you have:
<<<CUSTOM_ENDING
Text
More stuff
CUSTOM_ENDING

So, I'm thinking:
<pre end="CUSTOM_ENDING">
Text
More stuff
CUSTOM_ENDING
</pre>

The user agent would have to know to leave out CUSTOM_ENDING though.
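
A minimal Python sketch of how a reader might handle that: it treats
everything up to a line holding only the end sequence, followed by </pre>,
as plain text, and leaves the end sequence out of the result. The end
attribute is only the proposal above, and the function name and the exact
terminator rule are assumptions made for the example.

```python
import re

# Hypothetical attribute from this thread; not part of any HTML spec.
OPEN_TAG = re.compile(r'<pre\s+end="([^"]+)">\n?')


def read_until_marker(page: str) -> str:
    """Return the text between the open tag and the end-sequence line."""
    match = OPEN_TAG.search(page)
    if match is None:
        raise ValueError("no <pre end=...> element found")
    marker = match.group(1)
    body_start = match.end()
    # Assumed rule: the marker sits alone on a line, directly before </pre>.
    terminator = "\n" + marker + "\n</pre>"
    body_end = page.find(terminator, body_start)
    if body_end == -1:
        raise ValueError("end sequence not found before </pre>")
    # The marker itself is not part of the displayed content.
    return page[body_start:body_end]


page = (
    '<pre end="CUSTOM_ENDING">\n'
    "Text\n"
    "More stuff with </pre> and <b> inside\n"
    "CUSTOM_ENDING\n"
    "</pre>"
)
print(read_until_marker(page))
```

The author of the content picks an end sequence that does not occur in the
text, the same trade-off a heredoc makes.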

If you wanted to include/exclude entity sanitizing in the pre and samp tags
you could have an attribute for that as well.

Any solution that is chosen would need to accommodate folks who are
sanitizing things, as well as work with the newer technique.

Obviously, there's a common practice in place, but is it a good practice or
one that has been necessary because nobody's taken the time to address this
issue in detail?

T.J., FYI, I agree in principle with the need for CDATA-equivalent
functionality, but I'd rather see pre or samp or an equivalent html tag be
used and updated to include new optional attributes/parameters. Of course,
until every browser complies correctly, you'll probably have to detect
browser version and sanitize content anyway :/

Art
Received on Thursday, 8 April 2010 02:03:54 UTC