Re: HTML 5 from Eduard Pascual on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Thu, 8 Apr 2010 05:06:26 +0200
To: art@artspad.net
Cc: public-html-comments@w3.org
Message-ID: <j2i6ea53251004072006obb3dbe3asb0ac4903adff93df@mail.gmail.com>

It looks like you are mixing up T.J.'s posts with mine.

On Thu, Apr 8, 2010 at 4:03 AM, Arthur Clifford <art@artspad.net> wrote:
> I mentioned <code> when I probably meant <samp> but in earlier posts I did
> mention <pre>. For user specified code that needs to be displayed as code
> and not interpreted there are or were tags for doing that. These tags should
> be considered to have CDATA as their child node so that the CDATA tag itself
> should not be necessary.
In general, using an actual element for this task has some drawbacks:
- The CDATA content is bound to be the content of a leaf element: it
spans the whole element, which can have no children, because parsing
a child's '<' opener would contradict the very purpose of the
element. CDATA gives more flexibility.
- The sampled code may itself use the same element, and the sample may
be just a fragment, so it can't be expected to be well-formed. This
makes detecting the end of the CDATA-ish element quite challenging,
and impossible in the general case (at least with the existing HTML
parsing rules).
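
A minimal Python sketch of the second drawback, assuming a hypothetical
rule where the element's raw content simply ends at the first matching
close tag. The helper name naive_raw_content is invented for this
illustration and does not model any real HTML parser:

```python
def naive_raw_content(markup: str, tag: str) -> str:
    """Return the text between <tag> and the first </tag>, unparsed.

    Hypothetical rule for illustration only; no real parser is modelled.
    """
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    start = markup.index(open_tag) + len(open_tag)
    end = markup.index(close_tag, start)  # the first close tag wins
    return markup[start:end]


# The author wants this fragment displayed verbatim:
sample = "Use <pre>like this</pre> for preformatted text."
page = f"<pre>{sample}</pre>"

extracted = naive_raw_content(page, "pre")
print(extracted)  # stops at the </pre> that belongs to the sample
```

Counting nested openers would not rescue this either, since a fragment
such as a lone closing tag is unbalanced by definition.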

> The debate then is whether it is time to consider an alternative where
> sanitizing is not necessary because the content block is known to be treated
> as plain text? If so, what is the approach?
Sanitizing / escaping through scripts seems quite trivial. The big
issue here is manually authored content, since manually escaping so
many characters is tedious and error-prone. Special emphasis on
error-prone: IMO it's the main reason this should be addressed.
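
As a rough illustration of the scripted case, here is a sketch using
Python's standard html.escape. The choice of Python and the sample
snippet are assumptions for this example; nothing in the thread depends
on them:

```python
import html

# A code sample someone wants to show on a page, unescaped.
snippet = 'if (a < b && c > d) { print("<b>done</b>"); }'

# quote=False leaves the double quotes alone, which is enough for
# element content (attribute values would need quote=True).
escaped = html.escape(snippet, quote=False)
print(escaped)

# The round trip recovers the original text.
assert html.unescape(escaped) == snippet
```

Doing the same by hand means rewriting every '<', '>' and '&' in the
sample, which is where the mistakes creep in.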

> T.J. was suggesting the XML-industry-standard CData tag approach.
Actually, it was me; but don't worry, I'm not fame-hungry :P

> I was suggesting contentLength (or maybe just length) as an attribute, thus
> negating the need for escaping anything in the text content. My thinking, in
> the user-submitted blog context, is that the server is going to be yanking
> the content from somewhere and dynamically putting it in a page, it would be
> really easy for it to programmatically get the length when sending back
> responses. The only reason to use contentLength would be that you know you
> have characters that would confuse the browser, and such an attribute would
> have to be optional.
This has two core drawbacks I already mentioned:
1) It misses a (quite important, IMO) part of the use-cases; and
2) It may make a document much more brittle in the face of data loss
due to network issues.
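
A small sketch of the second drawback, assuming a made-up syntax in
which the opening tag carries contentLength="N" and the parser takes
the next N characters verbatim. The function read_with_length and the
exact attribute syntax are hypothetical, for illustration only:

```python
def read_with_length(stream: str, tag: str) -> tuple[str, str]:
    """Read '<tag contentLength="N">' and take N characters verbatim.

    Hypothetical syntax modelled on the contentLength idea; not real HTML.
    Returns (raw_content, rest_of_document).
    """
    prefix = f'<{tag} contentLength="'
    start = stream.index(prefix) + len(prefix)
    quote = stream.index('"', start)
    length = int(stream[start:quote])
    body_start = stream.index(">", quote) + 1
    body_end = body_start + length
    return stream[body_start:body_end], stream[body_end:]


code = "a < b"
intact = f'<pre contentLength="{len(code)}">{code}</pre><p>Next paragraph</p>'
content, rest = read_with_length(intact, "pre")

# Simulate a transfer that dropped two characters of the sample.
damaged = intact.replace("a < b", "a <")
bad_content, bad_rest = read_with_length(damaged, "pre")
print(repr(bad_content), repr(bad_rest))
```

In the damaged case the fixed-length read swallows the start of the
closing tag, so the error spreads into the markup that follows instead
of staying inside the sample.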

> I suggested a mime attribute, I should have called it a syntax attribute.
> Because if you have pre or samp or whatever treated as cdata where you know
> the text is text, then there's also the possibility of allowing a savvy
> user-agent to make sample code even more readable by providing syntax
> coloring and indenting; which can be very useful for developer blogs
> especially for longer examples.
Again, this is really interesting, but it is unrelated to the problem
Georg posted. Also, I'm not sure if this is something that should be
addressed by CSS rather than HTML.

> The other suggestion was to provide an attribute for a special end sequence;
> borrowing from the PHP Heredoc technique where you have <<<CUSTOM_ENDING
> Text
> More stuff
> CUSTOM_ENDING
>
> So, I'm thinking:
> <pre end="CUSTOM_ENDING">
> Text
> More stuff
> CUSTOM_ENDING
> </pre>
>
> The user agent would have to know to leave out CUSTOM_ENDING though.
OK, this is roughly on par with CDATA, at least in technical
benefits and costs: both approaches need UAs to update their parsers,
and both require some non-trivial updates to the spec text. Your idea,
however, sacrifices the flexibility to have both CDATA and parsed
content within the same element, which might be useful in some
circumstances. For example, you could have something like this:
<code><![CDATA[
   (some code here...)
]]>
<mark class="error" title="Compiler error 1234: that ain't work"><![CDATA[ offending code line here ]]></mark>
<![CDATA[
   (and more code here...)
]]></code>
which would look in the DOM as:
<code>
    TEXT
    <mark>
        TEXT
    TEXT
reflecting the natural structure of the content (i.e. a single code
block containing text, a <mark> with its own text inside, and more
text). Any way of implementing this with your proposal would either
sacrifice the <mark> or put multiple <code> nodes in the DOM, even
though the whole thing is a single block in nature.
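
The XML serialization can be checked today. The sketch below uses
Python's standard xml.etree.ElementTree as a stand-in XML parser (not
an HTML one), and the code lines inside the CDATA sections are made up
for the example:

```python
import xml.etree.ElementTree as ET

doc = (
    "<code><![CDATA[if (a < b) {\n]]>"
    '<mark class="error"><![CDATA[  x = a && b <;\n]]></mark>'
    "<![CDATA[}\n]]></code>"
)

code = ET.fromstring(doc)
mark = code.find("mark")

before = code.text  # text before <mark>
marked = mark.text  # text inside <mark>
after = mark.tail   # text after <mark>, still inside <code>

print(repr(before), repr(marked), repr(after))
print(len(code))  # number of child elements of <code>
```

The '<' and '&&' characters come through unescaped, and the result is
still one <code> element with a single <mark> child.
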
Besides the better flexibility, CDATA has the slight advantages of
being based on an existing web-related technology, so some degree of
implementation experience is already available, and of making the
feature consistent between the "soup" and XML serializations
(remember that XHTML inherently has CDATA, which is part of XML).

> If you wanted to include/exclude entity sanitizing in the pre and samp tags
> you could have an attribute for that as well.
>
> Any solution that is chosen, would need to accommodate folks who are
> sanitizing things as well as work with the newer technique.
>
> Obviously, there's a common practice in place, but is it a good practice or
> one that has been necessary because nobody's taken the time to address this
> issue in detail?
>
> T.J. FYI, I agree in principal with the need for cdata equivalent
> functionality, but I'd rather see pre or samp or an equivalent html tag be
> used and updated to include new optional attributes/parameters. Of course,
> until every browser complies correctly, you'll probably have to detect
> browser version and sanitize content anyway :/
Again, this is not about sanitizing. The original poster made no
mention of sanitizing. I did mention it, just to describe a drawback
of the original proposal, but it's not the issue being addressed.
In fact, I regret having commented on it. It was a quite
tangential issue, corner-case fool-proofing, and it has diverted the
whole thread from what the original post asked for.

Regards,
Eduard Pascual
Received on Thursday, 8 April 2010 03:07:14 UTC