Re: HTML 5 from Eduard Pascual on 2010-04-08 (public-html-comments@w3.org from April 2010)

From: Eduard Pascual <herenvardo@gmail.com>
Date: Thu, 8 Apr 2010 07:40:46 +0200
To: art@artspad.net
Cc: public-html-comments@w3.org
Message-ID: <x2h6ea53251004072240x5057ddcfgf62ad7acc3ea76d8@mail.gmail.com>
On Thu, Apr 8, 2010 at 6:10 AM, Arthur Clifford <art@artspad.net> wrote:
> How would having cdata blocks be any more beneficial than multiple code
> blocks.
CDATA blocks aren't in themselves beneficial over multiple code
blocks. The *freedom* to use a single or multiple <code> elements
depending on the natural structure of the content is the benefit. It's
a matter of *flexibility*: CDATA doesn't force you into any structure,
because it's a "super-escape" rather than a structuring tool. It is
mere syntactic sugar to avoid having to manually escape dozens of
special characters in a relatively short fragment. Since the original
poster described the need in terms of escaping, a solution that handles
exactly that without side effects seems the right choice.
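
To make the "super-escape" point concrete, here is a minimal sketch in
Python (chosen only for illustration; the sample fragment is made up).
It parses the same content twice with the standard library's XML
parser, once hand-escaped and once inside a CDATA section, and checks
that both yield identical text with no child elements:

```python
import xml.etree.ElementTree as ET

# The same fragment written twice: hand-escaped, and inside a CDATA section.
escaped = "<code>&lt;span&gt;hello &amp; goodbye&lt;/span&gt;</code>"
cdata = "<code><![CDATA[<span>hello & goodbye</span>]]></code>"

escaped_text = ET.fromstring(escaped).text
cdata_text = ET.fromstring(cdata).text

print(escaped_text)                # <span>hello & goodbye</span>
print(escaped_text == cdata_text)  # True
```

Note that this only shows the XML behaviour; it says nothing about how
a text/html parser would treat the same bytes.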

> The idea is to markup a chunk of text as text.
This is the hyperTEXT markup language. Everything in a document is
either *text*, or markup describing such text. Describing a chunk of
text as being text is essentially pointless: the mere fact of not
being markup already marks it as text.
Oh, wait, we are speaking of dealing with those chunks of text that
look like markup... it's obvious that using markup for that is very
likely to make things messy and error-prone (thus providing not much
benefit over simply manually escaping the code).
> If you wanted a
> section of html that had a combination of code and markup you would make a
> div tag and have whatever markup you want with code/pre blocks.
If I want anything your proposal can do, I could do it with either
code/pre blocks or with CDATA. If I want a single <code> with sporadic
<mark>s within it, however, the only approach that would allow me to
save the escapes within the code is to use CDATA.
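
As a rough sketch of that last case (Python and its XML parser are my
choice for illustration, and the sample markup is adapted from the
example quoted below), a single <code> element can carry unescaped
content in CDATA sections and still have a <mark> child around the
offending tag:

```python
import xml.etree.ElementTree as ET

# One <code> element; CDATA sections carry the raw text, <mark> stays real markup.
sample = (
    "<code>"
    "<![CDATA[<span>hello world]]>"
    "<mark><![CDATA[</code>]]></mark>"
    "<![CDATA[<html></body>]]>"
    "</code>"
)

code = ET.fromstring(sample)
mark = code.find("mark")

print(len(code))                  # 1 child element: the <mark>
print(mark.text)                  # </code>
print("".join(code.itertext()))   # <span>hello world</code><html></body>
```

The structure (one <code>, one <mark>) comes from the elements, and the
escaping comes from CDATA; neither gets in the way of the other.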
> The content
> of the block just needs to be treated as text, preferably without having to
> escape anything. So:
>
> <div>
> <h3>Here's some example of my code, why doesn't this work?</h3>
> <code syntax="html">
>        <html>
>        <body>
>        <span>hello world</code><html></body>
> </code>
> <span class="sig" ><a href="mailto:confused@users.com">Confused
> User</a></span>
>
> DOM:
> Div
>  h3
>  code
>  span
>    a
Thanks for proving my points. Essentially, your code would yield
something like this:
(<html>) (implicit; if it's not included in the HTML file, the UA will add it)
    (<body>) (implicit again)
        <div>
            <h3>
            <code>
            error: unexpected <html> (not sure what the UA is supposed to do here)
error: unexpected "</code> <span class="sig" ><a href="mailto:confused@users.com">Confused User</a></span>" after </body>
Note how much a </code> where a </span> was intended messes things up.
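
A quick way to see this is to run the sample through a tokenizer. The
sketch below uses Python's html.parser, which is only a tokenizer and
does not implement a browser's tree construction or error recovery, so
take it as an approximation. The TagLogger class is a hypothetical
helper written for this example. It lists the tags seen between
<code> and the first </code>:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Records start/end tag events in document order (tokens only, no tree)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

sample = """<div>
<h3>Here's some example of my code, why doesn't this work?</h3>
<code syntax="html">
       <html>
       <body>
       <span>hello world</code><html></body>
</code>
<span class="sig" ><a href="mailto:confused@users.com">Confused User</a></span>
"""

logger = TagLogger()
logger.feed(sample)
logger.close()

# Everything between <code> and the FIRST </code> the tokenizer meets.
open_at = logger.events.index(("start", "code"))
close_at = logger.events.index(("end", "code"))
inside = logger.events[open_at + 1:close_at]
print(inside)  # [('start', 'html'), ('start', 'body'), ('start', 'span')]
```

The sample's <html>, <body> and <span> are all reported as real tags,
and the mistyped </code> is the one that ends the block, leaving the
intended closing </code> as a stray end tag.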
> Or:
>
> <div>
> <h3>Bad code</h3>
> <code syntax="html" end="END">
>        <html>
>        <body>
>        <span>hello world</code><html></body>
>        END
> </code>
> <h3>Good code</h3>
> <code syntax="html" end="END">
>        <html>
>        <body>
>                <span>hello world</span>
>        </body>
>        </html>
>        END
> </code>
> <span class="sig" ><a href="mailto:confused@users.com">Confused
> User</a></span>
> </div>
Yep, that would work, and be as safe as CDATA. However, wouldn't it be
interesting to be able to <mark> the difference between the "bad" and
the "good" code samples? Your approach doesn't allow that. The closest
you could get would be breaking each sample into three <code> tags.
Then if you also wanted to add a pale red background to the bad code,
things begin to get ugly...
Why should we settle for a solution that addresses only an arbitrary
subset of the use cases, when for the same cost we can have one that
addresses them all?


> That's what the pre and
>
> As to XML
> I know Wikipedia is hardly the best place to quote. But as it mentions:
> " CDATA-type element content
> An SGML DTD may declare an element's content as being of type CDATA. Within
> a CDATA-type element, no markup will be processed. It is similar to a CDATA
> section in XML, but has no special boundary markup, as it applies to the
> entire element."
Yes, and it's used heavily... to describe the content model of
attributes. Actually, it's quite a good thing, for sanity's sake, that
attributes can't contain elements or other attributes. After all, for
element content, SGML already provides the <![CDATA[ ... ]]> syntax,
so there is no need in SGML-based languages to define any element's
content model as CDATA: the language user can have unparsed content
by explicitly requesting it.
Note that if <code> were defined as CDATA, it could never, ever, have
children. This means that JS script libraries that provide syntax
highlighting, for example, would stop working (they basically boil
down to wrapping each token within <span class="keyword">, <span
class="variable">, and so on). It also means that highlighting a part
of the code with <mark> (or with any other markup mechanism) would be
doomed to fail.
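
The effect can be sketched with Python's html.parser, which treats
<script> and <style> as CDATA content elements. Using <script> as a
stand-in for a hypothetical CDATA-typed <code> is my assumption for
the sake of illustration, and the Collector class and scan() helper
are made up for this example:

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    """Collects the start tags and the character data the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

def scan(markup):
    parser = Collector()
    parser.feed(markup)
    parser.close()
    # Data may arrive in several chunks, so join it before comparing.
    return parser.start_tags, "".join(parser.text)

# What a highlighting library would typically emit for one token.
highlighted = '<span class="keyword">var</span> x;'

code_tags, code_text = scan("<code>" + highlighted + "</code>")
script_tags, script_text = scan("<script>" + highlighted + "</script>")

print(code_tags, repr(code_text))      # ['code', 'span'] 'var x;'
print(script_tags, repr(script_text))  # ['script'] followed by the raw markup
```

Inside the ordinary element the <span> is a real child; inside the
CDATA-type element the very same characters are just text, so the
highlighting markup would be displayed instead of applied.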
The one element that has long been defined as CDATA was <title> (I
guess this was to discourage aberrations such as
"<title><marquee><blink>The best webpage in the
world!!!!!!!!!!!!!!!!!!!!</blink></marquee></title>" and mere
annoyances such as trying to change the colors in the title via
<font>).

> SO, it is also industry standard to define a tag as being of TYPE CDATA so
> UAs know not to process markup. I would bet that xhtml defines the pre tag
> as cdata and the html/xhtml standard says to render it using a monospaced
> font/style.
As a matter of fact, you just lost the bet:
http://www.w3.org/TR/xhtml1/dtds.html. The "original" version of
XHTML, straight from the source. Also note that styling is an entirely
separate topic (handled by CSS and by "default rendering" rules that
are generally implemented as UA stylesheets).

> If you want XML syntax, use XHTML. Personally (and yes it is
> just my opinion) I find the use of CDATA tag in xml as a hack solution for
> when a schema is poorly defined or incomplete.
It's not a hack. It's just a shorthand for manually escaping all the
stuff inside it; just like <td nowrap> is a shorthand for <td
nowrap="nowrap">; and just like in C/C++/C#/JS/Java/etc something like
"++x" is just a shorthand for "x = x + 1".
> All you are doing is marking
> up something as text and you aren't defining the document structure.
Almost. You aren't marking it up as text, nor marking it at all: you
are just telling the browser "hey, take all of this as if everything
was escaped, because manually escaping would be overkill". To
define the document structure, you just use elements. Each tool for
the job it's meant for.

> I don't care either way about the content length option, it obviously
> wouldn't help manual data entry at all. But if a connection is flaky you are
> going to have weird results no matter what. And if anything, you'd want some
> way to know that text treated that should be text is not rendered nor
> anything after it be rendered as anything other than text. THAT is actually
> the best argument in favor of escaping and sanitizing, because if anything
> hiccups it means you have non-functioning escaped text rather than
> potentially harmful scripts.
Browsers are already quite careful with what scripts they execute. But
they are very eager to render content. After all, should the user be
left without access to the content just because a network connection
isn't doing its best? In many cases a lost byte will only mean a lost
character.
Actually, I have faced the issue of a disk with many HTML documents
being damaged, and I'm really thankful for the effort Firefox put into
rendering the remaining data, as it allowed me to salvage some
valuable data.
But in any case, the main issue with the contentLength solution is not
the brittleness (that's quite secondary), but the fact that it doesn't
address the main subset of use cases.

> Someobody brought up injection concerns earlier.
Yes. It was me. I have already said that it was *just a secondary
issue* of the original proposal by Georg (it would only affect
corner-cases with blatant design problems).
> However, a browser *should* know packets or data were dropped in
> transfer and cease rendering content as a basic safety measure.
Fortunately, browsers are smarter than that. They will be careful with
running scripts and other interactive content (they'll probably try
anyway), but rendering stuff on the screen is quite harmless.
Actually, it's better to render a bunch of gibberish from which maybe
some data can be salvaged than refusing to try at all.
> I'm not
> deeply familiar with the http standard, isn't there something in the
> handshaking between client and server to deal with that?
IIRC, there is a Content-Length header the server sends with the
content as part of the response. A quick Google lookup yields
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13 .
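
As a rough illustration of how a client could use that header to
notice a truncated transfer, here is a minimal Python sketch. The
function name body_is_complete and the sample response are made up for
this example, and it deliberately ignores chunked transfer coding,
compression and everything else a real HTTP client has to handle:

```python
def body_is_complete(raw_response: bytes) -> bool:
    """Return True if the body length matches the declared Content-Length."""
    head, _, body = raw_response.partition(b"\r\n\r\n")
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            return len(body) == int(value.strip())
    raise ValueError("response has no Content-Length header")

full = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"<p>hello</p>\n"
)
truncated = full[:-5]  # simulate losing the last five bytes in transit

print(body_is_complete(full))       # True
print(body_is_complete(truncated))  # False
```

Whether the client then refuses to render or shows what it has is a
separate policy decision; the header only tells it that something is
missing.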

Regards,
Eduard Pascual
Received on Thursday, 8 April 2010 05:41:49 UTC