RE: HTML 5 from Arthur Clifford on 2010-04-07 (public-html-comments@w3.org from April 2010)

From: Arthur Clifford <art@artspad.net>
Date: Tue, 6 Apr 2010 19:06:20 -0700
To: <public-html-comments@w3.org>
Message-ID: <010001cad5f6$eb293d60$0e14a8c0@iMacPCVirtualMachine>
Well, from the general perspective of importance, this is pretty much a
non-issue because things can be escaped and everything does it.

If this were a discussion about optimizing html, then I'd say that the need
to escape and unescape text before and after every request is generally a
waste of processing ticks if there is a way to avoid it. Granted, these days
there are more than enough processing ticks to spare. But at some point
everybody counted on their fingers; that doesn't mean you don't consider a
calculator.

Then there's the situation where eventually someone will say, "Hey, I want
to include html that has sample xml in my html cdata tag, but my sample xml
has cdata in it, and so it has a closing cdata tag." At which point you are
back to the same problem of having to escape and unescape syntax. Whereas
with content length you could include very complex
xml/html/javascript/whatever and it wouldn't matter, because the parser
would jump to the end of the content and know that whatever it got is a blob
of text to display.

I wouldn't mind an actual cdata tag over the hacky <![CDATA[ syntax.
Something along the lines of:
<cdata contentLength="..." mime="text/javascript">...</cdata>

You know, an official tag you can do things like apply styles to, provide an
id for, and have javascript access to the content of. However, it being
cdata, the mime type would not tell a user agent to render the text as
something other than text; it would just give the user agent options for
formatting the text.

If not content length, then perhaps a delimiter attribute:
<cdata delimiter="***" >Some really awesome marked up text ***</cdata>

The problem with deducing the end of a data chunk, and not knowing what you
have in it, is that the user agent has to deduce it and whatever marker you
are using to deduce the end has to be escaped or not included. If the parser
knows how long the data chunk is, it doesn't have to deduce anything and
thus you have full freedom. The delimiter technique would give the parser
something to read ahead for that could be a sequence you can simply not
include in the data. As for mime typing, the more the user agent knows
about what it renders, even if it is to be treated as plain text, the more
flexibility it has to do something interesting.
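
A minimal sketch of the two approaches in Python, for illustration only. The
<cdata> element and its contentLength and delimiter attributes are the
hypothetical syntax proposed in this message; no real parser supports them.
The function names are made up, and the sketch assumes the length is counted
in characters rather than bytes.

```python
import re

# Hypothetical syntax from this message; not part of HTML or XML.
OPEN_LEN = re.compile(r'<cdata contentLength="(\d+)"[^>]*>')
OPEN_DELIM = re.compile(r'<cdata delimiter="([^"]+)"[^>]*>')
CLOSE = "</cdata>"


def read_by_length(doc: str) -> str:
    """Return the payload of the first length-prefixed <cdata> element."""
    m = OPEN_LEN.search(doc)
    if m is None:
        raise ValueError("no length-prefixed <cdata> element found")
    start = m.end()
    end = start + int(m.group(1))
    # Jump straight to the declared end; the payload itself is never scanned.
    if not doc.startswith(CLOSE, end):
        raise ValueError("contentLength does not line up with </cdata>")
    return doc[start:end]


def read_by_delimiter(doc: str) -> str:
    """Return the payload of the first delimiter-terminated <cdata> element."""
    m = OPEN_DELIM.search(doc)
    if m is None:
        raise ValueError("no delimited <cdata> element found")
    delim = m.group(1)
    start = m.end()
    # Read ahead for the author-chosen delimiter instead of a fixed marker.
    end = doc.find(delim, start)
    if end == -1 or not doc.startswith(CLOSE, end + len(delim)):
        raise ValueError("delimiter followed by </cdata> not found")
    return doc[start:end]


# The payload contains both closing markers, yet needs no escaping.
payload = "nested ]]> and </cdata> and <b>markup</b>"
by_length = f'<cdata contentLength="{len(payload)}" mime="text/plain">{payload}</cdata>'
by_delim = '<cdata delimiter="***" >Some really awesome marked up text ***</cdata>'
```

Calling read_by_length(by_length) hands back the payload untouched, closing
markers included. read_by_delimiter(by_delim) returns everything up to the
"***" sequence, so the only constraint on the author is picking a delimiter
that does not occur in the data.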

In terms of processing, every language has its strlen or String.length()
equivalent and it should be a lot less processing to include that info in a
tag attribute than it is to do a search and replace on every special
character that might choke the html parser .. twice (during storage and
during display)!

Art

Arthur Clifford


-----Original Message-----
From: Eduard Pascual [mailto:herenvardo@gmail.com] 
Sent: Tuesday, April 06, 2010 6:35 PM
To: art@artspad.net
Cc: public-html-comments@w3.org
Subject: Re: HTML 5

On Wed, Apr 7, 2010 at 2:10 AM, Arthur Clifford <art@artspad.net> wrote:
> If this is something that would be for code added more programmatically
IMO, that's the least important use-case: as far as I know, most
programming/scripting languages used on the web (both client- and
server-side) have some facility that turns the escaping task into
something trivial (like PHP's htmlspecialchars() function). Even when
such a facility is not available, the task is quite simple with raw
text-based replacing utilities (just remember to do "&"=>"&amp;"
first, so you don't re-escape the &'s that are inserted as part of the
other escapes).
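
To illustrate why the order matters, here is a small Python sketch. The two
helper functions are hypothetical names written for this example;
html.escape() and html.unescape() are Python's standard-library counterparts
to the kind of facility mentioned above.

```python
import html


def escape_amp_first(text: str) -> str:
    """Escape '&' before '<' and '>', as recommended above."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_amp_last(text: str) -> str:
    """Wrong order: the '&' inserted by the other escapes is escaped again."""
    return text.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;")


sample = "if (a < b && b > c)"

good = escape_amp_first(sample)
bad = escape_amp_last(sample)

# The correct order matches the standard library and round-trips cleanly.
matches_stdlib = good == html.escape(sample, quote=False)
good_round_trip = html.unescape(good) == sample

# The wrong order double-escapes, so one unescape pass does not recover it.
bad_round_trip = html.unescape(bad) == sample
```

With the wrong order, "<" ends up as "&amp;lt;", which a browser would
display as the literal text "&lt;" rather than as "<".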

The CDATA approach is a more general solution.

> what about something as simple as a content length attribute on the <code>
> or <pre> tag that would allow you to specify how many bytes of data are
> between the beginning and close tag?
This could easily become more painful than actually escaping the
content if you ever need to manually edit it; so you are reducing the
use-case even further to only those cases where the content is only
handled programmatically. Not a very good "solution" if it only solves
a probably small fraction of the problems.

> Then you wouldn't need to worry about special characters at all. The User
> agent would read in x number of characters, would know the next thing it
> should hit is </ and if it doesn't then it can throw a validation error.
Then a single byte is lost due to a bad connection and, instead of
just missing a character, the whole page breaks miserably. This
doesn't fit very well with HTML5's "make sure even the craziest
tag-soup renders as close to the author's intent as possible"
philosophy. A UA could try to play smart in these cases, but it risks
messing things up. What if the lost byte was part of the
length attribute? Part of the snippet would be parsed as actual page
code, which leads to injection issues again (it's not an intentional
attack, but code being injected by pure randomness in a browser isn't
something vendors will be happy to implement in their UAs).
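
A rough Python sketch of that failure mode, under assumptions: the
<cdata contentLength="..."> syntax is the hypothetical proposal from this
thread, and the function below models a lenient UA that trusts the declared
length and hands whatever follows back to the markup parser.

```python
import re

# Hypothetical length-prefixed element; no real UA implements this.
OPEN = re.compile(r'<cdata contentLength="(\d+)">')


def split_at_declared_length(doc: str) -> tuple[str, str]:
    """Return (text taken as raw data, remainder handed to the markup parser)."""
    m = OPEN.search(doc)
    if m is None:
        raise ValueError("no length-prefixed <cdata> element found")
    start = m.end()
    end = start + int(m.group(1))
    return doc[start:end], doc[end:]


payload = "Example: <script>alert('demo')</script>"
declared = str(len(payload))

# Document as the author sent it.
intact = f'<cdata contentLength="{declared}">{payload}</cdata>'
# Same document after the last digit of the length is lost in transit.
damaged = f'<cdata contentLength="{declared[:-1]}">{payload}</cdata>'

intact_text, intact_rest = split_at_declared_length(intact)
damaged_text, damaged_rest = split_at_declared_length(damaged)

# In the damaged case the remainder still holds the <script> element,
# which a lenient parser would now treat as live markup.
script_leaked = "<script>" in damaged_rest
```

With the intact document, the remainder is just the closing tag. With the
damaged one, most of the snippet falls outside the declared length and lands
in the part that gets parsed as page code.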

> The tags could also have a mime attribute so that
> the user agent when rendering code/preformatted text could color code
> syntax. With that W3 could work toward a code hinting standard and CSS for
> code hinting for HTML6.
While I think this is interesting, it happens to be entirely unrelated
to the problem/use-case being discussed and to the potential
solution you proposed.
I would invite you to branch it to a new thread, but the discussion
would essentially boil down to the fact that, on HTML's part, syntax
highlighting could already be applied with data-* attributes plus CSS
attribute selectors; so there is no point in adding further hooks
as long as CSS doesn't provide anything for syntax highlighting. If
you still want to discuss this, then go ahead and start a new thread
on the topic. (From my own experience, I can assure you that trying to
discuss two separate topics on the same thread can easily become
nearly insane).

> I've never been fond of the cdata tag syntax
If there is any technical issue with the syntax itself (your personal
dislike may be an issue, but if it's "personal" then it's not
"technical"), HTML5 could define a different syntax for the same task
(after all, <![CDATA[ ... ]]> is not part of HTML, it's only part of
XML, that's why it's readily available for XHTML). Off the top of my
head, things like "<< ... >>", "<[[ ... ]]>", "<! ... !>" could work
(actually, the last would clash with HTML's legacy inheritance
from SGML, such as the <!-- --> comment syntax; but these are just
examples).

> and have felt that xml elements
> should have a contentLength attribute.
If you have use-cases with specific requirements that are addressed by
such a feature, go ahead and start a new thread to discuss the idea.
If you can't materialize that "feeling", however, I wouldn't put much
hope in it being taken into account by the editor.

>
> Of course I also think there should be a BEXML (Binary Enabled XML)
> standard that allows traditional XML markup with the addition of a BData
> tag that includes a content length attribute. In thinking about that, it
> dawned on me that you could have data with any characters you want with
> the combination of content length and mime type.
That's another feature suggestion without a hint of the use-cases or
requirements. Same comments as above. However, that one sounds
interesting; so if you have use-cases that warrant some discussion, be
assured I'm going to be closely following the topic :P

> But I digress. Anything read in for code or pre text should be treated as
> read and display with no processing. It could be read in and displayed as
> text, but it WOULD be nice not to have to worry about escaping characters
> when saving or returning data. Also, if someone put php or other
> server-side code in that didn't result in output that is the same length
> after the php processing then the content length would be wrong and the
> page would not, or should not, load correctly.
I already mentioned it in my reply to the OP, but the key issue here
is: if you ignore the markup while reading the contents of a given
element, how do you know where the element is? If you are watching
only for the specific "</code>" tag, how do you differentiate that one
from a pair "<code> ... </code>" within the snippet?
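
The ambiguity can be sketched in a few lines of Python. The function is a
hypothetical, deliberately naive reader that ignores all markup inside the
element and stops at the first "</code>" it sees; it is not how any real
HTML parser works.

```python
def raw_text_until_close(doc: str, tag: str = "code") -> str:
    """Return everything between the first <tag> and the next </tag>."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    start = doc.index(open_tag) + len(open_tag)
    # Markup is ignored, so the first closing tag wins, nested or not.
    end = doc.index(close_tag, start)
    return doc[start:end]


# A snippet that itself demonstrates the <code> element.
snippet = "Use <code>x</code> for inline code."
doc = f"<code>{snippet}</code>"

extracted = raw_text_until_close(doc)
```

Here extracted comes back as "Use <code>x": the reader stops at the inner
closing tag, and the rest of the snippet spills out into the page.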
>
> Art
>
> Arthur Clifford
Received on Wednesday, 7 April 2010 02:10:48 UTC