- From: Eduard Pascual <herenvardo@gmail.com>
- Date: Thu, 8 Apr 2010 07:40:46 +0200
- To: art@artspad.net
- Cc: public-html-comments@w3.org
On Thu, Apr 8, 2010 at 6:10 AM, Arthur Clifford <art@artspad.net> wrote:
> How would having cdata blocks be any more beneficial than multiple code
> blocks.

CDATA blocks aren't in themselves beneficial over multiple code blocks. The *freedom* to use a single or multiple <code> elements depending on the natural structure of the content is the benefit. It's a matter of *flexibility*: CDATA doesn't force you into any structure, because it's a "super-escape" rather than a structuring tool. It's mere syntactic sugar to avoid having to manually escape dozens of special characters in a relatively short fragment. Since the original poster described the need in terms of escaping, a solution that handles exactly that without side effects seems the right choice.

> The idea is to markup a chunk of text as text.

This is the hyperTEXT markup language. Everything in a document is either *text*, or markup describing such text. Describing a chunk of text as being text is essentially pointless: the mere fact of not being markup already marks it up as text. Oh, wait, we are speaking of dealing with those chunks of text that look like markup... it's obvious that using markup for that is very likely to make things messy and error-prone (thus providing not much benefit over simply manually escaping the code).

> If you wanted a
> section of html that had a combination of code and markup you would make a
> div tag and have whatever markup you want with code/pre blocks.

If I want anything your proposal can do, I could do it with either code/pre blocks or with CDATA. If I want a single <code> with sporadic <mark>s within it, however, the only approach that would allow me to save the escapes within the code is to use CDATA.

> The content
> of the block just needs to be treated as text, preferably without having to
> escape anything.
> So:
>
> <div>
> <h3>Here's some example of my code, why doesn't this work?</h3>
> <code syntax="html">
> <html>
> <body>
> <span>hello world</code><html></body>
> </code>
> <span class="sig" ><a href="mailto:confused@users.com">Confused
> User</a></span>
>
> DOM:
> Div
> h3
> code
> span
> a

Thanks for proving my points. Essentially, your code would yield something like this:

(<html>) (implicit, but if it's not included in the html file the UA will add it)
(<body>) (implicit again)
<div>
<h3>
<code>
error: unexpected <html> (not sure what the UA is supposed to do here)
error: unexpected "</code> <span class="sig" ><a href="mailto:confused@users.com">Confused User</a></span>" after </body>

Note how much a </code> where a </span> was intended messes things up.

> Or:
>
> <div>
> <h3>Bad code</h3>
> <code syntax="html" end="END">
> <html>
> <body>
> <span>hello world</code><html></body>
> END
> </code>
> <h3>Good code</h3>
> <code syntax="html" end="END">
> <html>
> <body>
> <span>hello world</span>
> </body>
> </html>
> END
> </code>
> <span class="sig" ><a href="mailto:confused@users.com">Confused
> User</a></span>
> </div>

Yep, that would work, and be as safe as CDATA. However, wouldn't it be interesting to be able to <mark> the difference between the "bad" and the "good" code samples? Your approach doesn't allow that. The closest you could get would be breaking each sample into three <code> tags. Then if you also wanted to add a pale red background to the bad code, things begin to get ugly... Why should we settle for a solution that addresses only an arbitrary subset of the use cases, when for the same cost we can have one that addresses them all?

> That's what the pre and
>
> As to XML
> I know Wikipedia is hardly the best place to quote. But as it mentions:
> " CDATA-type element content
> An SGML DTD may declare an element's content as being of type CDATA. Within
> a CDATA-type element, no markup will be processed.
> It is similar to a CDATA
> section in XML, but has no special boundary markup, as it applies to the
> entire element."

Yes, and it's used heavily... to describe the content model of attributes. Actually, it's quite a good thing, for sanity's sake, that attributes can't contain elements or other attributes. After all, for element content, SGML already provides the <![CDATA[ ... ]]> syntax, so there is no need in SGML-based languages to define any element's content model as CDATA: the language user can have unparsed content by explicitly requesting it.

Note that if <code> were defined as CDATA, it could never, ever, have children. This means that JS libraries that provide syntax highlighting, for example, would stop working (they basically boil down to wrapping each token within <span class="keyword">, <span class="variable">, and so on). It also means that highlighting a part of the code with <mark> (or with any other markup mechanism) would be doomed to fail. The one element that has long been defined as CDATA was <title> (I guess this was to discourage aberrations such as "<title><marquee><blink>The best webpage in the world!!!!!!!!!!!!!!!!!!!!</blink></marquee></title>" and mere annoyances such as trying to change the colors in the title via <font>).

> SO, it is also industry standard to define a tag as being of TYPE CDATA so
> UAs know not to process markup. I would bet that xhtml defines the pre tag
> as cdata and the html/xhtml standard says to render it using a monospaced
> font/style.

As a matter of fact, you just lost the bet: http://www.w3.org/TR/xhtml1/dtds.html. The "original" version of XHTML, straight from the source. Also note that styling is an entirely separate topic (handled by CSS and by "default rendering" rules that are generally implemented as UA stylesheets).

> If you want XML syntax, use XHTML.
> Personally (and yes it is
> just my opinion) I find the use of CDATA tag in xml as a hack solution for
> when a schema is poorly defined or incomplete.

It's not a hack. It's just a shorthand for manually escaping all the stuff inside it; just like <td nowrap> is a shorthand for <td nowrap="nowrap">; and just like in C/C++/C#/JS/Java/etc something like "++x" is just a shorthand for "x = x + 1".

> All you are doing is marking
> up something as text and you aren't defining the document structure.

Almost. You aren't marking it up as text, nor marking it at all: you are just telling the browser "hey, take all of this as if everything was escaped, because manually escaping would be overkill". To define the document structure, you just use elements. Each tool for the job it's meant for.

> I don't care either way about the content length option, it obviously
> wouldn't help manual data entry at all. But if a connection is flaky you are
> going to have weird results no matter what. And if anything, you'd want some
> way to know that text treated that should be text is not rendered nor
> anything after it be rendered as anything other than text. THAT is actually
> the best argument in favor of escaping and sanitizing, because if anything
> hiccups it means you have non-functioning escaped text rather than
> potentially harmful scripts.

Browsers are already quite careful with what scripts they execute. But they are very eager to render content. After all, should the user be left without access to the content just because a network connection isn't doing its best? In many cases a lost byte will only mean a lost character. Actually, I have faced the issue of a disk with many html documents being damaged; and I'm really thankful for the effort FF put into rendering the remaining data, as it allowed me to salvage some valuable data.
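To make the salvage point concrete, here is a minimal sketch that assumes Python's standard html.parser as a stand-in for a forgiving UA. It is not how FF or any browser recovers content, and the names TextSalvager and salvage_text are made up for this example. It only shows that a tolerant parser can still hand back the text that arrived before the damage.

```python
from html.parser import HTMLParser


class TextSalvager(HTMLParser):
    """Collects whatever character data a tolerant parser can still recover."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty runs of text; tags are simply skipped.
        text = data.strip()
        if text:
            self.chunks.append(text)


def salvage_text(damaged_html):
    """Return the readable text chunks of a possibly truncated HTML document."""
    parser = TextSalvager()
    parser.feed(damaged_html)
    parser.close()  # flush any text still buffered at the point of truncation
    return parser.chunks


# A document cut off mid-paragraph, as after a dropped connection.
truncated = "<html><body><h1>Notes</h1><p>First paragraph.</p><p>Second para"
print(salvage_text(truncated))
```

The missing closing tags don't stop the parser: everything received before the cut is still returned as text.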
But in any case, the main issue with the contentLength solution is not the brittleness (that's quite secondary), but the fact that it doesn't address the main subset of use cases.

> Someobody brought up injection concerns earlier.

Yes. It was me. I have already said that it was *just a secondary issue* of the original proposal by Georg (it would only affect corner cases with blatant design problems).

> However, a browser *should* know packets or data were dropped in
> transfer and cease rendering content as a basic safety measure.

Fortunately, browsers are smarter than that. They will be careful with running scripts and other interactive content (they'll probably try anyway), but rendering stuff on the screen is quite harmless. Actually, it's better to render a bunch of gibberish from which maybe some data can be salvaged than to refuse to try at all.

> I'm not
> deeply familiar with the http standard, isn't there something in the
> handshaking between client and server to deal with that?

IIRC, there is a Content-Length header the server sends with the content as part of the response. A quick Google lookup yields http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.13 .

Regards,
Eduard Pascual
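
As a footnote to the Content-Length point: the header gives the size of the body in bytes as a decimal number, so a client can at least notice that something went missing. A minimal sketch of that check follows, assuming the headers have already been parsed into a dict. The function name body_is_complete is hypothetical, and real UAs also have to cope with responses that carry no Content-Length at all (chunked transfers, for instance), which this ignores.

```python
def body_is_complete(headers, body):
    """Compare the declared Content-Length with the bytes actually received.

    Returns True or False when the header is present and well-formed, and
    None when there is nothing to compare against (missing or malformed).
    """
    declared = None
    for name, value in headers.items():
        if name.lower() == "content-length":  # header names are case-insensitive
            declared = value.strip()
            break
    if declared is None or not (declared.isascii() and declared.isdigit()):
        return None
    return len(body) == int(declared)


headers = {"Content-Type": "text/html", "Content-Length": "18"}
full_body = b"<p>hello world</p>"

print(body_is_complete(headers, full_body))       # all 18 bytes arrived
print(body_is_complete(headers, full_body[:10]))  # connection dropped early
```

Knowing that bytes were lost is a separate matter from deciding what to do about it; as argued above, rendering whatever did arrive is usually the more useful choice.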
Received on Thursday, 8 April 2010 05:41:49 UTC