Re: Cougar DTD: Do not use CDATA declared content for SCRIPT

Arne Knudson (ack@ebt.com)
Fri, 26 Jul 1996 17:14:05 -0400


Date: Fri, 26 Jul 1996 17:14:05 -0400
Message-Id: <199607262114.RAA25133@ebt-inc.ebt.com>
To: www-html@w3.org
From: Arne Knudson <ack@ebt.com>
Subject: Re: Cougar DTD: Do not use CDATA declared content for SCRIPT

>But I still don't see the nightmare. What is markup for except to tell
>the typesetter (browser, in this case) what to do with the content?

     Aargh! I'm guessing that this may have been said before, but it's
critical to understand. HTML is an application of SGML, the Standard
Generalized Markup Language, ISO 8879:1986. What you refer to as "markup"
has **absolutely nothing** to do with the formatting of the document. Markup
is a description of the *structure* of the document. The original
aspirations of SGML were to have 3 parts[1] to every document: the Document
Type Definition (DTD), which uniquely and specifically defines the structure
of a particular "type" of document; the document instance, which is marked
up in accordance with the structure of the specific DTD; and the "stylesheet",
which is the description of how to present information encoded based upon
that DTD. People argued over how to implement the third part, so ISO 8879 only
covered the first two parts, and left the third until later. Just recently,
the ISO passed a new standard, ISO 10179:1996
(<http://occam.sjf.novell.com:8080/dsssl/dsssl96/> or
<ftp://ftp.ebt.com/pub/dsssl/>) describing DSSSL, the Document Style
Semantics and Specification Language.
     The advantage to a separate stylesheet is that the individual "browser"
can choose how to present the semantic information encoded in the
document. My friend Chris, at home, can format his Mosaic stylesheets to
present all his <H1> elements in green helvetica 24 point centered bold
text. And since I hate helvetica, I'll choose to have my browser put my <H1>'s
in 18-point Times instead. Since I also hate the <BLINK> tag (a most hideous
construct if ever I saw one), I'll tell my browser to do absolutely nothing
when I encounter it. But my other (visually impaired) friend, Jane, can sit
at her desk at work and have her voice synthesizer read all the <H1>'s
louder than the <H2>'s, which are louder than the <H3>'s, which are about as
loud as normal text (but at a slightly different inflection). A separate
stylesheet removes the onus of presentation from the author, allowing her to
focus on the semantically significant aspects. And it allows the individual
to customize their presentation in the manner they see fit.
     [end semantic-vs-presentation rant]
     Now I'll get into why there's a nightmare. The SGML standard (which
we're sticking to at the moment) states that a parser is responsible for
starting an element whenever it encounters a STAGO (a less-than bracket, "<"),
followed by text. Most people know this construct as the start-tag, which
looks like <foo>, but there are other constructs that are allowable that we
won't get into here. Furthermore, the parser is responsible for terminating
an element when it encounters an ETAGO (usually, a less-than followed by a
slash, "</"). Most people see this in end-tags, such as </foo>, but again
there are other constructs that are allowable; for this discussion, just
think of an ETAGO as the first part of an end-tag. Now, it's possible to
encode sections of text such that they should not be parsed as SGML --
start-tags should be ignored, as should entities and other SGML-isms. Such
sections of text are called CDATA elements ("Character DATA", as opposed to
PCDATA, or "Parsed Character DATA", as opposed to... well, let's just say
there's a list of 'em). But the standard specifically states that CDATA
elements terminate upon the first occurrence of an ETAGO.[2] (This, I think,
was to make it a little easier on the parsers; it means they don't have to
worry about performing lookahead to figure out whether or not they really
should terminate the element, particularly if they're using shorttag or
other funky things that SGML lets you do.) This means you can't have
end-tags in CDATA elements, PERIOD. Unfortunately, a lot of people want to
have scripts that dynamically generate HTML, and most HTML that I know of
has at least a few end-tags.
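
     To make the problem concrete, here is a minimal sketch in Python of
the termination rule as described above. It is not a real SGML parser: the
name cdata_content is made up for illustration, and the scan simply treats
the first "</" as the ETAGO without modelling shorttag or any other SGML
feature.

```python
ETAGO = "</"  # end-tag open delimiter, as described in the text


def cdata_content(text, start):
    """Return (content, end) for CDATA content that begins at index start.

    Hypothetical helper: the content runs up to the first ETAGO,
    no matter which tag name follows it.
    """
    end = text.find(ETAGO, start)
    if end == -1:
        end = len(text)  # no ETAGO at all: content runs to the end of input
    return text[start:end], end


# A script that writes HTML containing an end-tag.
doc = '<SCRIPT>document.write("<B>hi</B>");</SCRIPT>'
content, end = cdata_content(doc, len("<SCRIPT>"))

print(repr(content))    # what the parser hands over as the script
print(repr(doc[end:]))  # what the parser goes on to treat as markup
```

Under these assumptions the script handed over is cut off just before the
"</B>", and the rest of the line, including the real </SCRIPT> end-tag, is
left for the parser to interpret as markup.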
     There was a proposal a while back to use CDATA marked sections.
Briefly, marked sections are denoted as in the following example:

<![CDATA[this is <foo> text that &bar; should not get parsed]]>

They (usually) start with "<![", and end with "]]>"; inside the first
bracket is a special reserved word denoting what kind of marked section it
is (in this case, CDATA). Then, after the reserved word, is another square
bracket signaling the start of the content of the marked section. In the
above example, since it's declared as a CDATA marked section, a parser would
parse it as:

this is <foo> text that &bar; should not get parsed

Note the fact that the <foo> and &bar; are parsed as raw text, and are
considered neither elements nor entities. Other types of marked sections
include "INCLUDE", meaning a parser should use them, and "IGNORE", which
means that the parser should discard everything up to the close of the
section[3].
     This solves the problem of the end-tags, because the SGML parser
ignores everything up to the "]]>". It introduces a whole new can of worms,
though, because we're forcing the authors to put "<![CDATA[...]]>" inside of
every SCRIPT element to protect the content of the script, which means the
scripting parser will have to know to discard it. Furthermore, if the
programmer should want to (God forbid) put a "]]>" into the document,
they're just as screwed as they were back when they were trying to put
end-tags into CDATA elements.
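
     The same kind of sketch shows both halves of that trade-off. Again
this is only an illustration in Python, not a real parser: the function
name cdata_marked_section is hypothetical, and it assumes the input begins
with the "<![CDATA[" opener.

```python
MS_OPEN = "<![CDATA["
MS_CLOSE = "]]>"


def cdata_marked_section(text):
    """Return (content, rest) for text that begins with a CDATA marked section.

    Hypothetical helper: the section ends at the first "]]>", wherever it is.
    """
    if not text.startswith(MS_OPEN):
        raise ValueError("text does not start with a CDATA marked section")
    start = len(MS_OPEN)
    end = text.find(MS_CLOSE, start)
    if end == -1:
        raise ValueError("unterminated marked section")
    return text[start:end], text[end + len(MS_CLOSE):]


# End-tags are now harmless: the whole script survives.
safe = '<![CDATA[document.write("</B>");]]>'
content, rest = cdata_marked_section(safe)

# But a script that happens to contain "]]>" closes the section early.
broken = '<![CDATA[if (a[b[0]]>c) run();]]>'
content2, rest2 = cdata_marked_section(broken)

print(repr(content), repr(rest))
print(repr(content2), repr(rest2))
```

In the first case the end-tag passes through untouched. In the second, the
array-index comparison supplies a "]]>" of its own, so the section closes
in the middle of the expression and the remainder falls back into the
document.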
     I think the only workable SGML-aware solution is the <SCRIPT SRC="...">
external-reference method. Mixing languages (and SGML *is* a language) is
always dangerous, and you usually have to present the embedded language in
some "special" way. Take, for example, shell scripts; if you want to have
the shell script execute a command as if it were typed into the shell, you
have to do things like "escape" the backslashes and double-quotes and so on.
The string you put into the shell script looks very little like what you'd
type at the command line. Similarly, if you were to embed scripts inline in
HTML, you're either going to have to do it through an external reference, or
special-case anything that could get in the way of a parser.
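
     As one illustration of what such special-casing might look like,
here is a sketch in Python that rewrites every "</" in a script as "<\/"
before it is embedded. The function name protect_etago is made up, and the
approach rests on two assumptions that are not part of the discussion
above: that the scripting language reads "\/" inside a string literal as a
plain "/" (JavaScript does), and that "</" only ever appears inside string
literals.

```python
def protect_etago(script_source):
    """Rewrite every "</" as "<\\/" so no ETAGO appears in the embedded script.

    Hypothetical helper; a blind text replace like this is only safe if
    "</" occurs solely inside string literals of the scripting language.
    """
    return script_source.replace("</", "<\\/")


original = 'document.write("<B>hi</B>");'
embedded = "<SCRIPT>" + protect_etago(original) + "</SCRIPT>"

print(embedded)
```

With the script rewritten this way, the only "</" left in the element is
the one that begins the real </SCRIPT> end-tag, but the embedded text no
longer looks like what the programmer would have typed on its own, which
is exactly the cost described above.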

-Arne

[1] SGML geeks: Yes, I know, there are actually 4 parts to every document,
when you count the SGML declaration, but I didn't think it relevant to this
topic.
[2] Remember, an ETAGO is generally "</".
[3] It's useful for things like overloading elements and aliasing; you can
define multiple chunks of your document as <![%foo[...]]>, and at the top of
your document, define the entity %foo to be either IGNORE, if you don't want
to use them, or INCLUDE, if you do.