Cougar DTD: Do not use CDATA declared content for SCRIPT

The 12-July-1996 draft of the "Cougar" HTML DTD [1] declares:

    <!ELEMENT SCRIPT - - CDATA -- script statements -->

This will not work.

In particular, the use of CDATA declared content is incompatible with
JavaScript (which, I presume, will be one of the primary scripting
languages used in HTML documents).

The main reason is that the arguments to JavaScript's
'document.write()' method [2], which inserts text and HTML markup
into a document, may themselves contain end-tags, e.g.:


    <SCRIPT>
        document.write("<H1>", "Foo", "</H1>")
    </SCRIPT>


Elements with CDATA declared content cannot contain any
sequence of characters that "looks like" an end-tag  --
ETAGO (</) followed by a letter -- since that will prematurely
terminate the element.  There is no way around this; it is
a fundamental problem with CDATA declared content.
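
To make the termination rule concrete, here is a rough sketch of
how a parser locates the end of CDATA declared content. The
function name is made up for illustration and is not taken from
any real SGML toolkit; it only implements the "ETAGO followed by
a letter" rule described above.

```javascript
// Hypothetical helper (illustration only): returns the index where
// CDATA declared content ends, i.e. the first "</" that is
// immediately followed by a letter, or -1 if there is none.
function findCdataContentEnd(text) {
    for (var i = 0; i + 2 < text.length; i++) {
        if (text.charAt(i) === "<" &&
            text.charAt(i + 1) === "/" &&
            /[A-Za-z]/.test(text.charAt(i + 2))) {
            return i;
        }
    }
    return -1;
}

// The content of the SCRIPT element from the example above,
// followed by its intended end-tag.
var content = 'document.write("<H1>", "Foo", "</H1>")\n</SCRIPT>';
var end = findCdataContentEnd(content);

// Everything before `end` is what the parser takes as the script.
var seenAsContent = content.substring(0, end);
// seenAsContent is: document.write("<H1>", "Foo", "
```

The scan stops at the "</H1>" inside the string literal, not at
"</SCRIPT>", so the script is cut off in the middle of its last
argument.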

Here are a few alternatives:

1) Use <!ELEMENT SCRIPT - - (#PCDATA)>, and require all occurrences
of '<', '&', and '>' in the content to be replaced with '&lt;',
'&amp;', and '&gt;'.  This is more consistent with the rest of HTML.

2) Use <!ELEMENT SCRIPT - - (#PCDATA)> and add browser support 
for CDATA marked sections:

    <SCRIPT><![ CDATA [
        document.write("<H1>", "Foo", "</H1>")
    ]]></SCRIPT>

This is the approach favored by most other SGML applications.

3) Allow scripts to be included by external reference:

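    <SCRIPT SRC="http://www.foo.com/myscript.js"></SCRIPT>

This approach adds a network request for each external script,
which may increase latency, but it has the advantage of better
backward-compatibility with SCRIPT-unaware user agents, since
there is no script text in the document for them to display.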
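
For alternatives 1 and 2, the transformation an author (or an
authoring tool) would apply to a script can be sketched as
follows. Both function names are hypothetical, and the check for
"]]>" reflects my assumption that a tool would rather refuse
such input than emit a marked section that closes early.

```javascript
// Alternative 1 (hypothetical helper): replace the markup
// characters with entity references. '&' is handled first so
// that the '&' in "&lt;" and "&gt;" is not escaped a second time.
function escapeForPcdata(script) {
    return script
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;");
}

// Alternative 2 (hypothetical helper): wrap the script in a CDATA
// marked section. The script must not contain "]]>", because that
// sequence would end the marked section prematurely.
function wrapInMarkedSection(script) {
    if (script.indexOf("]]>") !== -1) {
        throw new Error('script contains "]]>"');
    }
    return "<![ CDATA [\n" + script + "\n]]>";
}

var script = 'document.write("<H1>", "Foo", "</H1>")';
var escaped = escapeForPcdata(script);
// escaped is: document.write("&lt;H1&gt;", "Foo", "&lt;/H1&gt;")
var wrapped = wrapInMarkedSection(script);
```

Note that with alternative 1 the author has to escape every
script by hand or by tool, while with alternative 2 the script
text itself stays unchanged.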


 * * *


CDATA declared content is in general a bad idea.  It should
not be used for STYLE either, and IMO the XMP and LISTING
elements should be removed entirely.  Of all of SGML's
broken features, CDATA declared content is among the worst.
For more details, please refer to the relevant entries on
Robin Cover's SGML Web Page [3] under "Other Grammar/Parsing
Issues and FEATURES" [4].

Many of the issues brought up there are not particularly
relevant to the Web, though there are other problems with CDATA
declared content that make it especially dangerous for HTML.
[I've expounded on this before on html-wg, but because of the
current lack of a working archive I can't cite references :-(]

Two things come to mind.  First, the presence of *any* element
with CDATA or RCDATA declared content in the HTML DTD makes it
much more difficult to write a Web search engine: simple lexical
scanning, e.g., with tools like Dan Connolly's lexical
analyzer [5], is no longer enough, and it becomes necessary to
parse against the DTD.  Second, it greatly increases the amount
of SGML knowledge authors need in order to construct a valid
document that includes such elements.
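
To illustrate the first point, here is a minimal sketch of a
purely lexical tag scan. It is far cruder than the lexical
analyzer in [5] and is not based on it; the function name and
the sample document are made up for illustration.

```javascript
// Hypothetical scanner (illustration only): collects the name of
// everything that lexically looks like a start-tag or end-tag,
// with no knowledge of the DTD.
function scanTagNames(text) {
    var names = [];
    var pattern = /<\/?([A-Za-z][A-Za-z0-9.-]*)/g;
    var match;
    while ((match = pattern.exec(text)) !== null) {
        names.push(match[1].toUpperCase());
    }
    return names;
}

var doc = '<P>Intro</P><SCRIPT>document.write("<H1>")</SCRIPT>';
var names = scanTagNames(doc);
// names is: P, P, SCRIPT, H1, SCRIPT
```

If SCRIPT has CDATA declared content, the "<H1>" inside it is
data, not a start-tag, so the scan above reports markup that is
not there.  The scanner can only get this right if it knows
which elements the DTD declares as CDATA or RCDATA.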



[1] <URL: http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd >
[2] <URL: http://www.netscape.com/eng/mozilla/2.0/handbook/javascript/ref_t-z.html#write_method >
[3] <URL: http://www.sil.org/sgml/ >
[4] <URL: http://www.sil.org/sgml/topics.html#miscGrammar >
[5] <URL: http://www.w3.org/pub/WWW/TR/WD-sgml-lex >


--Joe English

  joe@art.com

Received on Tuesday, 23 July 1996 20:20:47 UTC