- From: Arne Knudson <ack@ebt.com>
- Date: Fri, 26 Jul 1996 17:14:05 -0400
- To: www-html@w3.org
>But I still don't see the nightmare. What is markup for except to tell
>the typesetter (browser, in this case) what to do with the content?

Aargh!

I'm guessing that this may have been said before, but it's critical to understand. HTML is an application of SGML, the Standard Generalized Markup Language, ISO 8879:1986. What you refer to as "markup" has **absolutely nothing** to do with the formatting of the document. Markup is a description of the *structure* of the document.

The original aspirations of SGML were to have 3 parts[1] to every document: the Document Type Definition (DTD), which uniquely and specifically defines the structure of a particular "type" of document; the document instance, which is marked up in accordance with the structure of the specific DTD; and the "stylesheet", which is the description of how to present information encoded based upon that DTD.

People argued over how to implement the third part, so ISO 8879 only covered the first two, and left the third until later. Just recently, the ISO passed a new standard, ISO 10179:1996 (<http://occam.sjf.novell.com:8080/dsssl/dsssl96/> or <ftp://ftp.ebt.com/pub/dsssl/>) describing DSSSL, the Document Style Semantics and Specification Language.

The advantage of a separate stylesheet is that the individual "browser" can choose how to present the semantic information in the document. My friend Chris, at home, can format his Mosaic stylesheets to present all his <H1> elements in green Helvetica 24 point centered bold text. And since I hate Helvetica, I'll choose to have my browser put my <H1>'s in 18-point Times instead. Since I also hate the <BLINK> tag (a most hideous construct if ever I saw one), I'll tell my browser to do absolutely nothing when it encounters one.
But my other (visually impaired) friend, Jane, can sit at her desk at work and have her voice synthesizer read all the <H1>'s louder than the <H2>'s, which are louder than the <H3>'s, which are about as loud as normal text (but at a slightly different inflection).

A separate stylesheet removes the onus of presentation from the author, allowing her to focus on the semantically significant aspects. And it allows the individual to customize their presentation in the manner they see fit.

[end semantic-vs-presentation rant]

Now I'll get into why there's a nightmare.

The SGML standard (which we're sticking to at the moment) states that a parser is responsible for starting an element whenever it encounters STAGO (a less-than bracket, "<"), followed by text. Most people know this construct as the start-tag, which looks like <foo>, but there are other allowable constructs that we won't get into here. Furthermore, the parser is responsible for terminating an element when it encounters an ETAGO (usually, a less-than followed by a slash, "</"). Most people see this in end-tags, such as </foo>, but again there are other allowable constructs; for this discussion, just think of an ETAGO as the first part of an end-tag.

Now, it's possible to encode sections of text such that they should not be parsed as SGML -- start-tags should be ignored, as should entities and other SGML-isms. Such sections of text are called CDATA elements ("Character DATA", as opposed to PCDATA, or "Parsed Character DATA", as opposed to... well, let's just say there's a list of 'em). But the standard specifically states that CDATA elements terminate upon the first occurrence of an ETAGO.[2] (This, I think, was to make it a little easier on the parsers; it means they don't have to worry about performing lookahead to figure out whether or not they really should terminate the element, particularly if they're using shorttag or other funky things that SGML lets you do.)

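To make the termination rule concrete, here is a minimal Python sketch of how a parser might find the end of a CDATA element. It is an illustration only, not a real SGML parser: the function name and the sample page are made up, and it assumes the ETAGO counts as a delimiter only when a letter follows the "</".

```python
import re

# Assumption: "</" is treated as an ETAGO only when a name-start
# character (here simplified to an ASCII letter) follows it.
ETAGO = re.compile(r"</(?=[A-Za-z])")

def cdata_element_content(text, start):
    """Return (content, end) for a CDATA element whose content begins at
    index `start`: everything up to the first ETAGO, whichever end-tag
    that ETAGO happens to belong to. Hypothetical helper."""
    match = ETAGO.search(text, start)
    end = match.start() if match else len(text)
    return text[start:end], end

# A script that writes an end-tag into the page.
page = '<SCRIPT>document.write("<B>hi</B>");</SCRIPT>'
content, end = cdata_element_content(page, len("<SCRIPT>"))
print(repr(content))   # 'document.write("<B>hi'
print(page[end:])      # </B>");</SCRIPT>
```

The element's content stops at the "</B>" inside the string literal, so the script is cut off mid-statement and the rest is handed back to the SGML parser as markup.
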
This means you can't have end-tags in CDATA elements, PERIOD. Unfortunately, a lot of people want to have scripts that dynamically generate HTML, and most HTML that I know of has at least a few end-tags.

There was a proposal a while back to use CDATA marked sections. Briefly, marked sections are denoted as in the following example:

    <![CDATA[this is <foo> text that &bar; should not get parsed]]>

They (usually) start with "<![", and end with "]]>"; inside the first bracket is a special reserved word denoting what kind of marked section it is (in this case, CDATA). Then, after the reserved word, is another square bracket signaling the start of the content of the marked section. In the above example, since it's declared as a CDATA marked section, a parser would parse it as:

    this is <foo> text that &bar; should not get parsed

Note that the <foo> and &bar; are parsed as raw text, and are considered neither elements nor entities. Other types of marked sections include "INCLUDE", meaning a parser should use them, and "IGNORE", which means that the parser should discard everything up to the close of the section[3].

This solves the problem of the end-tags, because the SGML parser ignores everything up to the "]]>". It introduces a whole new can of worms, though, because we're forcing the authors to put "<![CDATA[...]]>" inside of every SCRIPT element to protect the content of the script, which means the scripting parser will have to know to discard it. Furthermore, if the programmer should want to (God forbid) put a "]]>" into the script, they're just as screwed as they were back when they were trying to put end-tags into CDATA elements.

I think the only workable SGML-aware solution is the <SCRIPT SRC="..."> external-reference method. Mixing languages (and SGML *is* a language) is always dangerous, and you usually have to present the embedded language in some "special" way.
Take, for example, shell scripts; if you want to have the shell script execute a command as if it was typed into the shell, you have to do things like "escape" the backslashes and double-quotes and so on. The string you put into the shell script looks very little like what you'd type at the command line. Similarly, if you were to embed scripts inline in HTML, you're either going to have to do it through an external reference, or special-case anything that could get in the way of a parser.

-Arne

[1] SGML geeks: Yes, I know, there are actually 4 parts to every document, when you count the SGML declaration, but I didn't think it relevant to this topic.

[2] Remember, an ETAGO is generally "</".

[3] It's useful for things like overloading elements and aliasing; you can define multiple chunks of your document as <![%foo[...]]>, and at the top of your document, define the entity %foo to be either IGNORE, if you don't want to use them, or INCLUDE, if you do.

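As a rough illustration of the CDATA marked-section behaviour described in the message, the following Python sketch extracts the raw text between "<![CDATA[" and the first "]]>". The function name and the second sample string are hypothetical, and it handles only the simple, un-nested CDATA case; real SGML marked sections have more to them.

```python
def cdata_marked_section_content(text):
    """Return the raw text of the first CDATA marked section in `text`,
    or None if no complete section is found. Hypothetical helper that
    only looks for the literal delimiters."""
    open_delim, close_delim = "<![CDATA[", "]]>"
    start = text.find(open_delim)
    if start == -1:
        return None
    start += len(open_delim)
    # The section ends at the FIRST "]]>", wherever it appears.
    end = text.find(close_delim, start)
    if end == -1:
        return None
    return text[start:end]

# Tags and entity references inside the section come through untouched.
print(cdata_marked_section_content(
    "<![CDATA[this is <foo> text that &bar; should not get parsed]]>"))
# this is <foo> text that &bar; should not get parsed

# A script that happens to contain "]]>" closes the section early.
print(cdata_marked_section_content(
    '<![CDATA[if (a[b[0]]>c) write("</B>");]]>'))
# if (a[b[0
```

The second call shows the remaining problem: end-tags are now safe, but an innocent "]]>" in the script truncates it just as "</" did in a CDATA element.
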
Received on Friday, 26 July 1996 17:15:22 UTC