Re: The <PRE> tag

Dan Connolly (connolly@pixel.convex.com)
Tue, 01 Dec 92 11:41:34 CST


Message-Id: <9212011741.AA22962@pixel.convex.com>
To: timbl@nxoc01.cern.ch
Cc: jim@wilbur.njit.edu (Jim Whitescarver), www-talk@nxoc01.cern.ch
Subject: Re: The <PRE> tag 
In-Reply-To: Your message of "Tue, 01 Dec 92 16:41:19 +0100."
             <9212011541.AA01877@www3.cern.ch> 
Date: Tue, 01 Dec 92 11:41:34 CST
From: Dan Connolly <connolly@pixel.convex.com>


>Perhaps we should have all three.

>I understand that these are the only SGML-conformant combinations.  Is
>this too much of a mess?

I think so. The processing should be broken into two parts: SGML parsing,
and application processing. The significance of newlines is an application
issue: the SGML parser never throws out newlines in data (it does throw
out newlines between tags and in some other places that I don't fully
understand).

These are the choices for SGML parsing:

	CDATA		all characters treated as data.
			Terminated by </A where A is any letter.
	RCDATA		characters and entities only. &entity; recognized.
			Terminated by </A as above
	mixed content	tags and #PCDATA.
			Tags, entity references, comments, etc. recognized.
			The pattern of tags and data is regulated
			by the element declaration.
	element content	tags only. Pattern of tags is regulated.
	ANY		like mixed content, but tags aren't regulated

CDATA is simplest to process, but you can't do things like

	char* any_string;
	printf("<XMP>%s</XMP>", any_string);

because any_string might contain </A, and you're screwed.
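
A generator that wants to stay with CDATA would have to check each string before emitting it. Here is a minimal sketch of such a check, assuming the termination rule given above ("</" followed by any letter). The name cdata_safe is made up for illustration; it is not part of any existing library.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s can be written verbatim inside a
   CDATA element such as XMP, 0 if it contains "</" followed by a letter
   (which the parser would take as the end of the element). */
int cdata_safe(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* s[i + 1] is readable because s[i] is not the terminator;
           s[i + 2] is only read when s[i + 1] is '/'. */
        if (s[i] == '<' && s[i + 1] == '/' &&
            isalpha((unsigned char)s[i + 2]))
            return 0;
    }
    return 1;
}
```

A string that fails the check has no escape available in CDATA, which is the whole problem.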

RCDATA is capable of the above construct, but at a cost:

	char* any_string;
	char* rcdata = HTML_replace_specials(any_string);
	printf("<XMP>%s</XMP>", rcdata);
	free(rcdata);

where HTML_replace_specials changes '<' to &lt; (to prevent </A), '>' to &gt;
(to prevent ]]>, the marked-section close delimiter. Ugh!), and
'&' to &amp; (to prevent &xxx from being mistaken for an entity reference).
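
For concreteness, HTML_replace_specials might look something like the sketch below. Only the name and the three replacements come from the description above; the signature, the malloc-based ownership (caller frees), and the NULL return on allocation failure are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch only: returns a newly allocated copy of s with '<', '>' and '&'
   replaced by &lt;, &gt; and &amp;. The caller frees the result.
   Returns NULL if allocation fails (an assumed convention). */
char *HTML_replace_specials(const char *s)
{
    size_t len = 0;
    const char *p;
    char *out, *q;

    /* First pass: compute the length of the escaped string. */
    for (p = s; *p != '\0'; p++) {
        switch (*p) {
        case '<': case '>': len += 4; break;   /* "&lt;" or "&gt;" */
        case '&':           len += 5; break;   /* "&amp;" */
        default:            len += 1; break;
        }
    }

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, substituting entity references. */
    q = out;
    for (p = s; *p != '\0'; p++) {
        switch (*p) {
        case '<': memcpy(q, "&lt;", 4);  q += 4; break;
        case '>': memcpy(q, "&gt;", 4);  q += 4; break;
        case '&': memcpy(q, "&amp;", 5); q += 5; break;
        default:  *q++ = *p;             break;
        }
    }
    *q = '\0';
    return out;
}
```

The two passes and the extra allocation are the cost referred to above.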

But if you're going to go to that trouble, you might as well
use mixed content. That's why I changed my mind about using RCDATA
for XMP and LISTING elements.

My current DTD only uses element content (for the HTML document element*),
CDATA (for XMP and LISTING) and mixed content (for everything else).

As to your suggestions...

><XMP>		newlines significant		no anchors	CDATA

This is already supported, except that most implementations don't
quite parse CDATA correctly. The "newlines significant" isn't a
parsing issue. It's an issue of how the application processes character
data. Let's call the mode of application processing where the
characters are written to the screen as-is, rather than typeset
into paragraphs, TYPEWRITER mode. We'll call the default TYPESET mode.
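
To make the difference concrete: TYPEWRITER processing passes character data through untouched, while TYPESET processing at the very least stops treating newlines as special. The sketch below is a rough approximation of that one aspect, assuming TYPESET collapses every run of whitespace (newlines included) into a single space. Real typesetting also fills and breaks lines, which is not shown, and the name typeset_collapse is invented for illustration.

```c
#include <ctype.h>

/* Rough TYPESET-style approximation: collapse each run of whitespace,
   including newlines, into one space. Works in place; the result is
   never longer than the input. TYPEWRITER mode would skip this step
   and display the data exactly as received. */
void typeset_collapse(char *s)
{
    char *out = s;
    const char *p;
    int in_space = 0;

    for (p = s; *p != '\0'; p++) {
        if (isspace((unsigned char)*p)) {
            if (!in_space)
                *out++ = ' ';   /* first whitespace of a run */
            in_space = 1;
        } else {
            *out++ = *p;
            in_space = 0;
        }
    }
    *out = '\0';
}
```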

><PRE>		newlines significant		no anchors	RCDATA

The only implementation of the PRE tag that I know of looks more like:

<PRE>		newlines significant		anchors		PCDATA

It's actually pretty clean: you use
mixed content SGML parsing, and TYPEWRITER application processing.
So I changed the name to TYPEWRITER, for no good reason, really.

The newlines significant/no anchors/RCDATA is what I suggested for
XMP and LISTING, so they could contain any string. But since current
implementations don't process entities in these elements, it's
not worth it.

><FIXED>		newlines not significant 	anchors		PCDATA

This introduces a third application mode besides TYPEWRITER and TYPESET:
it's kinda like TYPEWRITER, but you toss the newlines, and start a new line
at every <P> tag. I don't like it.

>Tony, can you make a similar patch for <fixed> as above for Midas?

You could, but it doesn't fit neatly into the current architecture.
Tony wrote one widget to do TYPESET processing (SGMLCompoundText)
and one to do TYPEWRITER processing (SGMLPlainText). The FIXED
element calls for a new widget, or a modification of SGMLPlainText
to ignore newlines in some cases. (You can't just use the SGMLCompoundText
with a fixed-width font, because it compresses whitespace.)

>I have put Dan's new spec (which contains <typewriter> -- what's going on,  
>Dan?!) in the web at  
>http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/MarkUp.html with a link from
>the current spec.

Thanks.

>  The DTD was not in the tar file, so Dan's previous one is  
>linked in. This includes all Dan's test HTML.

Ack! I think the DTD is pretty important. I'll get the new
one there ASAP. I highly suggest that _all_ data providers grab the DTD
and the sgmls parser and try validating samples of the data they're
serving up. It's the quickest and surest way to check for compliance.
I need to write a section for data providers in the spec.

>I would like to include <HEADER> and <BODY> tags too.

* I wrestled with this at great length to come up with a DTD
that lends _some_ structure to HTML without clashing badly with
existing data.

The document element declaration is:
<!ELEMENT HTML O O  ((TITLE? & NEXTID? & ISINDEX?), BODY)>

The O O means the HTML start and end tags can be omitted.
They'll be inferred by the parser. Since there's no #PCDATA
in the content model, it has element content, so that
whitespace between tags is thrown out.

The TITLE, NEXTID, and ISINDEX can come in any order, and
they are optional, but they can appear at most once,
and they have to be before the BODY.
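
Stated as code, the constraint on the children of HTML comes out roughly as below. This is only a sketch of what the content model accepts, not how an SGML parser works. It assumes the parser has already inferred any omitted tags and folded element names to upper case, and the function name html_children_ok is hypothetical.

```c
#include <string.h>

/* Sketch: returns 1 if the n child element names of HTML satisfy
   ((TITLE? & NEXTID? & ISINDEX?), BODY), else 0.
   Assumes names are already upper-cased and omitted tags inferred. */
int html_children_ok(const char *const *names, int n)
{
    int seen_title = 0, seen_nextid = 0, seen_isindex = 0;
    int i;

    /* BODY is required and must be the last child. */
    if (n < 1 || strcmp(names[n - 1], "BODY") != 0)
        return 0;

    /* Everything before BODY: any order, each at most once. */
    for (i = 0; i < n - 1; i++) {
        int *seen;
        if (strcmp(names[i], "TITLE") == 0)        seen = &seen_title;
        else if (strcmp(names[i], "NEXTID") == 0)  seen = &seen_nextid;
        else if (strcmp(names[i], "ISINDEX") == 0) seen = &seen_isindex;
        else return 0;          /* unknown element or an early BODY */
        if (*seen)
            return 0;           /* appeared twice */
        *seen = 1;
    }
    return 1;
}
```

So ISINDEX, TITLE, BODY passes; TITLE, TITLE, BODY and BODY, TITLE do not.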

I made the <BODY> tags minimizable so current
HTML is legal. I couldn't seem to work in a HEADER
element the same way.

Dan