Message-Id: <9212011741.AA22962@pixel.convex.com>
To: timbl@nxoc01.cern.ch
Cc: jim@wilbur.njit.edu (Jim Whitescarver), www-talk@nxoc01.cern.ch
Subject: Re: The <PRE> tag
In-Reply-To: Your message of "Tue, 01 Dec 92 16:41:19 +0100."
             <9212011541.AA01877@www3.cern.ch>
Date: Tue, 01 Dec 92 11:41:34 CST
From: Dan Connolly <connolly@pixel.convex.com>

>Perhaps we should have all three.
>I understand that these are the only SGML-conformant combinations. Is
>this too much of a mess?

I think so. The processing should be broken into two parts: SGML
parsing, and application processing. The significance of newlines is
an application issue: the SGML parser never throws out newlines in
data (it does throw out newlines between tags and in some other
places that I don't fully understand).

These are the choices for SGML parsing:

    CDATA           all characters treated as data. Terminated by
                    </A where A is any letter.
    RCDATA          characters and entities only. &entity; recognized.
                    Terminated by </A as above.
    mixed content   tags and #PCDATA. Tags, entity references,
                    comments, etc. recognized. The pattern of tags and
                    data is regulated by the element declaration.
    element content tags only. Pattern of tags is regulated.
    ANY             like mixed content, but tags aren't regulated.

CDATA is simplest to process, but you can't do things like

    char* any_string;
    printf("<XMP>%s</XMP>", any_string);

because any_string might contain </A, and you're screwed.

RCDATA is capable of the above construct, but at a cost:

    char* any_string;
    char* rcdata = HTML_replace_specials(any_string);
    printf("<XMP>%s</XMP>", rcdata);
    free(rcdata);

where HTML_replace_specials changes '<' to &lt; (to prevent </A),
'>' to &gt; (to prevent ]]>, the marked-section close delimiter.
Ugh!), and '&' to &amp; (to prevent &xxx from being mistaken for an
entity reference).

But if you're going to go to that trouble, you might as well use
mixed content. That's why I changed my mind about using RCDATA for
XMP and LISTING elements.
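For concreteness, here is a rough sketch of what a function like
HTML_replace_specials might look like. The message only names the
function and describes its effect, so the signature, the two-pass
approach, and the exact entity names (&lt;, &gt;, &amp;) are
assumptions, not the code of any existing library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of HTML_replace_specials (signature assumed):
 * returns a newly malloc'd copy of s with '<', '>' and '&' replaced
 * by entity references. The caller frees the result.
 * Returns NULL if the allocation fails. */
char *HTML_replace_specials(const char *s)
{
    size_t len = 0;
    const char *p;
    char *out, *q;

    assert(s != NULL);

    /* First pass: compute the length of the escaped string. */
    for (p = s; *p != '\0'; p++) {
        switch (*p) {
        case '<': case '>': len += 4; break;   /* "&lt;" or "&gt;" */
        case '&':           len += 5; break;   /* "&amp;" */
        default:            len += 1; break;
        }
    }

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, expanding the three special characters. */
    q = out;
    for (p = s; *p != '\0'; p++) {
        switch (*p) {
        case '<': memcpy(q, "&lt;", 4);  q += 4; break;
        case '>': memcpy(q, "&gt;", 4);  q += 4; break;
        case '&': memcpy(q, "&amp;", 5); q += 5; break;
        default:  *q++ = *p;             break;
        }
    }
    *q = '\0';
    return out;
}
```

With something like this, a string containing </XMP> can be wrapped
in the RCDATA construct shown earlier without ending the element
early, at the price of an extra allocation and copy for every string.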
My current DTD only uses element content (for the HTML document
element*), CDATA (for XMP and LISTING) and mixed content (for
everything else).

As to your suggestions...

><XMP>    newlines significant       no anchors   CDATA

This is already supported, except that most implementations don't
quite parse CDATA correctly. The "newlines significant" isn't a
parsing issue. It's an issue of how the application processes
character data. Let's call the mode of application processing where
the characters are written to the screen as-is, rather than typeset
into paragraphs, TYPEWRITER mode. We'll call the default TYPESET mode.

><PRE>    newlines significant       no anchors   RCDATA

The only implementation of the PRE tag that I know of looks more like:

    <PRE>    newlines significant    anchors      PCDATA

It's actually pretty clean: you use mixed content SGML parsing, and
TYPEWRITER application processing. So I changed the name to
TYPEWRITER, for no good reason, really.

The newlines significant/no anchors/RCDATA combination is what I
suggested for XMP and LISTING, so they could contain any string. But
since current implementations don't process entities in these
elements, it's not worth it.

><FIXED>  newlines not significant   anchors      PCDATA

This introduces a third application mode besides TYPEWRITER and
TYPESET: it's kinda like TYPEWRITER, but you toss the newlines, and
start a new line at every <P> tag. I don't like it.

>Tony, can you make a similar patch for <fixed> as above for Midas?

You could, but it doesn't fit neatly into the current architecture.
Tony wrote one widget to do TYPESET processing (SGMLCompoundText) and
one to do TYPEWRITER processing (SGMLPlainText). The FIXED element
calls for a new widget, or a modification of SGMLPlainText to ignore
newlines in some cases. (You can't just use the SGMLCompoundText with
a fixed-width font, because it compresses whitespace.)

>I have put Dan's new spec (which contains <typewriter> -- what's going on,
>Dan?!)
>in the web at
>http://info.cern.ch/hypertext/WWW/MarkUp/Connolly/MarkUp.html with a link from
>the current spec.

Thanks.

> The DTD was not in the tar file, so Dan's previous one is
>linked in. This includes all Dan's test HTML.

Ack! I think the DTD is pretty important. I'll get the new one there
ASAP.

I highly suggest that _all_ data providers grab the DTD and the sgmls
parser and try validating samples of the data they're serving up.
It's the quickest and surest way to check for compliance. I need to
write a section for data providers in the spec.

>I would like to include <HEADER> and <BODY> tags too.

* I wrestled with this at great length to come up with a DTD that
lends _some_ structure to HTML without clashing badly with existing
data. The document element declaration is:

    <!ELEMENT HTML O O ((TITLE? & NEXTID? & ISINDEX?), BODY)>

The O O means the HTML start and end tags can be omitted. They'll be
inferred by the parser. Since there's no #PCDATA in the content
model, it has element content, so whitespace between tags is thrown
out. The TITLE, NEXTID, and ISINDEX can come in any order, and they
are optional, but they can appear at most once, and they have to come
before the BODY.

I made the <BODY> tags minimizable so current HTML is legal. I
couldn't seem to work in a HEADER element the same way.

Dan