Message-Id: <199510110359.XAA27861@beach.w3.org>
To: firstname.lastname@example.org (Joe Wells)
Cc: email@example.com
Subject: Re: assorted HTML and SGML questions
In-Reply-To: Your message of "Tue, 10 Oct 1995 22:40:50 EDT."
             <TueOct1022:30:firstname.lastname@example.org>
Date: Tue, 10 Oct 1995 23:59:30 -0400
From: "Daniel W. Connolly" <email@example.com>

In message <TueOct1022:30:firstname.lastname@example.org>, Joe Wells writes:
>
>I've got some questions that can probably be answered by an expert without
>even thinking but which I haven't been able to find the answers to in my
>WWW browsing.

No fair! I spent at least 6 months figuring this stuff out way back
when! You want me to give you the answers just like that? :-)

Actually, these are pretty good questions. It's a shame that the
answers aren't easier to find. The comp.text.sgml newsgroup is a good
place to take questions like this -- that's where I found my answers.

I'm working on a tech report that bites off a chunk of these questions
and explains them:

    "A Lexical Analyzer for HTML and Basic SGML"
    http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html

Now to your questions...

>Q: (("text/html" Internet Media Type)) Does text/html forbid including the
>   SGML declaration (<!SGML ...>)?

My feeling is "yes, this is prohibited," but now that I think about
it, I suppose there's no spec I can point to that says so.

The reason that the HTML 2.0 spec didn't say straight out "Thou shalt
use the SGML declaration in appendix XXX" is the work on
internationalization. I18N involves some changes to the SGML
declaration.

An HTML user agent should know _exactly_ what SGML declaration is in
effect once it's parsed and processed the media type and parameters
(i.e. the Content-Type: header field). If you see no parameters, then
the charset defaults to ISO-8859-1, and the SGML decl in the back of
the HTML 2.0 spec is in effect. If you see charset=XXX, you have to
see the recent html-i18n drafts.
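
A minimal sketch of that defaulting rule in Python, assuming the
Content-Type value has already been extracted from the headers. The
function name html_charset is made up for illustration; nothing here
comes from the HTML 2.0 spec itself, and mapping a non-default charset
to an SGML declaration is left out entirely.

```python
from email.message import Message


def html_charset(content_type: str) -> str:
    """Return the charset in effect for a text/html Content-Type value.

    Hypothetical helper: with no charset parameter it falls back to
    ISO-8859-1, the default described in the message above.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/html":
        raise ValueError("not a text/html media type: %r" % content_type)
    charset = msg.get_param("charset")
    if isinstance(charset, tuple):
        # RFC 2231 encoded parameter: (charset, language, value)
        charset = charset[2]
    return charset.lower() if charset else "iso-8859-1"
```

A caller would use the result to decide which SGML declaration applies:
"iso-8859-1" means the one in the back of the HTML 2.0 spec, anything
else means consulting the html-i18n drafts.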

I suppose the spec could state this more clearly now that some of the
I18N questions have been answered. But it was the best info I had at
the time.

>   Does it require that a PUBLIC external
>   identifier (i.e. PUBLIC "-//IETF//DTD HTML Level 2//EN") be included in
>   the DTD, if the DTD is included?

Yes. The relevant section of the HTML 2.0 spec is:

    "HTML Public Text Identifiers"
    http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_3.html#SEC16

|To identify information as an HTML document conforming to this
|specification, each document must start with one of the following
|document type declarations.
|
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict Level 1//EN">

Those are your choices, if you want to conform exactly to the spec.

>   Does it forbid including a DTD
>   subset?

Yes. See above.

>Q: ((SGML Marked Sections)) The syntax for marked sections is not clear to
>   me. I would like to know precisely how to determine when the end of a
>   marked section has been reached.

Look for "]]>". I think nested marked sections work as expected
(i.e. they nest, NOT like C comments).

>   I've seen two grammars for this, one
>   from TEI (which is clearly wrong and disagrees with what "sgmls" does)
>   and one based on the standard which merely says the content of the
>   marked section is "SGML characters" (which is not helpful). What is
>   the precise syntax for marked sections?

I'm afraid it's not specified in machine-readable form. It's only
specified in the prose of the SGML spec. You could get Goldfarb's
handbook and see if that makes more sense.

>   (Pointers to *net* resources
>   are greatly preferred to paper resources. Pointers to source code
>   should be to well-commented and clear source code; I've already tried
>   to figure this out by reading the source of sgmls.)

I'm afraid the source code to sgmls or sp is the best source of info
on this stuff that I've ever seen on the net. sgmls is the result of
James Clark doing a bunch of release-engineering on some old code.
Many folks don't find it very readable. SP is a from-scratch re-write.
I'd expect it to be more readable, but I haven't looked closely.

You might also try the YASP source code. All of these are available in
the ftp.ifi.uio.no archive, maintained by Erik Naggum.

>Q: ((HTML 3.0 with HTML.Recommended vs. Legacy Documents)) In HTML 3.0
>   with HTML.Recommended enabled in the DTD, it is illegal to put text
>   directly inside an LI element, like this:
>
>     <LI>Here are some words.</LI>
>
>   This is legal:
>
>     <LI><P>Here are some words.</P></LI>

I think this is a mis-feature in the recent HTML 3 dtd. Dave Raggett
likes it as is.

>   Is the first fragment supposed to be interpreted (rendered) like the
>   second one by an HTML browser?

I suppose so. But we need a rigorous specification of it somewhere. I
don't like this "required error-handling" stuff in the HTML 3 DTD.

>   I've noticed that many browsers
>   (e.g. Netscape 1.1N) treat them very differently. Netscape in
>   particular renders the second, legal version in a truly horrible
>   fashion.

Hmmm... it seems like a bug to me too. The HTML 2.0 spec has some
verbiage in the conformance clause to the effect that a user agent
isn't allowed to act differently on documents that have the same parse
tree. e.g. comments aren't allowed to have any effect. Nor are </li>
or </p> tags.

But by the HTML 2.0 DTD, those two fragments produce _different_ parse
trees. The first is an LI element with some data characters in it, and
the second is an LI containing a P element which contains data
characters. So the HTML 2.0 spec doesn't require that a user agent
treat them the same. (Good taste might dictate that they be treated
the same, however.)

According to HTML 3.0 in strict mode, the first example is illegal.
I think it's Dave's intent that the first example be treated like the
second, but I can't point to the spec that says so.

>Q: ((HTML 3.0 TEXTAREA vs. Inclusion Exceptions)) [...]
>   using the proposed DTD the following HTML is valid:
>
>     <FORM ACTION="http://dev.null.dom">
>     <P>
>     <MATH>
>     <TEXTAREA NAME="foo" ROWS=1 COLS=1>
>     <SPOT ID="bar">
>     <BOX>
>     yyy<SUP>
>     zzz
>     </SUP>
>     </BOX>
>     </TEXTAREA>
>     </MATH>
>     </P>
>     </FORM>

I don't think that's supposed to be valid. It's a bug in the DTD if it
is, I'd say. You might send mail to <email@example.com> or
<firstname.lastname@example.org> or both -- Dave maintains the DTD, but
Mark Gaither maintains the validation service, and he's helping us
track and deal with public feedback.

>Q: ((SGML Unclosed Start and End Tags)) Under what circumstances are
>   unclosed start and end tags allowed?

http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_foot.html#FOOT6

|The SGML declaration for HTML specifies SHORTTAG YES, which means that
|there are other valid syntaxes for tags, such as NET tags, `<EM/.../';
|empty start tags, `<>'; and empty end-tags, `</>'. Until support for
|these idioms is widely deployed, their use is strongly discouraged.

In other words, "don't do that."

Basically, the fact that short-cuts for attributes are lumped in with
short-cuts for tags is a mis-feature of SGML that's going to be fixed.
See:

    gopher://menja.ifi.uio.no:70/00/pub/SGML/ISO8879-rev/N1605.txt

>Q: ((HTML 3.0 Dummy Elements)) In HTML 3.0, what is the purpose of having
>   the BODYTEXT and FIGTEXT elements at all?

To avoid the "mixed content" snafu. This is an SGML FAQ. I'll let Joe
English or somebody else fill in the details here.

>Q: ((My SGML Confusion)) What is "SDATA"?

Err... "system data" or something. Who knows, really.

>Q: ((SGML vs. Carriage Returns)) The documentation for the program "sgmls"
>   says that it does this:
>
>     1. each carriage return character is turned into a
>        non-SGML character;
>
>     2. each newline character is turned into a record end
>        character, and at the same time a record start
>        character is inserted at the beginning of each
>        line;
>
>   Is this part of the standard?

Another SGML FAQ. This is a consequence of section 7.6.1 "Record
Boundaries" in the SGML spec. My term-of-endearment is
"phase-of-the-moon processing of newlines." Have a look at the
comp.text.sgml archives and the html-wg archives.

>   Is this an appropriate thing to do for
>   unix compatibility because the convention on unix is that lines are not
>   started by anything and are ended by newlines?

Basically, yes.

>Q: ((SGML Grammar Confusion)) The grammar of SGML that I have seen says
>   one alternative for an "attribute value" is "character data". This
>   seems very open-ended and unspecified. What does this mean?

Well, an attribute whose value is declared CDATA can actually have any
string of characters as its value. The value can be represented in an
attribute specification in one of two ways:

(1) if the value is a name token, you can just stick it in there
    as-is:

        <a href=xyz.html>

(2) otherwise, stick quotes around it and make it an attribute value
    literal. And to work around the rules for interpreting attribute
    value literals (see 7.9.3 "Attribute Value Specification"),
    represent the quotes as numeric character references or entity
    references. And be sure to suitably escape whitespace too.

Some examples:

    C literal    SGML attribute value literal
    ""           ""
    "abc"        "abc"
    "a'b"        "a'b"
    "a\"b"       "a&#34;b"   -- depending on doc. charset.
    "\n"         "&#10;"     -- depending on C locale
    "a   b"      "a&SPACE;&SPACE;&SPACE;b"
    "a\tb"       "a&TAB;b"

Blech... this is really tedious, isn't it? I'm tired. Good night.

Dan