Re: assorted HTML and SGML questions

In message <TueOct1022:30:201995jbw@cs.bu.edu>, Joe Wells writes:
>
>I've got some questions that can probably be answered by an expert without
>even thinking but which I haven't been able to find the answers to in my
>WWW browsing.

No fair! I spent at least 6 months figuring this stuff out way back
when! You want me to give you the answers just like that? :-)

Actually, these are pretty good questions. It's a shame that the
answers aren't easier to find. The comp.text.sgml newsgroup is a good
place to take questions like this -- that's where I found my answers.

I'm working on a tech report that bites off a chunk of these questions
and explains them:

"A Lexical Analyzer for HTML and Basic SGML"
http://www.w3.org/pub/WWW/MarkUp/SGML/sgml-lex/sgml-lex.html

Now to your questions...


>Q: (("text/html" Internet Media Type)) Does text/html forbid including the
>   SGML declaration (<!SGML ...>)?

My feeling is "yes, this is prohibited," but now that I think about
it, I suppose there's no spec I can point to that says so.

The reason that the HTML 2.0 spec didn't say straight out "Thou shalt
use the SGML declaration in appendix XXX" is the work on
internationalization. I18N involves some changes to the SGML
declaration.

An HTML user agent should know _exactly_ what SGML declaration is in
effect once it's parsed and processed the media type and parameters
(i.e. the Content-Type: header field). If you see no parameters, then
the charset defaults to ISO-8859-1, and the SGML decl in the back of
the HTML 2.0 spec is in effect. If you see charset=XXX, you have to
see the recent html-i18n drafts.
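
As a rough sketch of that decision (not anything from the spec), here
is how a user agent might pull the charset out of a Content-Type value
in Python. The function name html_charset and the (charset, explicit)
return shape are made up for illustration; only the ISO-8859-1 default
comes from the text above.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # default stated above for text/html


def html_charset(content_type):
    """Return (charset, explicit) for a Content-Type header value.

    explicit is False when no charset parameter was given, i.e. the
    case where the SGML declaration from the HTML 2.0 spec applies.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/html":
        raise ValueError("not text/html: %r" % content_type)
    charset = msg.get_param("charset")
    if charset is None:
        return DEFAULT_CHARSET, False
    # Charset names are case-insensitive, so normalize for comparison.
    return str(charset).lower(), True
```

With explicit set to True, the caller still has to consult the
html-i18n drafts to learn which declaration goes with that charset.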

I suppose it could be specified more clearly now that some of the I18N
questions have been answered. But it was the best info I had at the
time.

>   Does it require that a PUBLIC external
>   identifier (i.e. PUBLIC "-//IETF//DTD HTML Level 2//EN") be included in
>   the DTD, if the DTD is included?

Yes.  The relevant section of the HTML 2.0 spec is:

"HTML Public Text Identifiers"
http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_3.html#SEC16

|To identify information as an HTML document conforming to this
|specification, each document must start with one of the following
|document type declarations.
|
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Level 1//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict//EN">
|<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0 Strict Level 1//EN">

Those are your choices, if you want to conform exactly to the spec.
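
A quick way to check a document against that list, sketched in Python.
The helper name html2_public_id is hypothetical, and the regular
expression is a simplification: it assumes the declaration is the very
first thing in the document and uses a double-quoted literal, so it is
not a real SGML prolog parser.

```python
import re

# The five public identifiers quoted from the HTML 2.0 spec above.
HTML2_PUBLIC_IDS = {
    "-//IETF//DTD HTML 2.0//EN",
    "-//IETF//DTD HTML 2.0 Level 2//EN",
    "-//IETF//DTD HTML 2.0 Level 1//EN",
    "-//IETF//DTD HTML 2.0 Strict//EN",
    "-//IETF//DTD HTML 2.0 Strict Level 1//EN",
}

# Matches only the bare form: no system identifier, no "[ ... ]" subset.
_DOCTYPE = re.compile(
    r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]*)"\s*>', re.IGNORECASE
)


def html2_public_id(document):
    """Return the public identifier if the document starts with one of
    the listed declarations, else None."""
    match = _DOCTYPE.match(document)
    if match and match.group(1) in HTML2_PUBLIC_IDS:
        return match.group(1)
    return None
```

Because the pattern requires ">" right after the public identifier, a
declaration carrying a DTD subset comes back as None too.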

>   Does it forbid including a DTD
>   subset?

Yes. See above.


>Q: ((SGML Marked Sections)) The syntax for marked sections is not clear to
>   me.  I would like to know precisely how to determine when the end of a
>   marked section has been reached.

Look for "]]>". I think nested marked sections work as expected
(i.e. NOT like C comments).
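
Here is a minimal Python sketch of that nesting reading. It is a plain
string scan, not a parser: it does not look at the status keywords and
does not know about comments or literals. The function name
marked_section_end and the nested flag are made up for illustration.
As far as I understand it, CDATA and RCDATA marked sections do not
nest and end at the first "]]>"; nested=False models that case.

```python
MS_OPEN = "<!["
MS_CLOSE = "]]>"


def marked_section_end(text, start, nested=True):
    """Return the index just past the "]]>" that closes the marked
    section opening at text[start], or -1 if it is never closed."""
    if not text.startswith(MS_OPEN, start):
        raise ValueError("no marked section starts at index %d" % start)
    depth = 1
    i = start + len(MS_OPEN)
    while i < len(text):
        if nested and text.startswith(MS_OPEN, i):
            depth += 1  # an inner section must be closed first
            i += len(MS_OPEN)
        elif text.startswith(MS_CLOSE, i):
            depth -= 1
            i += len(MS_CLOSE)
            if depth == 0:
                return i
        else:
            i += 1
    return -1
```

For "a<![ INCLUDE [ x <![ IGNORE [ y ]]> z ]]> tail" with start=1, the
nested scan stops just before " tail", while nested=False stops after
the first "]]>", the way a C comment scanner would.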

>   I've seen two grammars for this, one
>   from TEI (which is clearly wrong and disagrees with what "sgmls" does)
>   and one based on the standard which merely says the content of the
>   marked section is "SGML characters" (which is not helpful).  What is
>   the precise syntax for marked sections?

I'm afraid it's not specified in machine-readable form. It's only
specified in the prose of the SGML spec. You could get Goldfarb's
handbook and see if that makes more sense.

>   (Pointers to *net* resources
>   are greatly preferred to paper resources.  Pointers to source code
>   should be to well-commented and clear source code; I've already tried
>   to figure this out by reading the source of sgmls.)

I'm afraid the source code to sgmls or sp is the best source of
info on this stuff that I've ever seen on the net. sgmls is the
result of James Clark doing a bunch of release-engineering on
some old code. Many folks don't find it very readable. SP is a
from-scratch re-write. I'd expect it to be more readable, but I
haven't looked closely. You might also try the YASP source code.
All of these are available in the ftp.ifi.uio.no archive, maintained
by Erik Naggum.


>Q: ((HTML 3.0 with HTML.Recommended vs. Legacy Documents)) In HTML 3.0
>   with HTML.Recommended enabled in the DTD, it is illegal to put text
>   directly inside an LI element, like this:
>
>     <LI>Here are some words.</LI>
>
>   This is legal:
>
>     <LI><P>Here are some words.</P></LI>

I think this is a mis-feature in the recent HTML 3 DTD. Dave Raggett
likes it as is.


>   Is the first fragment supposed to be interpreted (rendered) like the
>   second one by an HTML browser?

I suppose so. But we need a rigorous specification of it somewhere. I
don't like this "required error-handling" stuff in the HTML 3 DTD.

>   I've noticed that many browsers
>   (e.g. Netscape 1.1N) treat them very differently.  Netscape in
>   particular renders the second, legal version in a truly horrible
>   fashion.

Hmmm... it seems like a bug to me too.

The HTML 2.0 spec has some verbiage in the conformance clause to the
effect that a user agent isn't allowed to act differently on documents
that have the same parse tree. For example, comments aren't allowed to
have any effect, and neither are </li> or </p> tags.

But by the HTML 2.0 DTD, those two fragments produce _different_ parse
trees. The first is an LI element with some data characters in it, and
the second is an LI containing a P element which contains data
characters. So the HTML 2.0 spec doesn't require that a user agent
treat them the same. (Good taste might dictate that they be treated
the same, however.)
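
To make the difference concrete, here is a small Python sketch that
builds a nested (tag, children) structure for each fragment. It leans
on the standard html.parser tokenizer, which is not an SGML parser and
knows nothing about the HTML 2.0 DTD, so take it as an illustration of
"different trees" only. TreeBuilder and parse_tree are names invented
for this example.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Collect start tags, end tags and data into nested tuples."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Only close the innermost open element; no tag inference here.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def parse_tree(fragment):
    builder = TreeBuilder()
    builder.feed(fragment)
    builder.close()
    return builder.root[1]
```

parse_tree("<LI>Here are some words.</LI>") gives an li node holding
the text directly, while the <LI><P>...</P></LI> version gives an li
node holding a p node, so the two trees do not compare equal.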

According to HTML 3.0 in strict mode, the first example is illegal.  I
think it's Dave's intent that it be treated like the second example,
but I can't point to the spec that says so.


>Q: ((HTML 3.0 TEXTAREA vs. Inclusion Exceptions))
[...]
>   using the proposed DTD the following HTML is valid:
>
>     <FORM ACTION="http://dev.null.dom">
>       <P>
>         <MATH>
>           <TEXTAREA NAME="foo" ROWS=1 COLS=1>
>             <SPOT ID="bar">
>             <BOX>
>               yyy<SUP>
>                    zzz
>                  </SUP>
>             </BOX>
>           </TEXTAREA>
>         </MATH>
>       </P>
>     </FORM>

I don't think that's supposed to be valid. It's a bug in the DTD if it
is, I'd say. You might send mail to <dsr@w3.org> or
<markg@halsoft.com> or both -- Dave maintains the DTD, but Mark
Gaither maintains the validation service, and he's helping us track
and deal with public feedback.

>Q: ((SGML Unclosed Start and End Tags)) Under what circumstances are
>   unclosed start and end tags allowed?

http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec_foot.html#FOOT6

|The SGML declaration for HTML specifies SHORTTAG YES, which means that
|there are other valid syntaxes for tags, such as NET tags, `<EM/.../';
|empty start tags, `<>'; and empty end-tags, `</>'. Until support for
|these idioms is widely deployed, their use is strongly discouraged.

In other words, "don't do that."

Basically, the fact that short-cuts for attributes are lumped in with
short-cuts for tags is a mis-feature of SGML that's going to be fixed.

See:

gopher://menja.ifi.uio.no:70/00/pub/SGML/ISO8879-rev/N1605.txt


>Q: ((HTML 3.0 Dummy Elements)) In HTML 3.0, what is the purpose of having
>   the BODYTEXT and FIGTEXT elements at all?

To avoid the "mixed content" snafu. This is an SGML FAQ. I'll let Joe
English or somebody else fill in the details here.

>Q: ((My SGML Confusion)) What is "SDATA"?

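Err... "specific character data", I believe: entity text that is
handed through as system-specific data rather than parsed. Beyond
that, who knows, really.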


>Q: ((SGML vs. Carriage Returns)) The documentation for the program "sgmls"
>   says that it does this:
>   
>       1.     each carriage return character  is  turned  into  a
>              non-SGML character;
>
>       2.     each  newline character is turned into a record end
>              character, and at the  same  time  a  record  start
>              character  is  inserted  at  the  beginning of each
>              line;
>
>   Is this part of the standard?

Another SGML FAQ. This is a consequence of section 7.6.1 "Record
Boundaries" in the SGML spec. My term of endearment for it is
"phase-of-the-moon processing of newlines." Have a look at the
comp.text.sgml archives and the html-wg archives.

>   Is this an appropriate thing to do for
>   unix compatibility because the convention on unix is that lines are not
>   started by anything and are ended by newlines?

Basically, yes.
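
The two steps quoted from the sgmls documentation can be sketched in
Python like this. It is my reading of that description, not code taken
from sgmls; the function name to_records and the "<RS>", "<RE>" and
"<NONSGML>" marker strings are placeholders chosen so the result is
easy to read.

```python
RS = "<RS>"            # record start marker (placeholder string)
RE = "<RE>"            # record end marker (placeholder string)
NONSGML = "<NONSGML>"  # stands in for "a non-SGML character"


def to_records(text):
    """Map Unix-style text to a list of characters and record markers."""
    out = []
    at_line_start = True
    for ch in text:
        if at_line_start:
            out.append(RS)  # step 2: RS inserted at the start of a line
            at_line_start = False
        if ch == "\n":
            out.append(RE)  # step 2: newline becomes RE
            at_line_start = True
        elif ch == "\r":
            out.append(NONSGML)  # step 1: CR becomes a non-SGML char
        else:
            out.append(ch)
    return out
```

So "ab\ncd" comes out as RS, a, b, RE, RS, c, d. What the parser then
does with those RS and RE characters is the part governed by 7.6.1.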


>Q: ((SGML Grammar Confusion)) The grammar of SGML that I have seen says
>   one alternative for an "attribute value" is "character data".  This
>   seems very open-ended and unspecified.  What does this mean?

Well, an attribute whose value is declared CDATA can actually have
any string of characters as its value. The value can be represented
in an attribute specification in one of two ways:

(1) if the value is a name token, you can just stick it in there as-is:

	<a href=xyz.html>

(2) otherwise, stick quotes around it and make it an attribute value
literal.  And to work around the rules for interpreting attribute
value literals (see 7.9.3 "Attribute Value Specification"), represent
the quotes as numeric character references or entity references. And
be sure to suitably escape whitespace too.
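
Rule (2) can be sketched as a small Python helper. The name
sgml_attr_literal is made up, and the character numbers it emits
assume an ASCII-compatible document character set. It always uses
numeric character references and escapes every "&", which is more
cautious than strictly necessary.

```python
def sgml_attr_literal(value):
    """Quote a string as a double-quoted SGML attribute value literal."""
    out = []
    for ch in value:
        if ch in '"&' or ch in " \t\n\r":
            # Numeric character reference; assumes the document
            # character set agrees with ASCII for these characters.
            out.append("&#%d;" % ord(ch))
        else:
            out.append(ch)
    return '"' + "".join(out) + '"'
```

Single quotes pass through untouched because the literal is delimited
with double quotes, so a'b needs no escaping.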

Some examples:

	C literal		SGML attribute value literal
	""			""
	"abc"			"abc"
	"a'b"			"a'b"
	"a\"b"			"a&#34;b"     -- depending on doc. charset.
	"\n"			"&#10;"       -- depending on C locale
	"a   b"			"a&#SPACE;&#SPACE;&#SPACE;b"
	"a\tb"			"a&#TAB;b"


Blech... this is really tedious, isn't it? I'm tired. Good night.

Dan

Received on Tuesday, 10 October 1995 23:59:35 UTC