RE: Tag Soup vs Generalized Markup (was: I-D ACTION..) from Arjun Ray on 1999-10-08 (www-html@w3.org from October 1999)

From: Arjun Ray <aray@q2.net>
Date: Fri, 8 Oct 1999 18:50:04 -0400 (EDT)
To: www-html@w3.org
Message-ID: <Pine.LNX.3.95.991008172507.29982V-100000@mail.q2.net>

On Fri, 8 Oct 1999, Larry Masinter wrote:

>    In addition to the development of standards, a wide variety of
>    additional extensions, restrictions, and modifications to HTML were
>    popularized by the competitive implementations of Netscape
>    Navigator and Microsoft Internet Explorer and documented in various
>    books and online guides.

Maybe only "historical" names are best?  So, replace

>    popularized by the competitive implementations of Netscape
>    Navigator and Microsoft Internet Explorer 

with

:    popularized by competitive implementations derived mainly from
:    the Mosaic browser of the NCSA [add reference]
 
> I would be happy to include a reference to a "Tag Soup spec" if
> I could find one that would be suitable for a references list
> in an RFC. 

Actually, I was suggesting that a Tag Soup spec be written for the I-D to
point to.

> I'm uneasy about recommending one popular HTML book over another,

I agree.  I doubt a suitable book could be found, because all such books
are about "how to use HTML" rather than the dry details that go into a
spec.

> and can't find any stable reference to something that would constitute
> an "official guide to Mozilla and/or MSIE HTML tags". 

Jukka seems to have covered that ground reasonably well, so let me
elaborate on what I meant by a Tag Soup spec.  It's probably best
understood in terms of how Mosaic's parser used to work, and the "normal"
way in which common word processing software is used: keep on typing and,
where needed, smack a function key or toolbar button to insert a "command
code".  (The "Reveal Codes" feature of WordPerfect is perhaps canonical in
this respect.)

Basically, an HTML document is treated as a flat stream of text punctuated
by "marks"; each "mark" involves a collection of toggles and/or counters
aimed at a "global processing state".  By design, these primitives should
be orthogonal, but they may interact in ad hoc ways; even so, the idea is
to avoid as far as possible ever having to "stack".  In any case, each
individual HTML tag should be independently treated as a macro expanding
to these commands.

For instance, the header tags could all be treated as affecting a global
value of "font size", with a default re-established upon "cancellation"
via an end-tag.  Any such end-tag, in fact, so that something like this
should work swimmingly well: 

    <h2>Hello <h3>World!</h1>

(Read: change to font size h2, print "Hello ", break a line and a half and
change to font size h3, print "World!", break two lines and cancel font
changes to reset font default) 

or something like this if bold and italics can be independently varied:

    <b>bold stuff<i>and italicized</b>just italics</i>

The fact that this represents a fundamental misunderstanding of SGML
syntax is irrelevant.  The outward form of the borrowed syntax is being
mapped to a different mental model.  As Eric Bina was known to say: "This
is not Rocket Science". [I nominate this for an epigrammatic quote should
the spec elect to have one.]
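
To make the toggle model concrete, here is a minimal Python sketch of a
processor that keeps one global state and never stacks.  The names
(render, HEADING_SIZES, DEFAULT_SIZE) and the numeric sizes are
assumptions made up for illustration, not anything taken from Mosaic's
source, and the line breaks around headings are left out for brevity.

```python
import re

# Hypothetical macro table: heading tags only set a global "font size".
HEADING_SIZES = {"h1": 6, "h2": 5, "h3": 4, "h4": 3, "h5": 2, "h6": 1}
DEFAULT_SIZE = 3  # assumed default size, restored by any heading end-tag

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def render(source):
    """Return a list of (text, size, bold, italic) runs; nothing is stacked."""
    state = {"size": DEFAULT_SIZE, "bold": False, "italic": False}
    runs = []
    pos = 0

    def emit(text):
        if text:
            runs.append((text, state["size"], state["bold"], state["italic"]))

    for m in TAG_RE.finditer(source):
        emit(source[pos:m.start()])
        pos = m.end()
        closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name in HEADING_SIZES:
            # Any heading end-tag cancels, whichever start-tag came before.
            state["size"] = DEFAULT_SIZE if closing else HEADING_SIZES[name]
        elif name == "b":
            state["bold"] = not closing      # independent toggle
        elif name == "i":
            state["italic"] = not closing    # independent toggle
        # Every other tag is ignored in this sketch.
    emit(source[pos:])
    return runs
```

Fed the two examples above, render() yields "Hello " at the h2 size and
"World!" at the h3 size, and for the second one a bold run, a
bold-and-italic run, and an italic-only run, with no complaint about the
overlapping tags.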

The mental model, in turn, is oriented towards a bunch of styling
primitives.  So, we would need a taxonomy of the various "marks", perhaps
in alphabetical order for easy reference, and with notes on potential
-ahem- "interactions".  For example, UL is really just some geek's idea of
obfuscating the plain English word "indent"; most of the time LI (more
obfuscation for "smack-bullet") is found after it, so the section on LI
should mention that it's advisable to always indent bullets.  Another
example would be how DD ("wide indent") is customarily cancelled by /DL.
And so on.
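
A rough Python sketch of how a few of these "marks" might interact, under
the reading given above (UL as "indent", LI as "smack-bullet", DD as
"wide indent" cancelled by /DL).  The function name layout, the indent
widths, and the "* " bullet are hypothetical choices for illustration;
/LI and /DT are treated as no-ops.

```python
import re

TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

INDENT_STEP = 2          # assumed width of one UL "indent"
WIDE_INDENT = 4          # assumed width of the DD "wide indent"
NO_OPS = {"/li", "/dt"}  # end-tags that change nothing


def layout(source):
    """Return output lines for a fragment using UL, LI, DT, DD marks."""
    indent = 0      # counter: bumped by UL, dropped by /UL
    wide = 0        # set by DD, cleared only by /DL
    bullet = False  # set by LI, consumed by the next piece of text
    lines = []

    def flush(text):
        nonlocal bullet
        text = text.strip()
        if text:
            prefix = " " * (indent + wide) + ("* " if bullet else "")
            lines.append(prefix + text)
            bullet = False

    pos = 0
    for m in TAG_RE.finditer(source):
        flush(source[pos:m.start()])
        pos = m.end()
        mark = m.group(1) + m.group(2).lower()
        if mark in NO_OPS:
            continue
        if mark == "ul":
            indent += INDENT_STEP
        elif mark == "/ul":
            indent = max(0, indent - INDENT_STEP)
        elif mark == "li":
            bullet = True
        elif mark == "dd":
            wide = WIDE_INDENT
        elif mark == "/dl":
            wide = 0
        # All other marks (DL, DT, ...) do nothing in this sketch.
    flush(source[pos:])
    return lines
```

Note that layout("<li>bare") produces an un-indented bullet, which is the
sort of interaction the LI section would have to warn about.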

If all this is making anyone uneasy, let it be noted that the source code
for Mosaic was always available (at least the X version), so what was
going on was no secret, and yet there were few if any complaints (even on
the www-talk mailing list).  There are situations where silence implies
approval. 

So, the Tag Soup spec consists of three parts, which could be factored
into separate documents.

1. A Lexical Specification

This deals directly with tag syntax.  Dan Connolly's paper on sgml-lex is
an excellent model, were all the SGML references removed. 

  - no need to explain selected="selected" or ismap="ismap", and no
    need to mention that <h1 center> doesn't work.
  - quoting attribute values can be made "functional": needed only to
    prevent misparsing of whitespace or '>'.
  - "Comment tags" Made Easy.
  - <!junk decl> considered legal.
  - no need to have stuff "forbidden by this report" (PIs, Marked
    Sections, etc.)

and so on.
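
The points above could be sketched as a small Python tokenizer.  This is
an illustration under assumptions, not Mosaic's lexer and not sgml-lex:
the names (tokenize, parse_attrs) are invented here, a comment is assumed
to run from "<!--" to the first "-->", and any other "<!...>" is accepted
and dropped.

```python
import re

COMMENT = r"<!--.*?-->"   # "comment tag": everything up to the first -->
DECL = r"<![^>]*>"        # <!junk decl>: legal, and skipped
# A tag: optional slash, a name, then attribute text in which quoted
# strings may hide whitespace or '>'.
TAG = r"""<(/?)([A-Za-z][A-Za-z0-9]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>"""
TOKEN_RE = re.compile("|".join([COMMENT, DECL, TAG]), re.DOTALL)

# name, optionally followed by = and a quoted or bare value
ATTR_RE = re.compile(r"""([^\s=>]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]*))?""")


def parse_attrs(text):
    """Map attribute names to values; a bare name such as 'center' maps to None."""
    attrs = {}
    for m in ATTR_RE.finditer(text):
        value = m.group(2)
        if value and len(value) >= 2 and value[0] in "\"'" and value[0] == value[-1]:
            value = value[1:-1]  # quotes are functional, so strip them
        attrs[m.group(1).lower()] = value
    return attrs


def tokenize(source):
    """Yield ("text", s), ("start", name, attrs) and ("end", name) tokens."""
    pos = 0
    for m in TOKEN_RE.finditer(source):
        if m.start() > pos:
            yield ("text", source[pos:m.start()])
        pos = m.end()
        name = m.group(2)
        if name is None:
            continue  # comment or <!junk decl>: silently dropped
        if m.group(1):
            yield ("end", name.lower())
        else:
            yield ("start", name.lower(), parse_attrs(m.group(3)))
    if pos < len(source):
        yield ("text", source[pos:])
```

With this, <h1 center> is simply a start-tag with a valueless attribute,
and href=x.html needs no quotes, while title="a > b" does.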

2. An Interaction Specification

  - common combinations of marks, e.g. UL + LI and DD + /DL
  - known "no-ops" such as /LI and /DT

This is where on-line guides and the like could prove useful.

3. A Semantic Specification

The 4.01 spec with all SGML removed: just a listing of names and intended
meanings.

The new I-D could point to a covering document pointing to these three
parts, and thus avoid the need to provide references directly.


Arjun
Received on Friday, 8 October 1999 18:06:27 UTC