RE: Tag Soup (was: FW: XHTML)

On Fri, 3 Dec 1999, Dave Raggett wrote:
> On Fri, 3 Dec 1999, Jelks Cabaniss wrote:
> > Arjun Ray wrote:
> > 
> > > The tragedy is that a formal spec for Tag Soup was never written.
> > 
> > Especially since it's going to be around for a long, long time.

Indeed.  The trouble seems to be that the truth underlying this prognosis
is an ugly one, and the ugliness prompts a reluctance to acknowledge it.

> > Even if UAs next year onward reject any and all malformed documents
> > declaring themselves with XHTML DOCTYPEs and namespaces, if they
> > can't also grok Tag Soup, who in the General Public would want to 
> > use them?  

Yep.  Gresham's Law.

> > Content is what is important, and the GP cares less if the content
> > is encrypted in Tag Soup.
> > 
> > So UAs with real XML/SGML parsers will still need a TAGSOUP.DLL
> > for the foreseeable future ...
> 
> My work on HTML Tidy was motivated by an attempt to deal with this
> by providing an Open Source solution for converting Tag Soup
> documents into something easier to process.

Respectfully, this misses the basic point.  Tidy is a wonderful program,
and no doubt useful - to those who care.  The issue is why many if not
most will not care, and the reason is that Tag Soup is *not* inherently
difficult to process - the program has merely to reflect the thought
process underlying the use.  

A Tag Soup renderer - flowing text according to a set of global flags
modified in "stream" fashion - is actually *easy* to write.  See a tag, do
something; no tag, no "action".  That's how and why </P> is supposed to
make a difference, as witness this non-justification of a non-problem:

 http://lists.w3.org/Archives/Public/www-style/1998May/0101.html
   
In fact, this is precisely what Mosaic did, and precisely why Mosaic
seemed so "robust".  It was too *stupid* to get into trouble.  Midas and
Viola would regularly crash on stuff that Mosaic "handled" with aplomb,
because they tried to do intelligent, often context sensitive things with
markup.  (Like collapsible lists - but what use was that when UL "meant"
indent and LI "meant" plunk-a-bullet?)  Mosaic's "innovation" was to
*reduce* potentially powerful markup to a small set of lo-tech, readily
apprehensible and "predictable" formatting primitives - skip a line,
indent/cancel, bold/ital/cancel, font size change/cancel, etc.  That was
why Andreessen and Bina tossed the libWWW design (which called for a
separate stylesheet driven rendering widget) in favour of their libhtmlw
"HTML widget" - a renderer that took *tags* directly as "commands".  As
long as each tag in isolation expanded macro-like to zero or more of the
(relatively orthogonal) behavioral toggles "supported" by the widget, it
didn't matter what dog's breakfast of a mishmash you fed it, it would
simply and stolidly "do what it was told".  That's the genesis of "HTML Of
The Month" nominations like this:

  <p><br><br><br><p><p><br><br><p>

But the point to appreciate is the *thought process* of authors doing
stuff like this - what they *expected* and were gratified to see "work":
the concept is "skip-a-line", so it doesn't matter what one calls it, if
it takes voodoo incantations like <p> and <br>, so be it.  The Mosaic
paradigm was to support the thought process faithfully.
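
To make the "see a tag, do something" model concrete, here is a minimal
sketch in Python.  It is an illustration only, not Mosaic's code: the
command table, the flag names and the render() function are all
hypothetical, and whether something like </P> should get an entry is
exactly the sort of thing a real spec would have to list.

```python
import re

# One token is either a tag (optional '/', a name, anything up to '>')
# or a run of text.  A stray '<' that starts no tag is silently skipped.
TOKEN = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

# Hypothetical command table: each tag, taken in isolation, maps to one
# primitive.  These entries are assumptions, not recorded browser behavior.
COMMANDS = {
    "p": "skip", "br": "skip",
    "b": ("bold", True), "/b": ("bold", False),
    "i": ("italic", True), "/i": ("italic", False),
}

def render(soup):
    """Flow text through a set of global flags, one token at a time."""
    flags = {"bold": False, "italic": False}
    out = []
    for tag, text in TOKEN.findall(soup):
        if text:
            if text.strip():
                out.append(("text", text.strip(),
                            flags["bold"], flags["italic"]))
            continue
        action = COMMANDS.get(tag.lower())
        if action is None:
            continue                  # unknown tag: no "action"
        if action == "skip":
            out.append(("skip",))     # skip-a-line, whatever it was called
        else:
            flag, value = action
            flags[flag] = value       # toggle a global flag
    return out

# Nothing to balance, so nothing to choke on:
print(render("<p><br><br><br><p><p><br><br><p>"))
print(render("<b>bold <i>both</b> italic only</i> plain"))
```

There is no tree and no nesting check, so "overlapping" tags are not an
error case at all - each one just flips its flag as it streams past.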

It was also no surprise that "a lot of tags seem to do the same thing" -
UL OL and DD all got you nice indents - and I'll hazard the guess that the
reason why Netscape invented <FONT> and <CENTER> but not <INDENT> is that
one of their Bright Sparks must have said "They have more than one way to
do that already: This is not Rocket Science!"  And, indeed, a few seasons
later, this would be why Netscape Composer *generated* <DD> in response to
a request for an indent.

Dismissing Tag Soup with a snort does not diminish the fact that it can
be internally consistent.  The syntax is entirely secondary.  In fact,
that's why Javascript had document.write() from the beginning - to write a
stream of commands - tags! - back into the renderer.  More's the pity that
the tags have to be between '<' and '>' - someone might think that SGML
was involved.

> One problem in writing a formal spec for tag soup is that there
> are significant differences between Navigator and IE. Microsoft's
> reverse engineering team got it close, but not close enough. 

True.  They also made the mistake of trying to rationalize what at root
was just beer-and-pizza coding (that is, they found more method to the
madness than there really was).

However, the basic features of the spec are not difficult to set down.

  http://lists.w3.org/Archives/Public/www-html/1999Oct/0053.html

The "meta-spec" would appeal to a stream-based processing model, where
aspects of a global processing state (margins, font size, color, etc.) are
impacted by commands embedded in flowable text.  These commands are
syntactically distinguished from data by the marks '<' and '>'.  (Didn't
TimBL once write a "rant" about this, chiding the "markup person"?)  The
actual Tag Soup spec could then list expected behaviors.  For historical
reasons, it becomes necessary to contend with the fact that a lot of these
commands are utterly inscrutable - UL, /DL, DT, LI, whatnot - when it
might have been simpler just to have <FONT>, <SKIP>, <INDENT> and so on,
but them's the breaks.
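
As a rough sketch of what "listing expected behaviors" could look like,
the spec can be written as a table in which each inscrutable tag expands
to zero or more primitives acting on the global state.  Everything here
is hypothetical (the primitive names, the SPEC entries, the flow()
function); the entries are illustrative guesses, not observed Navigator
or IE behavior.

```python
import re

TOKEN = re.compile(r"<(/?[A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")

# Primitives: each one mutates the global state or emits output.
def indent(state, out):
    state["indent"] += 1

def outdent(state, out):
    # Clamp at zero so a stray closing tag cannot push the margin negative.
    state["indent"] = max(0, state["indent"] - 1)

def skip(state, out):
    out.append("")

def bullet(state, out):
    state["bullet"] = True

# The "spec": tag -> list of primitives.  Note how several tags turn out
# to be spellings of the same command.  (Assumed entries, for illustration.)
SPEC = {
    "ul": [indent], "/ul": [outdent],
    "ol": [indent], "/ol": [outdent],
    "dd": [indent], "/dd": [outdent],
    "li": [bullet],
    "p": [skip], "br": [skip],
}

def flow(soup):
    """Return output lines after streaming soup through the SPEC table."""
    state = {"indent": 0, "bullet": False}
    out = []
    for tag, text in TOKEN.findall(soup):
        if not text:
            for primitive in SPEC.get(tag.lower(), []):  # unknown: nothing
                primitive(state, out)
        elif text.strip():
            prefix = "* " if state["bullet"] else ""
            state["bullet"] = False
            out.append("  " * state["indent"] + prefix + text.strip())
    return out

print(flow("<ul><li>one<li>two</ul>done"))
print(flow("<dd>just an indent</dd>"))
```

Under these assumptions <DD> and <UL> are indistinguishable as far as
the margin goes, which is the "more than one way to do that already"
observation in table form.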

Differences between the Tweedles would, admittedly, pose a "political"
problem.  For instance, <FONT><TABLE>...</TABLE></FONT>.  Arguably, IE's
treatment (to "honor" the font-spec - or is that Navigator?) is the more
"logical" one, in terms of how the Tag Soup *mindset* would expect things
to work. 

> In any event, few people have expressed a common need for such a spec.

On the contrary, a spec that *meaningfully* captures the behavior of the
popular wowsers is precisely what plenty of people are calling for.  The
SGML formalism simply does not fit that bill.

> Discussions over time in the various HTML working groups have tended
> to be prescriptive, focussing on how people should write rather than
> what they do write in practice. Browser implementers are required to
> take a more pragmatic view though, and the existing specs are just
> the tip of the iceberg.

The existing specs are supremely irrelevant.  We knew that a long time
ago.

  http://www.nyct.net/~aray/htmlwg/stds.html
  http://www.nyct.net/~aray/htmlwg/rcs.html


Arjun

Received on Sunday, 5 December 1999 06:40:14 UTC