Re: SGML and HTML from Jukka K. Korpela on 2003-09-29 (www-validator@w3.org from September 2003)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Mon, 29 Sep 2003 19:17:52 +0300 (EEST)
To: olafBuddenhagen@web.de
Cc: www-validator@w3.org, antrik@gmx.net
Message-ID: <Pine.GSO.4.58.0309291851350.2571@korppi.cs.tut.fi>
On Mon, 29 Sep 2003 olafBuddenhagen@web.de wrote:

> I'll say as much as: "Correct" by common sense...

I vaguely remember someone having once said that "common sense" is the
thing that tells us that the earth is flat. When used to define
correctness, common sense is hardly more than either everyone's personal
opinion or one's idea of what the majority thinks. Such things have little
to do in SGML, but in HTML they have had some success.

[ regarding how much SGML should be implemented: ]
> I was thinking of: As much as possible without having to bloat the
> browser considerably. Net mode is OK, empty tags are acceptable. Missing
> start tags are bad, but probably can be worked around without really
> implementing them. (Which every browser does more or less anyways...)
> Other SGML features would be too complicated.

To be honest, I don't think you should consider writing a browser if you
cannot parse SGML. Parsing is among the trivial things, even though
people who write browsers have managed to do it wrong. There are existing,
general-purpose SGML parsers around.

And "missing" (i.e., implied) start tags _are_ part of common tag soup
HTML, and common browsers deal with them - without much above their
average for bugs.

> > However, I don't think there is much that can be implemented without
> > breaking anything and after all, you cannot develop a conforming HTML
> > 4.01 user agent and still support XHTML 1.0, so W3C probably does not
> > want to see conforming HTML 4.01 user agents in the wild.
>
> Why not, am I missing something?...

You're probably missing the fact that W3C recommendations are, to some
extent, mutually contradictory and even self-contradictory, and, besides,
they contain some statements that are not really meant to be taken at face
value.

And of course you can write a user agent that processes both HTML 4.01 and
XHTML 1.0. To make it formally correct, you need to program it to make an
informed decision on the parsing mode. This is different from the fact
that a _document_ cannot conform to both specifications.
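
As a rough illustration of such an "informed decision", a user agent
could pick the parsing mode from the media type the document is served
with. This is only a sketch under that assumption; the function name
choose_parsing_mode and the mode labels "xml" and "html" are made up
for the example, and a real user agent might weigh other evidence too.

```python
def choose_parsing_mode(content_type: str) -> str:
    """Return "xml" or "html" for a Content-Type header value.

    Hypothetical helper: the decision rule (media type only) is an
    assumption made for illustration.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()

    # XHTML served with an XML media type goes to the XML parser.
    xml_types = {"application/xhtml+xml", "application/xml", "text/xml"}
    if media_type in xml_types or media_type.endswith("+xml"):
        return "xml"

    # Everything else, including text/html, goes to the HTML parser.
    return "html"


if __name__ == "__main__":
    for header in ("text/html; charset=ISO-8859-1",
                   "application/xhtml+xml"):
        print(header, "->", choose_parsing_mode(header))
```

The decision is made per document, before parsing starts, so the same
program can hold both parsers without ever mixing their rules.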

But it's hardly _useful_ to parse HTML correctly, because virtually
everyone does it wrong. This is one of the reasons why validators are of
limited usefulness and often just a waste of time or worse.

> > >And is it OK to report constructs that are handled incorrectly by
> > >most browsers as "errors"?
> >
> > It is not ok to report something as an error that is no error. Call it
> > a warning.
>
> Actually, as I have stated already, we can be pretty sure in most
> situations it *is* an error on the author's side.

Call it a mistake. People can do something right when they try to do
something wrong but make a mistake.

But instead of splitting hairs, consider the point that even though we can
usually _guess_ right in the sense that the author did not really _mean_
what he wrote, we might still not guess right what he _meant_.

What a validator shall do is to report either validity or reportable
markup errors. Nothing less, nothing more.

What a checker might do is to report any markup errors, possibly with
notes that say that some of them don't really matter, and to issue
miscellaneous notes about constructs that are known to be not supported by
major browsers. The use of "advanced" SGML features would surely be among
them, but in practice such use (even by accident) is rare, and other
features are far more important. At present, a checker should warn about
<object> and <q> and <acronym>, for example.
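
The "miscellaneous notes" part of such a checker could be sketched as
follows, using Python's standard html.parser module. This is not a
validator and it does no DTD-based checking; it merely records where
elements from a hand-picked list occur. The list, the class name
SupportNotes and the function support_notes are assumptions made for
the example.

```python
from html.parser import HTMLParser

# Elements to warn about; taken from the three examples named above.
POORLY_SUPPORTED = {"object", "q", "acronym"}


class SupportNotes(HTMLParser):
    """Collect (line, element) notes for poorly supported elements."""

    def __init__(self):
        super().__init__()
        self.notes = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes the element name in lower case.
        if tag in POORLY_SUPPORTED:
            line, _column = self.getpos()
            self.notes.append((line, tag))


def support_notes(markup):
    """Return a list of (line number, element name) notes."""
    checker = SupportNotes()
    checker.feed(markup)
    checker.close()
    return checker.notes


if __name__ == "__main__":
    sample = "<p><q>hi</q></p>\n<object data='x'></object>"
    for line, tag in support_notes(sample):
        print(f"line {line}: <{tag}> is poorly supported by major browsers")
```

Such notes are deliberately kept apart from markup errors: they say
nothing about validity, only about what browsers are known to do.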

What a browser should do depends on its goal. If a document complies with
HTML 4.01, I see no reason why it should not be interpreted according to
that specification, and rendered in a manner corresponding to that, even
if we know that virtually all other browsers choke on it.

> In this case, how can the claim be upheld that XML is a compatible
> subset of SGML, and can be parsed by any SGML processor?...

Whether you call XML a compatible subset of SGML depends on how seriously
you take the retrofitting operation that "legalizes" XML rules in the SGML
framework.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Monday, 29 September 2003 12:17:55 UTC