Re: SGML and HTML from Jukka K. Korpela on 2003-09-04 (www-validator@w3.org from September 2003)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 4 Sep 2003 19:36:51 +0300 (EEST)
To: www-validator@w3.org
Cc: olafBuddenhagen@web.de
Message-ID: <Pine.GSO.4.50.0309041920370.11495-100000@korppi.cs.tut.fi>

On Thu, 4 Sep 2003 olafBuddenhagen@web.de wrote:

> Recently I realized that some constructs I considered a syntax error,
> are actually allowed in SGML.

Such things come as surprises to the great majority of Web authors.
I think this is the practical reason behind the idea of turning the
validator into a heuristic checker.

> While examining this further, it turned
> out SGML allows many things that would be considered extremely broken
> HTML by common sense;

Well, they are just dangerous to browsers.

> and the HTML standard while *recommending* usage
> only of a few widely accepted SGML constructs, doesn't really seem to
> forbid anything allowed in SGML...

Well, the HTML 4.01 "standard" claims that HTML is an application of SGML,
but extremely few browsers have even tried to implement HTML that way.
Maybe it would be better if the HTML "standard" simply defined HTML
as an SGML application _without certain SGML features_ that can be dropped
out when an SGML application is defined. It's a mystery to me why those
features were included in the first place. Perhaps in a spirit of
optimism, or ignorance (of what they really mean). Or perhaps just to let
the pedants/realists note in discussions that (almost) _no_ browser
implements even HTML 2.0.
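
To illustrate the kind of features in question: the SGML declaration of
HTML 4.01 enables SHORTTAG, so shorthand like the following should be
syntactically valid under the HTML 4.01 Strict DTD, even though section
B.3.7 of the specification warns that such constructs are unlikely to work
with existing HTML tools. The document below is a made-up sketch for
illustration, not something from the original discussion:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<title/SHORTTAG demo/
<p>Some <em/emphasized/ text.
<p<em>An unclosed start tag</em> opened this paragraph.
<p>An empty end tag closes <strong>the most recent open element</>.
```

Here <title/.../ and <em/.../ are NET (null end tag) forms, <p<em> is an
unclosed start tag, and </> is an empty end tag. A typical browser would
be expected to show stray slashes and angle brackets instead of parsing
these as SGML does.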

Should the "standard" be changed, then? Well, they just did. XHTML makes
the verbose canonical syntax a must. So it's somewhat strange that we are
now _also_ witnessing a change where a purported validator is changed to
process _documents declared as HTML 4.01_ according to "practical rules".

> Is this really the right conclusion: Is everything that's legal in
> SGML/accepted by the validator automatically valid HTML,

By definition, yes. Unless you wish to assign confusing meanings to
"valid" in this context, but then everyone and his dog can assign a new
meaning to "valid".

The word "valid", and its relatives, was probably poorly chosen for use
with SGML. It means nothing but 'syntactically correct (to the extent that
the syntax has been defined in a formalized manner)'.

> even if this
> allows creating perfectly valid HTML documents that no existing browser
> will handle correctly?

Software errors, no matter how common, do not change the meanings of SGML
terms.

> Another interesting issue is error recovery. The validator for example
> seems to immediately stop tag parsing and continue in content parsing
> when encountering an illegal character, but skips malformed attribute
> values or invalid declarations.

That depends on how the validator has been coded. It could be improved in
many ways, but intelligent error recovery is really difficult.

> Is there some kind of specification or
> recommendation for such behaviour, or is the browser free to handle
> syntax errors as it likes?

The HTML 4.01 specification imposes no requirements on error handling.
If you ask me, that's a mistake, but that's the HTML tradition.
XML has different rules, but we'll see how seriously they will be taken.
I don't think browsers even try to play by XML rules for XHTML documents.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Thursday, 4 September 2003 12:36:54 UTC