RE: Doctype detection from Dave J Woolley on 2000-07-26 (www-html@w3.org from July 2000)

From: Dave J Woolley <DJW@bts.co.uk>
Date: Wed, 26 Jul 2000 19:52:16 +0100
To: www-html@w3.org
Message-ID: <81E4A2BC03CED111845100104B62AFB5824887@stagecoach.bts.co.uk>
> From:	Jan Roland Eriksson [SMTP:jrexon@newsguy.com]
> 
>  "The HTML 2.0 specification ([RFC1866]) observes that many
>   HTML 2.0 user agents assume that a document that does not
>   begin with a document type declaration refers to the
>   HTML 2.0 specification. As experience shows that this is a
>   poor assumption, the current specification does not recommend
>   this behavior."
> 
	[DJW:]  That's just a statement of the de facto
	situation that very few documents have valid doctypes;
	many have none, and many from a year or two ago have one
	that equates to HTML 2.0 but are authored to something
	like HTML 4 Transitional.

	I even looked at the web site of one of the editors of
	a recent W3C document, and that of their employers++.  The
	latter had an incorrectly capitalised HTML 4.0 (Strict)
	doctype, but was actually authored in invalid HTML 4.0 
	Transitional.

	One of the former had the doctype after the head section,
	and another failed to honour its (XHTML) doctype.

> And there's no problem what so ever to design an excellent stylesheet
> suggestion, using contextual selectors, for a strict HTML2 doc.
	[DJW:]  
	As presentation is outside the scope of HTML 2 and
	LINK is open ended, I have no qualms about adding an
	external style sheet to HTML 2 documents!

> Don't use "doctype-sniffing" for the wrong purpose, doing that
> will only create a new set of problems that we need to discuss
> again some years from now.
> 
[DJW:]  I can't think of anywhere where a conforming
HTML 4 parser would mis-parse a conforming HTML 2 document,
although I can think of one case (radio buttons) where there
could be a significant semantic difference between it and
an HTML 3.2 document.  Especially given that popular authoring
tools mislabelled HTML 4 as HTML 2, I'd think it naive to
expect content to be correctly labelled when this mattered, 
or browsers to care about backward compatible behaviour.

The problem comes with non-conforming documents authored for
HTML 2 etc.; things like comment syntax have been enforced more
strictly in later versions (Lynx has two different broken 
comment parsing modes!) and tag soup structures make less 
sense.  However, I think that that is a commercial issue
for browser writers (who encouraged the problems in the first
place).

It would probably be much better for a browser to use heuristics
to detect the need for "tag soup" parsing and broken comment
rules, either after detecting an error on a strict parse, or,
in spite of a good parse, because of, for example, multi-line
comments containing apparent tags, entities preceding = signs
in hrefs, etc.  Whilst I don't particularly like the idea
of them applying such rules for a good parse, and suggest it
should be possible to disable them, I think they will be
necessary for a long time.
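
As a rough illustration of the two content heuristics mentioned above
(multi-line comments containing apparent tags, and entity-like names
preceding = signs in hrefs), here is a minimal Python sketch.  The
function name, the regular expressions and the hint strings are all
my own assumptions for illustration; a real browser would make these
checks inside its tokenizer rather than with regular expressions.

```python
import re
from typing import List

# Hypothetical patterns, chosen for illustration only.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
APPARENT_TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)
# An unescaped "&name=" inside an href, e.g. "&copy=2", which a strict
# parser may read as the start of an entity reference.
ENTITY_BEFORE_EQUALS_RE = re.compile(r"&[A-Za-z][A-Za-z0-9]*=")


def tag_soup_hints(html: str) -> List[str]:
    """Return reasons why a page might need lenient ("tag soup") parsing."""
    hints = []
    for match in COMMENT_RE.finditer(html):
        body = match.group(1)
        # Only comments that span lines and contain something tag-like.
        if "\n" in body and APPARENT_TAG_RE.search(body):
            hints.append("multi-line comment containing apparent tags")
            break
    for match in HREF_RE.finditer(html):
        if ENTITY_BEFORE_EQUALS_RE.search(match.group(2)):
            hints.append("entity-like name before '=' in href")
            break
    return hints
```

An empty list means neither heuristic fired; it says nothing about
whether the document is actually valid.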

I don't think it is the job of standards to make rulings on this,
because that will just discredit the standards when they are
not implemented, but I think the standards documents should
point out common abuses that might need error recovery, and
should advise that browsers indicate on the status bar, or
equivalent, that error recovery had to be used, so that users
become more aware of bad HTML.
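
To make the status bar suggestion concrete, here is a small Python
sketch that notes whether end tags had to be repaired and reports that
in a status string.  The class name, the status wording and the two
tag sets are assumptions of mine, the optional-end-tag list is
deliberately incomplete, and a nesting check is only a crude stand-in
for a real strict parse against a DTD.

```python
from html.parser import HTMLParser

# Elements that never have an end tag in HTML 4.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}
# Some elements whose end tag HTML 4 allows to be omitted (not exhaustive).
OPTIONAL_END = {"p", "li", "dt", "dd", "option", "tr", "td", "th"}


class NestingChecker(HTMLParser):
    """Tracks open elements and notes when an end tag needed recovery."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.recovered = False

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        # Legitimately omitted end tags are closed without complaint.
        while self.stack and self.stack[-1] != tag and self.stack[-1] in OPTIONAL_END:
            self.stack.pop()
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            return
        # Mismatched or stray end tag: recover and remember that we did.
        self.recovered = True
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def status_line(html: str) -> str:
    """Return the text a browser might show on its status bar."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return "Done (error recovery used)" if checker.recovered else "Done"
```

For example, status_line("<b><i>x</b></i>") reports that recovery was
used, while a list with omitted </li> tags does not trigger it.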

++ I think it is fairly well known that the people in companies
that get involved with standards often have little control
over the marketing people.  (Actually, none of the company
sites of recent contributors and some major W3C members pass
the W3C validator, mine included; my home one does, as does
W3C's.)
-- 
--------------------------- DISCLAIMER ---------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of BTS.
Received on Wednesday, 26 July 2000 14:52:31 UTC