
[whatwg] [WebApps] Parsing: bogus DOCTYPE state

From: Ian Hickson <ian@hixie.ch>
Date: Wed, 19 Jul 2006 00:20:44 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0607182348100.4826@dhalsim.dreamhost.com>

On Mon, 17 Jul 2006, J. King wrote:
>
> The bogus DOCTYPE state consumes all characters until it gets to EOF or a '>'
> character.  I presume this means that the following DOCTYPE:
> 
>  <!DOCTYPE html blah "http://some<invalid>URI">
> 
> ...would finish at the first > and emit character tokens for 'URI">'.

Correct. That's compatible with the rendering that that DOCTYPE causes in 
Safari, Opera, and Mozilla. (In Mozilla the DOCTYPE actually ends at the 
"<", so you have an <invalid> element in the DOM too. In Safari the 
DOCTYPE can end at a "<" only if it is preceded by a space. The spec 
doesn't have any "<" magic for DOCTYPEs.)
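
To make the rule concrete, here is a rough sketch (not the spec's algorithm, just the "consume until EOF or '>'" behaviour described above) applied to the first example. The function name is made up for illustration, and the assumption that the bogus DOCTYPE state is entered at "blah" is mine.

```python
from typing import Tuple


def split_at_bogus_doctype_end(source: str, bogus_start: int) -> Tuple[str, str]:
    """Return (text swallowed by the DOCTYPE, text left for the tokeniser).

    Hypothetical helper: once in the bogus DOCTYPE state, everything up to
    and including the first '>' is consumed; with no '>' the state runs to EOF.
    """
    end = source.find(">", bogus_start)
    if end == -1:
        # EOF reached before any '>': the DOCTYPE swallows the rest of the input.
        return source, ""
    return source[: end + 1], source[end + 1:]


doc = '<!DOCTYPE html blah "http://some<invalid>URI">'
# Assumption: the tokeniser gives up on the DOCTYPE at the unexpected "blah".
doctype, rest = split_at_bogus_doctype_end(doc, doc.index("blah"))
print(repr(doctype))
print(repr(rest))
```

The leftover text is what would then be tokenised as ordinary character data.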


> Similarly, I imagine this sequence:
> 
>  <!DOCTYPE html blah <html lang="en"><head>
> 
> ...would not produce a start-tag token for 'html'.

Correct, although in Mozilla and Safari it actually does. I doubt this is 
a big deal since in IE there is, as you propose, somewhat more complex 
DOCTYPE parsing at work, and so the DOCTYPEs end up containing the 
entirety of your examples. (Of course, IE then treats them as comments, 
not as DOCTYPEs, in the DOM.)
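
For the second example, the same rule can be written as a character-at-a-time loop, which is closer to how a tokeniser state would be phrased. This is only a sketch of the spec behaviour, not of what Mozilla, Safari, or IE do; the function name and the choice of "blah" as the entry point are assumptions for illustration.

```python
def consume_bogus_doctype(chars: str, pos: int) -> int:
    """Advance past characters until '>' (inclusive) or end of input.

    Hypothetical stand-in for the bogus DOCTYPE state; returns the position
    at which normal tokenising would resume.
    """
    while pos < len(chars):
        ch = chars[pos]
        pos += 1
        if ch == ">":
            break
    return pos


source = '<!DOCTYPE html blah <html lang="en"><head>'
# Assumption: the bogus state starts at the unexpected "blah".
resume_at = consume_bogus_doctype(source, source.index("blah"))
remaining = source[resume_at:]
print(repr(remaining))
```

The text of the html start tag is consumed as part of the DOCTYPE, so the first thing left for the tokeniser is the head start tag.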


> Is this what browsers do, or is this an oversight?

It's compatible with what some browsers do. It was intentional, at least. 
I believe it's actually compatible with the SGML parsing rules, too, 
though I may be mistaken about that and don't have a copy of Goldfarb 
around to check.


> Even if it -is- what browsers do, this behaviour would lead conformance 
> checkers to report the wrong kinds of errors; I would suggest a more 
> complex parsing of DOCTYPEs is necessary.

Well, anything other than <!DOCTYPE HTML> is invalid, so there'll already 
be at least one parse error -- the DOCTYPE being invalid. Conformance 
checkers are, of course, allowed to go out of their way to make their 
errors more understandable.

FWIW, my implementation, which has had very little work put into its 
error handling, reported:

   16: Parse error: unexpected character while tokenising end of DOCTYPE. 
   41: Parse error: errorneous document type declaration.

...on your first example, and:

   16: Parse error: unexpected character while tokenising end of DOCTYPE. 
   36: Parse error: errorneous document type declaration.

...on your second (and no other errors). Those don't seem like the wrong 
kinds of errors. :-)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 18 July 2006 17:20:44 UTC
