Doctypes with "[" after public identifier

Boris Zbarsky wrote:
> On 2/17/10 4:29 AM, Philip Taylor wrote:
>> Yes, but in pre-HTML5 browsers (IE, Firefox 3.6 without html5.enable,
>> etc) doctypes will still only be parsed up to the *first* ">", so you
>> will get the characters "]>" inserted as text into the body of the
>> document
> 
> That's the case with the HTML5 parser as well, no?

Yes - that aspect of the parsing hasn't changed.

(I think the only browser that attempts to parse this differently is 
Opera, which seems to ignore any ">" unless it has previously seen an 
equal number of "[" and "]" characters (in any order).)
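A rough sketch of the two termination rules, to make the difference concrete. This is not real browser code: the function names and the sample document are made up for illustration, quoting is ignored entirely, and the second function only models the Opera behaviour as I've guessed it above.

```python
def doctype_end_first_gt(text, start=0):
    """Return the index just past the first '>' (the common behaviour)."""
    i = text.find(">", start)
    return len(text) if i == -1 else i + 1


def doctype_end_bracket_balanced(text, start=0):
    """Return the index just past the first '>' that is seen while the
    counts of '[' and ']' so far are equal (guessed Opera-like rule)."""
    opens = closes = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "[":
            opens += 1
        elif ch == "]":
            closes += 1
        elif ch == ">" and opens == closes:
            return i + 1
    return len(text)


# Hypothetical document with an internal subset, not taken from a real site.
doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
       '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" '
       '[<!ENTITY foo "bar">]><p>')

# What is left over after the doctype under each rule.
rest_first_gt = doc[doctype_end_first_gt(doc):]
rest_balanced = doc[doctype_end_bracket_balanced(doc):]
```

Under the first rule the leftover starts with "]>", which is the stray text that ends up in the body; under the second rule the whole internal subset is swallowed by the doctype.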

> I agree with Julian's concern: going from treating a doctype as 
> standards to treating a doctype as quirks seems like a bad idea to me.

As a first approximation, changes are bad. As a second approximation, 
changes are bad if they break existing content. It's not clear what 
behaviour here will break least.

The specific case is "[" after the public identifier, and before the 
system identifier. This can't happen in well-formed XML (the system 
identifier is required, and the internal subset comes after it), though 
I've heard that SGML allows it. It's handled in HTML5 
(http://whatwg.org/html#between-doctype-public-and-system-identifiers-state) 
exactly like any other bogus character (i.e. forcing quirks mode), but 
Firefox appears to have a special case for "[" in this location 
(preventing quirks).
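As a simplified sketch of that state (EOF handling and parse-error reporting left out), the per-character decision looks roughly like the following. The special_case_bracket flag is hypothetical: it only models "'[' does not force quirks", and the state it moves to is my assumption, not something taken from Firefox's code.

```python
def between_public_and_system(ch, special_case_bracket=False):
    """Return (next_state, force_quirks) for one input character in the
    'between DOCTYPE public and system identifiers' state."""
    if ch in "\t\n\x0c ":
        # Whitespace is ignored; stay in the same state.
        return ("between DOCTYPE public and system identifiers", False)
    if ch == ">":
        # End of the doctype: emit it and go back to the data state.
        return ("data", False)
    if ch == '"':
        return ("DOCTYPE system identifier (double-quoted)", False)
    if ch == "'":
        return ("DOCTYPE system identifier (single-quoted)", False)
    if ch == "[" and special_case_bracket:
        # Hypothetical special case: skip the rest without forcing quirks.
        return ("bogus DOCTYPE", False)
    # Anything else is a parse error that forces quirks mode.
    return ("bogus DOCTYPE", True)
```

With the flag off, "[" is treated like any other bogus character and forces quirks; with it on, the doctype keeps whatever mode its public identifier implies.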

Looking through half a million pages for the pattern

   (?i)<!doctype\s+html\s+public\s+"[^"]+"\s*\[

results in two sites:

   http://www.freemanforman.co.uk/
     <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
[url=http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

   http://symptomresearch.nih.gov/
     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" []>
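For anyone who wants to repeat the search on their own data, a minimal sketch of applying the pattern with Python's re module follows. The pattern and the first two doctypes are the ones above (with the wrapped line break written as a space, which \s* treats the same way); the third doctype is an ordinary one added as a negative case, and the variable names are just for illustration.

```python
import re

# The search pattern from above: '[' right after the public identifier.
BRACKET_AFTER_PUBLIC_ID = re.compile(
    r'(?i)<!doctype\s+html\s+public\s+"[^"]+"\s*\['
)

samples = [
    # http://www.freemanforman.co.uk/
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '[url=http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    # http://symptomresearch.nih.gov/
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" []>',
    # An ordinary doctype with a system identifier (should not match).
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
]

# True where the doctype has '[' directly after the public identifier.
matches = [bool(BRACKET_AFTER_PUBLIC_ID.search(s)) for s in samples]
```

The pattern only accepts a double-quoted public identifier, so a doctype using single quotes would be missed by this search.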

Looking for interesting pages on those sites:

http://www.freemanforman.co.uk/content/001_Area_Search/ - in Firefox 
3.6, the map renders incorrectly (it's positioned too far up/right and 
clipped) if html5.enable is *on* (which triggers quirks mode).

http://symptomresearch.nih.gov/grantopportunities.htm - the menu items 
are too widely spaced and the skip link underlines are visible when 
html5.enable is *off*.

So something breaks in Firefox either way. Possible options:

  * Ignore this, under the belief that minor breakage of 0.001% of sites 
(which have bogus doctypes and are already broken in some browsers) is 
not worth spending more time on.

  * Collect more data about whether special-casing "[" would cause more 
breakage or less breakage, and adjust the spec accordingly. (Probably 
need to look at tens or hundreds of millions of pages to get a good 
idea, since it's so rare.)

  * Make additional changes to the doctype logic so both of these pages 
can render correctly.

Filed as http://www.w3.org/Bugs/Public/show_bug.cgi?id=9071

> -Boris

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Thursday, 18 February 2010 16:47:41 UTC