Re: Doctypes with "[" after public identifier from Leif Halvard Silli on 2010-02-19 (public-html@w3.org from February 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 19 Feb 2010 05:29:15 +0100
To: Philip Taylor <pjt47@cam.ac.uk>
Cc: Boris Zbarsky <bzbarsky@MIT.EDU>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <20100219052915843122.4d1c8596@xn--mlform-iua.no>
Philip Taylor, Thu, 18 Feb 2010 16:47:13 +0000:
> Boris Zbarsky wrote:
>> On 2/17/10 4:29 AM, Philip Taylor wrote:
>>> Yes, but in pre-HTML5 browsers (IE, Firefox 3.6 without html5.enable,
>>> etc) doctypes will still only be parsed up to the *first* ">", so you
>>> will get the characters "]>" inserted as text into the body of the
>>> document
>> 
>> That's the case with the HTML5 parser as well, no?
> 
> Yes - that aspect of the parsing hasn't changed.
> 
> (I think the only browser that attempts to parse this differently is 
> Opera, which seems to ignore any ">" unless it has previously seen an 
> equal number of "[" and "]" characters (in any order).)

Konqueror also has zero problems with the superfluous "]>". 
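
To make the two behaviours described above concrete, here is a minimal 
sketch in Python. It is not taken from any browser's source; the 
function names are hypothetical, and the second function only follows 
Philip's description of what Opera "seems" to do.

```python
def doctype_end_first_gt(text: str) -> int:
    """End the doctype at the first '>' (index just past it), or -1."""
    i = text.find(">")
    return -1 if i == -1 else i + 1


def doctype_end_balanced(text: str) -> int:
    """Assumed Opera-like heuristic: ignore '>' until the number of '['
    and ']' characters seen so far is equal (in any order)."""
    opens = closes = 0
    for i, ch in enumerate(text):
        if ch == "[":
            opens += 1
        elif ch == "]":
            closes += 1
        elif ch == ">" and opens == closes:
            return i + 1
    return -1


sample = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
          '[<!ATTLIST P myattr CDATA #implied >]><p>')
# What is left over after the doctype under each rule:
leftover_first_gt = sample[doctype_end_first_gt(sample):]
leftover_balanced = sample[doctype_end_balanced(sample):]
```

Under the first rule the leftover text starts with the stray "]>", 
which is what ends up in the body; under the second rule it does not.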

One positive aspect of the HTML5 parsing model is that it becomes 
simpler to hide the "]>" via "comment tricks". For example, I tried to 
replicate inside an XHTML1 doctype what I had managed to do inside the 
HTML4 doctype. It was quite easy, except in Firefox (due to the 
stricter comment rules in XHTML, which allow fewer tricks). But as 
soon as I turned on html5.enable, it worked nicely in Firefox as 
well: the "]>" became hidden.

(Of course, it would be better if it disappeared completely ...)

>> I agree with Julian's concern: going from treating a doctype as 
>> standards to treating a doctype as quirks seems like a bad idea to 
>> me.
> 
> As a first approximation, changes are bad. As a second approximation, 
> changes are bad if they break existing content. It's not clear what 
> behaviour here will break least.

You pointed to an example which you claimed looked better in quirks 
mode. OTOH, you said that there were so few such examples that they 
hardly count. It seems better to operate on a principle. And those who 
use [] inside the DOCTYPE probably do not do so because they want 
quirks mode.

> The specific case is "[" after the public identifier, and before the 
> system identifier.

Perhaps this is only a matter of terminology on your part, but my 
example page had a doctype _without_ any system identifier:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
[<!ATTLIST P myattr   CDATA #implied >]>

This is only allowed in HTML4, not in XHTML.

> This can't happen in well-formed XML (the system 
> identifier is required, and the internal subset comes after it), 
> though I've heard that SGML allows it.

I don't think the system identifier can come after the internal subset. 
Can you show a valid such doctype in the validator?

However, _this_ is a valid HTML4 doctype:

<!DOCTYPE HTML PUBLIC 
     --comment--
"-//W3C//DTD HTML 4.01//EN" 
     --comment--
[
  <!ATTLIST P myattr   CDATA #implied --comment-- >
]
     --comment--
>

In Safari 4 it triggers quirks mode, probably due to HTML5 
preparation. The same happens in Firefox with html5.enable, and in 
Opera 10.5 beta. But not in IE, not in Firefox without html5.enable, 
and not in Opera 10.10. That is, in none of the legacy/current 
browsers except Safari.

We should revert the HTML5 behavior as soon as possible!

> It's handled in HTML5 
> (http://whatwg.org/html#between-doctype-public-and-system-identifiers-state) 
> exactly like any other bogus character (i.e. forcing quirks mode), 
> but Firefox appears to have a special case for "[" in this location 
> (preventing quirks).

When you say "Firefox appears to have", you mean Firefox's HTML5 
implementation, I suppose?
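
For reference, the tokenizer state Philip links to can be sketched 
roughly as follows. This is a simplified reading of the spec, not 
anyone's implementation: the state names are hypothetical labels, and 
the special_case_bracket flag is an assumption that merely models the 
"[" exception Philip describes.

```python
from typing import Tuple


def between_public_and_system(ch: str,
                              special_case_bracket: bool = False
                              ) -> Tuple[str, bool]:
    """Return (next_state, force_quirks) for one input character."""
    if ch in ("\t", "\n", "\f", " "):
        # Whitespace is ignored; stay in the same state.
        return ("between_public_and_system", False)
    if ch == ">":
        # The doctype ends here without a system identifier.
        return ("data", False)
    if ch in ('"', "'"):
        # A quote starts the system identifier.
        return ("system_identifier", False)
    if ch == "[" and special_case_bracket:
        # Hypothetical exception: skip to the next '>' without quirks.
        return ("bogus_doctype", False)
    # Any other character is bogus and forces quirks mode.
    return ("bogus_doctype", True)
```

With the flag off, "[" is treated like any other bogus character and 
forces quirks mode; with it on, the rest of the doctype is still 
skipped, but the mode is left alone.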
 
> Looking through half a million pages for the pattern
> 
>   (?i)<!doctype\s+html\s+public\s+"[^"]+"\s*\[
> 
> results in two sites:
> 
>   http://www.freemanforman.co.uk/
>     <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
> [url=http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> 
>   http://symptomresearch.nih.gov/
>     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" []>

Mister Taylor: That is a transitional doctype. The reason it triggers 
quirks has to do with that, and is _not_ related to the "[]".
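
For anyone who wants to repeat the search on their own set of pages, 
here is a minimal sketch in Python using the pattern quoted above. The 
function name is hypothetical, and how the pages are fetched is left 
out.

```python
import re

# Pattern from Philip's message: "[" directly after the public identifier.
DOCTYPE_BRACKET = re.compile(r'(?i)<!doctype\s+html\s+public\s+"[^"]+"\s*\[')


def has_bracket_after_public_id(markup: str) -> bool:
    """True if the markup contains a doctype matching the pattern."""
    return DOCTYPE_BRACKET.search(markup) is not None


hit = has_bracket_after_public_id(
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" []>')
miss = has_bracket_after_public_id(
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">')
```

Note that the pattern does not match a doctype where an SGML comment 
sits between PUBLIC and the public identifier, so such doctypes would 
not show up in this kind of count.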

> Looking for interesting pages on those sites:
> 
> http://www.freemanforman.co.uk/content/001_Area_Search/ - in Firefox 
> 3.6, the map renders incorrectly (it's positioned too far up/right 
> and clipped) if html5.enable is *on* (which triggers quirks mode).

At least in Internet Explorer, that doctype doesn't trigger quirks.

> http://symptomresearch.nih.gov/grantopportunities.htm - the menu 
> items are too widely spaced and the skip link underlines are visible 
> when html5.enable is *off*.
> 
> So something breaks in Firefox either way. Possible options:

Gee. Are you saying that we can stop making transitional doctypes 
trigger quirks? ;-) (See above.)

>  * Ignore this, under the belief that minor breakage of 0.001% of 
> sites (which have bogus doctypes and are already broken in some 
> browsers) is not worth spending more time on.

I see no advantage in that, except convenience for Safari and Opera, 
which have attempted to implement the HTML5 spec more fully here.

>  * Collect more data about whether special-casing "[" would cause 
> more breakage or less breakage, and adjust the spec accordingly. 
> (Probably need to look at tens or hundreds of millions of pages to 
> get a good idea, since it's so rare.)

This is the wrong attitude: the "[" currently/historically doesn't 
have any effect on the parsing mode. And so we should not investigate 
whether we can get away with making it trigger quirks mode.

>  * Make additional changes to the doctype logic so both of these 
> pages can render correctly.

I clearly favour this option. There are likely too few pages to find 
out whether doing the one thing or the other breaks more or less. 
However, we do know that current browsers _do not_ trigger quirks mode 
because of the []. That alone should be enough.

Unless we choose that option, HTML5 will result in more doctypes 
triggering quirks. And that would be a very funny result of the HTML5 
effort ...

> Filed as http://www.w3.org/Bugs/Public/show_bug.cgi?id=9071

Wonder why we needed two bug reports for this ...
-- 
leif halvard silli
Received on Friday, 19 February 2010 04:29:55 UTC