Re: several messages about the HTML syntax from Ian Hickson on 2008-03-02 (public-html@w3.org from March 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Sun, 2 Mar 2008 23:02:07 +0000 (UTC)
Cc: whatwg@whatwg.org, "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0802290920310.6407@hixie.dreamhostps.com>
Executive summary:

 * Changed the &rang; and &lang; entities (which we'd already changed 
   anyway) to something more appropriate. (r1286)

 * Made a number of things parse errors to allow conformance checkers to 
   catch common attribute mistakes. (r1292, r1293, r1299, r1303)

 * Made a number of changes to parsing for compatibility reasons: entities 
   no longer get parsed between comment delimiters in RCDATA elements, 
   there are three more ways to trigger quirks mode, and DOCTYPE parsing 
   no longer triggers quirks mode if there's trailing garbage. (r1294, 
   r1302, r1306)

 * Made entities at the end of an attribute no longer be a parse error. 
   (r1296)

 * A number of editorial changes. (in range r1286 - r1307)


On Fri, 29 Jun 2007, Henri Sivonen wrote:
> On Jun 29, 2007, at 11:59, Simon Pieters wrote:
> > 
> >    U+003E GREATER-THAN SIGN (>)
> >       Parse error. Set the DOCTYPE token's correctness flag to incorrect.
> >       Emit that DOCTYPE token. Switch to the data state.
> 
> Should the string (public id or system id) that was being built be 
> dropped on the floor as well?

On Fri, 29 Jun 2007, Simon Pieters wrote:
> 
> I don't see a good reason to drop it. The doctype's correctness flag is 
> set to incorrect anyway. But I don't feel strongly about it either way.

Agreed.


On Fri, 29 Jun 2007, Simon Pieters wrote:
> 
> IE seems to not emit the token for > that is in quotes anywhere for both 
> doctypes and bogus comments (or it treats doctypes as bogus comments):
> 
>    <!doctype ">" >
>    <! ">" >
>    <? ">" >
>    </ ">" >
> 
> This does not apply to these:
> 
>    <!-- "-->" -->
>    <% "%>" %>

Yeah, I don't think we want to capture IE's complex rules here.


On Sun, 1 Jul 2007, Øistein E. Andersen wrote:
>
> HTML5 currently maps &lang; and &rang; to
>     U+3008 LEFT ANGLE BRACKET,
>     U+3009 RIGHT ANGLE BRACKET,
> both belonging to `CJK angle brackets' in
>     U+3000--U+303F CJK Symbols and Punctuation.
> 
> HTML 4.01 maps them to
>     U+2329 LEFT-POINTING ANGLE BRACKET,
>     U+232A RIGHT-POINTING ANGLE BRACKET
> from `Angle brackets' in the range
>     U+2300--U+23FF Miscellaneous Technical.
> 
> Unicode 5.0 notes:
> > These are discouraged for mathematical use because of their
> > canonical equivalence to CJK punctuation.
> 
> It would probably be better to use
>     U+27E8 MATHEMATICAL LEFT ANGLE BRACKET,
>     U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET
> from `Mathematical brackets' in
>     U+27C0--U+27EF Miscellaneous Mathematical Symbols-A,
> characters that did not yet exist when HTML 4.01 was published.

I've made this change.


> This approach is suggested by
> http://unicode.org/Public/math/revision-09/MathMap-9.txt:
> > 27E8; lang; ISOTECH; ** # &#10216; MATHEMATICAL LEFT ANGLE BRACKET
> > 27E9; rang; ISOTECH; ** # &#10217; MATHEMATICAL RIGHT ANGLE BRACKET
> 
> Moreover, the (few) browsers I have tested render
> &lang/&rang, &#x2329/&#x232a and &#x27e8/&#x27e9 identically
> or similarly (as "<"/">" in approximative ASCII), whereas
> &#x3008/&#x3009 are rendered as full-width East-Asian
> characters (" <"/"> ").

The browsers I tested were not at all consistent.



On Sun, 1 Jul 2007, L. David Baron wrote:
> 
> What's wrong with these mappings, and why shouldn't they also be the 
> mappings in HTML5?

On Sun, 1 Jul 2007, Øistein E. Andersen wrote:
> 
> The problem is that they are canonically equivalent to CJK characters.

On Sun, 1 Jul 2007, L. David Baron wrote:
> 
> Makes sense.  I think I misread your original message.
> 
> (Although changing them at all seems a little scary.)

Well, we'd changed them anyway (since they previously mapped to 
non-canonical characters); changing them to something better seems at 
least partially sensible... Browsers are pretty poor on these two entities 
anyway.
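
To illustrate the canonical-equivalence problem discussed above, here is a 
minimal sketch using Python's standard unicodedata module. It is not spec 
text, and the helper name is_nfc_stable is made up for this example. It 
shows that the HTML 4.01 code points turn into the CJK ones under NFC 
normalization, while the mathematical brackets are left alone.

```python
import unicodedata

def is_nfc_stable(ch: str) -> bool:
    """Return True if NFC normalization leaves the character unchanged."""
    return unicodedata.normalize("NFC", ch) == ch

# Candidate mappings for &lang; mentioned in the thread.
candidates = {
    "HTML 4.01 (U+2329)": "\u2329",
    "CJK (U+3008)": "\u3008",
    "Mathematical (U+27E8)": "\u27e8",
}

for label, ch in candidates.items():
    normalized = unicodedata.normalize("NFC", ch)
    print(f"{label}: NFC -> U+{ord(normalized):04X}, stable={is_nfc_stable(ch)}")
```

The same check applies to &rang; with U+232A, U+3009 and U+27E9.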


On Fri, 6 Jul 2007, Simon Pieters wrote:
> On Fri, 22 Jun 2007 04:19:53 +0200, Ian Hickson <ian@hixie.ch> wrote:
> > > 
> > >   <a =="">
> > > 
> > > Safari, Opera and Firefox drop the attribute. IE has an attribute 
> > > with the name being the empty string and the value being ="". The 
> > > HTML5 parsing spec says that there should be an attribute with the 
> > > name = and the value the empty string. The "Before attribute name 
> > > state" part of the parsing spec might have to be revisited.
> > 
> > I don't see any harm in leaving the spec as-is here, given the lack of 
> > interoperability and the fact that there's no real reason to be using 
> > attributes with this name anyway. Whatever's simplest to implement is 
> > probably best here.
> 
> Since it doesn't match any browser, and probably is an authoring mistake 
> (that would silently pass conformance checking in the case of <embed>), 
> could it be a parse error? (Also update the wording in the syntax 
> section if so.)

Done.


On Mon, 16 Jul 2007, Henri Sivonen wrote:
> 
> In the Data State the spec says:
> > U+0026 AMPERSAND (&)
> >     When the content model flag is set to one of the PCDATA or RCDATA
> >     states: switch to the entity data state.
> >     Otherwise: treat it as per the "anything else" entry below.
> 
> html5lib tests, WebKit trunk, Firefox 2.0.0.4 and (I've been told) IE7
> disagree. Opera 9.20 agrees with the spec, though.
> 
> To match three of the top four engines, the spec should say:
> U+0026 AMPERSAND (&)
>     When the content model flag is set to one of the PCDATA or RCDATA states
>     *and the escape flag is false*: switch to the entity data state.
>     Otherwise: treat it as per the "anything else" entry below.

Fixed.
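
As a rough sketch of the condition Henri proposes (not the spec's own 
wording), the decision for U+0026 in the data state could look like this. 
The function name and the string values used for the content model flag 
are assumptions made for the example.

```python
def ampersand_starts_entity(content_model: str, escape_flag: bool) -> bool:
    """Decide whether '&' in the data state switches to the entity data state.

    content_model is assumed to be one of "PCDATA", "RCDATA", "CDATA" or
    "PLAINTEXT"; escape_flag mirrors the tokenizer's escape flag.
    """
    return content_model in ("PCDATA", "RCDATA") and not escape_flag

# Inside an escaped section of an RCDATA element, '&' is treated as
# "anything else", i.e. as an ordinary character.
print(ampersand_starts_entity("RCDATA", escape_flag=True))
print(ampersand_starts_entity("RCDATA", escape_flag=False))
```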


On Mon, 23 Jul 2007, Simon Pieters wrote:
> 
> At the tokenization level, a stray ampersand is allowed if the character 
> following it is one of U+0009, U+000A, U+000B, U+000C, U+0020, U+003C, 
> U+0026, or EOF.
> 
>    http://www.whatwg.org/specs/web-apps/current-work/#consume
> 
> The syntax section says:
> 
>    An ambiguous ampersand is a U+0026 AMPERSAND (&) character that is not
>    the last character in the file, that is not followed by a space
>    character, that is not followed by a start tag that has not been
>    omitted, and that is not followed by another U+0026 AMPERSAND (&)
>    character.
> 
>    http://www.whatwg.org/specs/web-apps/current-work/#ambiguous
> 
> This doesn't catch all cases. "<" characters can also be the start of an 
> end tag, a comment, or the actual character (in the RCDATA or attribute 
> value cases). "&" characters can also be the start of a character entity 
> reference.

Fixed.


On Mon, 23 Jul 2007, Simon Pieters wrote:
> 
> Stray ampersands are allowed in some cases. Shouldn't stray less-than 
> signs ("<") be allowed in the same cases? In HTML4, both are allowed.

When should we allow them? The requirements here are pretty complicated 
already, and stray <s aren't a common error, based on Henri's data. Do we 
really want to allow them?


On Tue, 31 Jul 2007, Simon Pieters wrote:
> 
> In http://www.whatwg.org/specs/web-apps/current-work/#consume the spec 
> states that &#13; is a parse error. Is this intentional?

Yes.


> The handling of &#10;, &#13;, CRs and LFs, and their combinations, seems 
> to be a bit different in browsers.
> 
>     http://simon.html5.org/test/html/parsing/tokenisation/entities/carriage-return/demo.htm
> 
> In Opera, CRs and LFs are preserved in the DOM as they were written. CR 
> is inserted for &#13; and LF for &#10;. A CRLF pair in the DOM is 
> rendered as a single linebreak.
> 
> In IE, CRLF pairs are converted to a single CR, and the remaining LFs 
> are converted to CRs. It doesn't matter whether they came from real 
> characters in the input stream or from NCRs.
> 
> In Safari, a LF character in the input stream is ignored if the previous 
> character was a CR (whether real or NCR). CRs (both real and NCRs) are 
> then converted to LFs. LFs are inserted for both &#10; and &#13;.
> 
> In Firefox, CRLF pairs in the input stream are converted to LFs and 
> remaining CRs to LFs. LFs are inserted for both &#10; and &#13;.
> 
> The spec currently matches Firefox, AFAICT. Rendering-wise, there is 
> interop between IE and Opera. I think the spec should require what IE 
> does, except use LFs instead of CRs.

With the exception of treating entities like real characters, which I do 
not think is a good idea, we currently do what IE does, except with LFs 
instead of CRs. Which happens to match Firefox, yes. This seems like the 
optimal situation to me. What's the benefit of further complicating the 
newline processing by making CRLF detection happen after tokenisation 
instead of before?
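
For illustration, the newline handling described above (CRLF pairs become 
a single LF, remaining CRs become LFs, all before tokenisation) can be 
sketched as follows. The function name is hypothetical and this is only a 
sketch of the preprocessing step, not of the full input stream handling.

```python
def normalize_newlines(stream: str) -> str:
    """Normalize newlines in the raw input stream before tokenisation."""
    # Collapse CRLF pairs first so they yield one LF, then convert any
    # CRs that are still left over.
    return stream.replace("\r\n", "\n").replace("\r", "\n")

# Character references are still plain text at this stage, so "&#13;"
# is not affected by the normalization.
print(repr(normalize_newlines("a\r\nb\rc\nd")))
print(repr(normalize_newlines("x&#13;\ny")))
```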


On Tue, 31 Jul 2007, Simon Pieters wrote:
> 
> Aha. I didn't think of testing attributes.
> 
> Safari preserves CRs in attribute values, both real and NCRs. CRLF 
> pairs, LFCR pairs, CRs and LFs cause a single linebreak in the tooltip. 
> In data, CRs don't cause linebreaks.
> 
> For title="", IE preserves CRs in attribute values, both real and NCRs. 
> CRLF pairs, CRs and LFs in the DOM get rendered as a single linebreak 
> in the tooltip. For value="", all types of linebreaks are converted to 
> CRLF pairs. In data, only CRs cause linebreaks and LFs are rendered as 
> spaces.
> 
> Firefox preserves CRs in attribute values, both real and NCRs. CRs are 
> ignored and LFs are rendered as spaces in the tooltip. In data, CRs 
> don't cause linebreaks.
> 
> For title="", Opera drops LFs in attribute values, both real and NCRs, 
> and converts CRs (both real and NCRs) to spaces. For value="", CRs and 
> LFs are preserved as written, both real and NCRs.
> 
> Personally, I think attribute values should be parsed the same way as 
> data is parsed wrt linebreaks.

I agree.


On Tue, 31 Jul 2007, Philip Taylor wrote:
>
> IE undocumentedly recognises some which nobody else does:
> 
> aafs    U+206D  ACTIVATE ARABIC FORM SHAPING
> ass     U+206B  ACTIVATE SYMMETRIC SWAPPING
> iafs    U+206C  INHIBIT ARABIC FORM SHAPING
> iss     U+206A  INHIBIT SYMMETRIC SWAPPING
> lre     U+202A  LEFT-TO-RIGHT EMBEDDING
> lro     U+202D  LEFT-TO-RIGHT OVERRIDE
> nads    U+206E  NATIONAL DIGIT SHAPES
> nods    U+206F  NOMINAL DIGIT SHAPES
> pdf     U+202C  POP DIRECTIONAL FORMATTING
> rle     U+202B  RIGHT-TO-LEFT EMBEDDING
> rlo     U+202E  RIGHT-TO-LEFT OVERRIDE
> zwsp    U+200B  ZERO WIDTH SPACE
> 
> (I believe that list is complete.)
> 
> The first eleven were suggested on 
> https://listserv.heanet.ie/cgi-bin/wa?A2=ind9605&L=html-wg&P=4579 some 
> time ago but don't seem to have gone very far (except into IE).
> 
> I can see some legitimate uses at 
> <http://www.tasb.com/services/field/staff/index.aspx?print=true> and 
> <http://www.pelesoft.co.il/> and maybe there's a few dozen or hundred 
> more elsewhere (but I can't measure it easily). There's some in text-art 
> at <http://yy28.60.kg/test/read.cgi/maido3/1096370177/l50> and quite a 
> lot in weird places like 
> <http://cheese.2ch.net/life/kako/1010/10103/1010391447.html> or 
> <http://zerosen52.gozaru.jp/log/1093422333.html> that I don't understand 
> but that seem to all be on 2channel (or copied from it). I've no idea 
> how common they are in general.
> 
> Are these used significantly on the web, or would they be considered 
> highly useful if anyone knew they existed, or should HTML5 just ignore 
> them?

I'm very skeptical about introducing entities for the codes that are 
redundant with dir="" and <bdo> (namely, lre, lro, pdf, rle, rlo).

I don't know enough about the others to have an educated opinion. I can 
set up a search to examine the data in more detail.


On Mon, 20 Aug 2007, Philip Taylor wrote:
> Cameron McCormack wrote:
> > Robert Burns:
> > > > I believe this is not consistent with existing browser behavior. 
> > > > That is that while surrogate pairs, expressed as pairs of numeric 
> > > > character references, are not supposed to resolve to the non-BMP 
> > > > character, browsers do it anyway.
> > 
> > Anne van Kesteren:
> > > Do you have any tests to demonstrate that?
> > 
> > Here’s one:
> > 
> >   data:text/html,%26%23xD800%3B%26%23xDC00%3B
> > 
> > Shows as a single U+10000 character in Firefox 2.0.0.5 and Opera 9.23, 
> > at least.
> 
> I also get a single character rendered in FF2, Opera 9.2, IE6, IE7 and 
> Safari 3 (Windows). I get two rendered U+FFFD characters in FF3 (build 
> 2007081904).
> 
> There's less consistency in other edge cases:
> http://software.hixie.ch/utilities/js/live-dom-viewer/?%3C!DOCTYPE%20HTML%3E%3Cp%3E0%3A%20%26%23xd800%3B%26%23xdc00%3B%3Cp%3E1%3A%20%26%23xd800%3B%3Cscript%3Edocument.write(%27\udc00%27)%3C/script%3E%3Cp%3E2%3A%20%3Cscript%3Edocument.write(%27\ud800%27)%3C/script%3E%26%23xdc00%3B%3Cp%3E3%3A%20%3Cscript%3Edocument.write(%27%26%23xd800%3B\udc00%27)%3C/script%3E
> 
> It's not obvious to me which cases should be handled in which way.

Surrogates aren't characters. They are meaningless (and should be treated 
as FFFD) outside of UTF-16. The spec is correct in its current 
description, I believe. Browsers that currently allow surrogate pairs to 
be passed in and that then treat them as UTF-16 are probably using UTF-16 
internally.
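
A minimal sketch of that position, assuming a hypothetical helper that 
turns the number from a numeric character reference into a string. Only 
the surrogate case is modelled; the other special cases the spec handles 
are deliberately left out.

```python
REPLACEMENT_CHARACTER = "\ufffd"

def resolve_numeric_reference(code_point: int) -> str:
    """Map a numeric character reference to a string.

    Surrogate code points (U+D800..U+DFFF) are not characters, so each one
    becomes U+FFFD on its own; a pair of them is never combined.
    The caller is assumed to pass a value in the range 1..0x10FFFF.
    """
    if 0xD800 <= code_point <= 0xDFFF:
        return REPLACEMENT_CHARACTER
    return chr(code_point)

# "&#xD800;&#xDC00;" yields two replacement characters, not U+10000.
result = resolve_numeric_reference(0xD800) + resolve_numeric_reference(0xDC00)
print([f"U+{ord(c):04X}" for c in result])
```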


On Mon, 20 Aug 2007, Philip Taylor wrote:
> 
> Looks like the FF3 change was intentional, to fix the "Possible to 
> introduce bogus UTF-16 into Gecko" issue: 
> https://bugzilla.mozilla.org/show_bug.cgi?id=316394

Indeed.


On Sat, 1 Sep 2007, Simon Pieters wrote:
> 
> The following are parse errors:
> 
>    <!-->-->
>    <!--->-->
> 
> ...but the Writing section says that they are allowed.

Fixed.


On Sat, 1 Sep 2007, Simon Pieters wrote:
> 
> The spec has the following note:
> 
>    Note: Space characters before the root html element will be dropped
>    when the document is parsed; space characters after the root html
>    element will be parsed as if they were at the end of the html element.
>    Thus, space characters around the root element do not round-trip. It is
>    suggested that newlines be inserted after the DOCTYPE and any comments
>    that aren't in the root element.
> 
> This is not correct; space characters after the root html element will be
> parsed as if they were at the end of the *body* element.

Fixed.


> Also, if you insert newlines after comments that occur after the root 
> element then you will increase the number of newlines in the body 
> element for each round-trip. The rest of this section goes to great 
> lengths to make sure that whitespace round-trips correctly, so perhaps 
> this suggestion should be limited to comments before the root element to 
> be consistent.

Fixed.


On Sat, 1 Sep 2007, Simon Pieters wrote:
> > 
> > The spec says about optional tags:
> > 
> >     An html element's end tag may be omitted if the html element is 
> >     not immediately followed by a space character or a comment.
> > 
> > However, since spaces after the </html> tag will be inserted into the 
> > [body] element by the parser anyway, the </html> tag might well be 
> > allowed to be omitted when [the html element] is followed by space 
> > characters (but not when those in turn are followed by a comment).
> 
> This similarly applies to </body>.

Fixed both cases.


On Sat, 1 Sep 2007, Simon Pieters wrote:
>> 
>> [allowing non-entity-starting, valid, unambiguous ampersands]
>
> I'm not really fond of this change. It complicates things and makes HTML 
> harder to teach. It might also slip through authoring mistakes. I can 
> imagine that this is something that many authors would refer to as 
> "sloppy coding".
> 
> Moreover, if we are to do this then the < character should get the same 
> treatment, and we might want to allow ' and " too (e.g. the spec uses 
> "<'" in some places), which complicates things even further; the Writing 
> HTML documents section needs to handle a lot more cases, including e.g. 
> the case when the character is the last character of an attribute 
> value... actually thinking about it this is already the case for the 
> unquoted attribute syntax.
> 
> I'd rather this change was reverted.

I haven't added '<', for the reasons you give. I could remove the 
ambiguous ampersand stuff. What do other people think? Is this costing us 
more in hidden errors than it saves in needless work?


On Sat, 15 Sep 2007, Henri Sivonen wrote:
> 
> Currently, unquoted attributes may start with a =: 
> http://parsetree.validator.nu/?doc=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Feq-eq-attr.html
> 
> This means that the notion of conformance fails to catch what is most 
> likely an error: 
> http://html5.validator.nu/?doc=http%3A%2F%2Fhsivonen.iki.fi%2Ftest%2Feq-eq-attr.html
> 
> To make the notion of conformance more useful for authors (that is, to 
> make conformance checking catch unintentional stuff), I suggest making 
> starting an unquoted attribute value with a = a parse error. This 
> wouldn't limit the expressiveness of the language as authors always have 
> the option to quote attribute values.

Done.


On Mon, 17 Sep 2007, Øistein E. Andersen wrote:
> 
> An alternative solution would be to require that unquoted attribute 
> values not contain (single or double ASCII) quotes.

Done.

This removes the ability to do things like this:

   <span title=don't>do not</span>
   <img src=oneill.png title=O'Neill alt>

But oh well.
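
As an illustration only, a conformance checker could flag the two 
unquoted-value mistakes discussed above roughly like this. The function 
name and the error strings are invented for the sketch; the spec expresses 
these as parse errors in the tokenizer states rather than as a check on 
the finished value.

```python
from typing import List

def unquoted_value_errors(value: str) -> List[str]:
    """Return the problems found in an unquoted attribute value."""
    errors = []
    if value.startswith("="):
        errors.append("unquoted value starts with '='")
    if '"' in value or "'" in value:
        errors.append("unquoted value contains a quote character")
    return errors

# title=don't is now flagged; title=dont is fine.
print(unquoted_value_errors("don't"))
print(unquoted_value_errors("dont"))
```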


On Tue, 9 Oct 2007, Henri Sivonen wrote:
> 
> From time to time, people get confused by the disparity between the 
> content models allowed in the DOM and the expressiveness of the 
> text/html serialization. I get confused, too, sometimes even though I 
> should know this stuff.
> 
> I think element definitions should carry a note about text/html 
> limitations when the content model allowed in the DOM and 
> application/xhtml+xml differs from what text/html is able to express.

Added notes to <pre>, <table>, and <noscript>. <optgroup> and <textarea> 
aren't in the spec yet. Are there any other cases?


On Fri, 30 Nov 2007, Yuhong Bao wrote:
> 
> I agree that the HTML5 DOCTYPE should be optional, but how about 
> expanding it to the full thing like the HTML 4.01 DOCTYPE?

What would be the benefit?


On Wed, 19 Dec 2007, Almorca wrote:
>
> Hello. I am a web developer and I would like to propose three new 
> entities for HTML 5.
> 
> I would add the entity sub1; (character U+2081), the entity sub2; 
> (character U+2082) and the entity sub3; (character U+2083). They would 
> be the equivalents of sup1;, sup2; and sup3;.

I've added a note in the markup to remind me of this suggestion. However, 
right now I'm trying to stay away from adding entities that aren't 
already supported, until such time as we've worked out what our policy is 
going to be with entities.


On Thu, 20 Dec 2007, Jirka Kosek wrote:
> 
> Please note that there is ongoing effort to harmonize entity definitions 
> between various markup vocabularies, see:
>
>   http://www.w3.org/2003/entities/

Interesting.


> Ideally, any new vocabulary like HTML5 should either not define 
> entities at all or provide the complete set as defined in:
> 
> http://www.w3.org/2003/entities/2007/w3centities-f.ent
> 
> sub1-3 are currently not defined, so you should first try to convince 
> the Math WG, which is preparing this document. But it would be silly to 
> have different entity sets available in HTML and MathML if they can be 
> embedded inside each other.

Agreed.


On Thu, 3 Jan 2008, Anne van Kesteren wrote:
> 
> The following PUBLIC identifiers need to trigger quirks mode as well. 
> They are currently not part of HTML5, but are part of Gecko and at least 
> with the last "hotmetal" one we encountered a problem.
> 
> In lowercase:
> 
> "-//o'reilly and associates//dtd html extended relaxed 1.0//en"
> "-//softquad software//dtd hotmetal pro 6.0::19990601::extensions to html 4.0//en"
> "-//softquad//dtd hotmetal pro 4.0::19971010::extensions to html 4.0//en"
> 
> Original case:
> 
> "-//O'Reilly and Associates//DTD HTML Extended Relaxed 1.0//EN"
> "-//SoftQuad Software//DTD HoTMetaL PRO 6.0::19990601::extensions to HTML 4.0//EN"
> "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//EN"

Added.
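
For illustration, a lookup for these three identifiers could be sketched 
as below. The names are hypothetical, and matching by lowercasing the 
public identifier is an assumption based on Anne listing the lowercase 
forms. The rest of the quirks-mode list is not reproduced here.

```python
from typing import Optional

# Only the three public identifiers added in this message.
EXTRA_QUIRKS_PUBLIC_IDS = frozenset({
    "-//o'reilly and associates//dtd html extended relaxed 1.0//en",
    "-//softquad software//dtd hotmetal pro 6.0::19990601::extensions to html 4.0//en",
    "-//softquad//dtd hotmetal pro 4.0::19971010::extensions to html 4.0//en",
})

def is_extra_quirks_public_id(public_id: Optional[str]) -> bool:
    """Check a DOCTYPE public identifier against the three new entries."""
    if public_id is None:  # identifier missing from the DOCTYPE token
        return False
    return public_id.lower() in EXTRA_QUIRKS_PUBLIC_IDS

print(is_extra_quirks_public_id(
    "-//SoftQuad//DTD HoTMetaL PRO 4.0::19971010::extensions to HTML 4.0//EN"))
```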


On Thu, 31 Jan 2008, Henri Sivonen wrote:
> > 
> > 0094 / 400	Text after “&” did not match an entity name.
> 
> Using a markup-significant character in URLs was a bad design choice, 
> but it is too late to change it. It would be great if the harmless cases 
> could be made non-errors without making stuff like &copy turning into 
> the copyright sign pass silently.
>
> I don't have a concrete suggestion at this time, though.

I agree that it would be good to magically make things silently work 
without failing to catch the mistakes -- ironically, right now we are 
complaining about the working cases and silently failing when people 
accidentally match an entity. But I don't know how to do it.


On Fri, 1 Feb 2008, Simon Pieters wrote:
> > 
> > > 0094 / 400 Text after “&” did not match an entity name.
> > 
> > Using a markup-significant character in URLs was a bad design choice, 
> > but it is too late to change it. It would be great if the harmless 
> > cases could be made non-errors without making stuff like &copy turning 
> > into the copyright sign pass silently.
> > 
> > I don't have a concrete suggestion at this time, though.
> 
>    If no match can be made, then this is a parse error. No characters are
>    consumed, and nothing is returned.
> 
> s/this is a parse error. N/n/

That hurts people trying to use an entity but failing.


On Sat, 2 Feb 2008, Anne van Kesteren wrote:
> 
> I think this is harmful as it encourages authors to rely on things we 
> might want to change. For instance, introducing the entities from MathML 
> at some point. Also, it doesn't address Henri's second point about 
> catching input errors.

On Sat, 2 Feb 2008, Simon Pieters wrote:
>
> But the MathML entities would have a required semicolon, and you don't 
> really have semicolons in URLs that would make part of it match an 
> entity... though, I haven't really made up my mind about this yet.

Both somewhat true points, but I think it is still bad that we would fail 
to tell authors unsuccessfully trying to use entities that they had failed 
(especially in attributes, where it might not be immediately obvious).


On Sun, 3 Feb 2008, Simon Pieters wrote:
> 
> On Sun, 03 Feb 2008 14:28:35 +0100, Philip Taylor <pjt47@cam.ac.uk> wrote:
> 
> > On http://www.allmovie.com/cg/avg.dll?p=avg&amp;amp;amp;sql=1:162971 I also
> > find:
> > 
> > <div class="bottom_tab"><a href="/cg/avg.dll?p=avg&amp;sql=34: title="click
> > for full list"><img src="/img/nr_tab.gif" alt="full article" width="74"
> > height="20px" /></a></div>
> > 
> > which has the missing quote.
> 
> This is interesting. A validator would catch this particular case since 
> click, for, full and list" are invalid attributes for <a>, but if you're 
> unlucky the markup happens to match allowed attributes or you're using 
> <embed> where any attribute is allowed.
> 
> Some ways to improve this situation:
> 
>   * Make lack of whitespace between attributes a parse error. (Not an 
>     error in HTML4 but authors generally think it is.)

Done.


>   * Make " and ' in attribute names a parse error. (An error in HTML4.)

Done.


>   * Make the empty attribute syntax conforming only for boolean attributes.
>     (HTML4 allows minimization of enumerated attributes, but authors generally
>     think it's only allowed for boolean attributes. Moreover, IE drops src
>     and href attributes when using the empty attribute syntax.)

Not done. I think it's useful to be able to use the empty attribute syntax 
with things like title="" or alt="".


On Mon, 4 Feb 2008, Simon Pieters wrote:
> 
>    <div class=foo">
> 
> ...which probably is as simple as banning " and ' in unquoted attribute
> values.

Done, as noted above.


On Wed, 27 Feb 2008, Philip Taylor wrote:
> 
> http://www.mobile.de/ (from the Alexa Top 500 list) says:
> 
>     <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
> "http://www.w3.org/TR/html4/strict.dtd" />
>
> IE, Firefox and Opera (I've not tested Safari) treat that as standards 
> mode. HTML5 says it must be treated as quirks mode, since the trailing 
> slash is a syntax error and sets the 'incorrect' flag during 
> tokenisation. Is this likely to be a compatibility problem that HTML5 
> should avoid?
>
> Relatedly, http://www.gamespy.com/ says:
> 
>     <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" SYSTEM
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"/>
> 
> which is standards mode in IE, quirks mode in Firefox and Opera, and 
> quirks mode in HTML5.
>
> I see '..." />' on roughly 0.02% of pages from dmoz.org, and (excluding 
> gamespy.com) I see '..."/>' on roughly a quarter of that, so it's not a 
> very widespread issue but it does exist.

Seems to me that it comes out as a wash right now... I am tempted to leave 
it as is unless we get feedback that this is a real problem.


On Thu, 28 Feb 2008, Simon Pieters wrote:
> 
> Testing in Firefox it seems that any garbage after the system identifier 
> is basically ignored. But a trailing slash without system identifier 
> would still trigger quirks mode in Firefox:
> 
>    <!doctype html />
>    <!doctype html public "-//w3c//dtd html 4.01//en"/> 
> 
> Opera's rendering of gamespy.com would align more with Firefox if we used
> quirks mode (large list bullets).

I'm confused. I thought you said Opera did use quirks mode there.


> I think HTML5 needs to ignore garbage after the system identifier like 
> Firefox does. ("Anything else" in the after doctype system identifier 
> state should say "parse error, stay in this state", or maybe the state 
> going into the bogus doctype state should be responsible for setting the 
> correctness flag.)

Ok, done.


On Wed, 27 Feb 2008, Geoffrey Sneddon wrote:
> 
> Currently, section 8.2.4.1 ("The initial phase") speaks of the system 
> identifier being missing, even though the parser will always produce a 
> system identifier, even if it has zero length. It should be described 
> as being empty (i.e., zero length) and not missing.

On Wed, 27 Feb 2008, Anne van Kesteren wrote:
> 
> I'm pretty sure I remember the tokenizer making a difference between the 
> empty string and it being absent. Are you sure?

On Wed, 27 Feb 2008, Geoffrey Sneddon wrote:
> 
> I don't see anything (looking even closer than before) about it.

On Wed, 27 Feb 2008, Anne van Kesteren wrote:
> 
> "When a DOCTYPE token is created, its name, public identifier, and 
> system identifier must be marked as missing, ..."
> 
> "Set the DOCTYPE token's system identifier to the empty string, ..."

Yeah, this is definitely intended to distinguish empty ones from missing 
ones... what can I do to make it clearer?
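
One way to picture the distinction, as a sketch only: model the DOCTYPE 
token with fields that start out as None (missing) and are later set to a 
string, possibly the empty string. The class and field names are 
assumptions for this example and do not come from the spec.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DoctypeToken:
    # None means "missing"; "" means "present but empty".
    name: Optional[str] = None
    public_id: Optional[str] = None
    system_id: Optional[str] = None

def describe(identifier: Optional[str]) -> str:
    """Describe an identifier as missing, empty, or present."""
    if identifier is None:
        return "missing"
    return "empty" if identifier == "" else "present"

token = DoctypeToken(name="html")
print(describe(token.system_id))   # no system identifier seen yet
token.system_id = ""               # e.g. after seeing an opening quote
print(describe(token.system_id))
```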

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Sunday, 2 March 2008 23:02:32 UTC