- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Thu, 22 Apr 2010 06:25:45 +0200
- To: Sam Ruby <rubys@intertwingly.net>
- Cc: Eliot Graff <eliotgra@microsoft.com>, Adrian Bateman <adrianba@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "tag@w3.org" <tag@w3.org>, Tony Ross <tross@microsoft.com>, Paul Cotton <Paul.Cotton@microsoft.com>, "mjs@apple.com" <mjs@apple.com>, "plh@w3.org" <plh@w3.org>
Sam Ruby, Wed, 21 Apr 2010 19:14:16 -0400:
> On 04/21/2010 06:15 PM, Eliot Graff wrote:
...
>> If a polyglot document uses an encoding other than UTF8 or UTF16
>
> UTF-16 is not valid for HTML5. I would recommend being more
> prescriptive: simply recommend (or even require) utf-8 as it is the
> only encoding guaranteed to be supported by all HTML and XML parsers.

+1 That would probably make many authors use it for that reason alone!

One of the technical nails in the coffin could be the very fact that
the draft currently says that one SHOULD use the XML declaration
whenever one uses an encoding *other* than UTF-8/UTF-16, whereas we
know that the XML declaration triggers Quirks Mode in IE6. The XML
declaration will also trigger Quirks Mode in IE7 and IE8, *provided*
that the XML declaration is formatted like this:

<?xml
version="1.0" encoding="utf-8"?>

That is: if the first character after the initial '<?xml' happens to
be a line break, then even IE7 and IE8 enter Quirks Mode.

Thus: HTML cannot live up to the "SHOULD" which XHTML and/or XML has
w.r.t. use of the XML declaration whenever non-UTF-8/UTF-16 encodings
are used. And hence we have a technical reason to forbid it.

PS: I hope that technical limitations rather than "this is simpler for
authors" will guide the speccing of this spec. It should define a
common denominator for HTML5 and XHTML5, but not anything stricter
than that. E.g. I would like to know when I can use a minimized
'<p />' *and* get the same DOM in both XHTML and HTML, rather than
having a "simple" rule which requires me to *always* avoid the
minimized <p />.

>> You must specify attribute values as lowercase.
>
> This needs to be made more specific. A few lines after this, you
> provide a counter-example: <img src="karen.jpg" alt="Karen" />

And please say "letters" instead of "values". ;-) Also, for those
attributes where case matters, it only matters for ASCII letters, and
not for e.g. Greek, Cyrillic or non-ASCII Latin letters.

Another issue is attribute *names*, and especially the data-*
attribute. In text/html, data-FOO="" and data-foo="" will be treated
as the same attribute (they will be ASCII-lowercased). Hence,
uppercase ASCII characters must be forbidden in the "-foo" part of the
data-* attribute. However, for all non-ASCII "-foo" names, there is no
need to forbid uppercase letters.

>> You should use only the following named entity references
>
> This should either become a MUST, or this document needs to cover
> what DOCTYPES are acceptable. I would recommend going with MUST.

If we go for a MUST w.r.t. the set of named entity references, do you
then say that other DOCTYPEs than <!DOCTYPE html> are acceptable as
well? (Before saying "yes!", I would like to know if I understood ...)

PS: a note to Eliot: the text about the DOCTYPE says:

]] For a polyglot document, you must use the HTML DOCTYPE. [[

And then you point to the HTML5 spec. However, you need to make clear
that the 'html' part of the DOCTYPE must be lowercase. It must be
<!DOCTYPE html> rather than <!DOCTYPE HTML>.

Also, a note about what you say about PIs and the XML declaration:

]] You must not use processing instructions in a polyglot document.
[ … ] If a polyglot document uses an encoding other than UTF8 or
UTF16, you should include the XML declaration; [ … ] [[

The XML declaration is a PI, at least according to its syntax. Thus it
is a bit strange to say that PIs MUST NOT occur, and thereafter to say
that they SHOULD occur if the document is not in a UTF-8/UTF-16
encoding. Again: this is a reason a) to say that UTF-8 is the only
encoding and b) to forbid the XML declaration. Then you can also drop
saying anything about PIs.
If you want to say anything about PIs, then you should do so in a
context where you speak about content *inside* the HTML element,
rather than when you speak about the content before or inside the
DOCTYPE. (In addition to the XML declaration itself being a PI, PIs
are also permitted inside the very DOCTYPE declaration of XHTML
documents. But before I express my opinion on that issue, I would at
least like to know if we accept other DOCTYPEs than <!DOCTYPE html>.)

>> You should include a space before the trailing / and > of empty
>> elements, e.g. <br />, <hr />
>
> I haven't found this to be necessary.

+1 Me neither.

>> Also, you should use the minimized tag syntax for empty elements,
>> e.g. <br />. The alternative syntax <br></br> allowed by XML gives
>> uncertain results in many existing user agents.
>
> I would recommend that this be a MUST. The specific example you cite
> will produce different DOMs with HTML5 and XML1 parsers.

Justification? Can I assume that the MUST was meant only for "this
specific example"?

Based on the facts on the ground, it is possible to establish a very
detailed set of rules for when to allow and when not to allow (and
*how* to allow) the use of minimized tag syntax, and that is what I
would like to have. For instance:

* For some legacy elements, like the 'br' element (which else?), the
  syntax MUST, of course, be minimized. But these are the exceptions.
  If we make it a MUST to use minimized syntax for other elements than
  'br' (and other exceptions), then this must be justified by pointing
  to other issues than technical ones ...

* For other legacy elements, like 'meta', 'img', 'embed' and 'param',
  both minimized and full syntax work equally well, in both HTML and
  XHTML. The only requirement should be the XHTML1 requirement that
  the end tag follows immediately after the opening tag. (Hence,
  <img></img> is OK. But not <img><!----></img>. And not
  <img> </img>.)
* For completely new elements, <newEmptyElement></newEmptyElement>
  has better text/html compatibility than <newEmptyElement />.

* At the same time, it can be acceptable to use minimized syntax even
  for <newEmptyElement />, provided that the end tag of a supported
  (aka 'legacy') parent element follows immediately after the new,
  minimized element (as this will close the new minimized element even
  in text/html).

I suggest the same rule as for what is permitted between '<img>' and
'</img>' in XHTML, namely: there must not be any white-space or any
HTML comments or anything else between the minimized element and the
closing tag of the parent element. This rule is even good for the
minimized <p /> element and, to a very limited degree, for the
minimized <script /> element. (See below.)

>> Given an empty instance of an element whose content model is not
>> EMPTY (for example, an empty title or paragraph) do not use the
>> minimized form (e.g. use <p> </p> and not <p />).
>
> Would suggest the use of RFC 2119 language (MUST not), and I suggest
> that the example be changed to <script src="..."> as this is an
> example that is particularly problematic.

Justification?

When it comes to <p />, it can be permitted, provided that one
operates with requirements similar to those which are necessary for
new, empty elements written with the minimized syntax: a minimized
<p /> must be immediately followed by the end tag of its parent
element. Though, for <p />, it should also be permitted to write a
minimized <p /> whenever the <p /> is immediately followed by another
'p' element. Thus, for <p />, this can be allowed:

1 <div><p /><p>foo</p></div>
2 <div><p /></div>

But not this:

3 <div><p /><!----><p>foo</p></div>
4 <div><p /><!----></div>
5 <div><p /> </div>

(Since examples 3-5 create a different DOM in XHTML vs in HTML.)
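The <p /> rule above is mechanical enough to check by machine. What
follows is only a minimal sketch in Python, not anything from the
draft: the function name minimized_p_is_safe and both regular
expressions are made up for illustration, and it assumes the input is
otherwise well-formed XML and that the minimized element is written
without attributes. Under that assumption, an end tag directly after
<p /> can only be the end tag of the parent.

```python
import re

# Hypothetical helper, not part of any spec. It checks the rule proposed
# above: a minimized <p /> must be followed immediately by an end tag
# (in well-formed XML that can only be the parent's end tag) or by the
# start tag of another 'p' element.
_MINIMIZED_P = re.compile(r"<p\s*/>")          # <p /> or <p/>, no attributes
_ALLOWED_NEXT = re.compile(r"</[A-Za-z][^>]*>|<p[\s/>]")


def minimized_p_is_safe(markup: str) -> bool:
    """Return True if every minimized <p /> in markup obeys the rule."""
    for match in _MINIMIZED_P.finditer(markup):
        # Nothing at all (white-space, comment, text) may sit between
        # the minimized element and the next permitted tag.
        if not _ALLOWED_NEXT.match(markup, match.end()):
            return False
    return True


if __name__ == "__main__":
    examples = [
        "<div><p /><p>foo</p></div>",
        "<div><p /></div>",
        "<div><p /><!----><p>foo</p></div>",
        "<div><p /><!----></div>",
        "<div><p /> </div>",
    ]
    for number, example in enumerate(examples, start=1):
        print(number, minimized_p_is_safe(example))
```

Run against the five examples above, the sketch accepts 1 and 2 and
rejects 3 to 5. It does not parse anything, so it says nothing about
which DOM a given browser actually builds.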
When it comes to minimized <script />, then, again, the same rules as
for new, empty elements can be followed, except that the parent
element cannot be *any* parent, but only certain specific ones, such
as <iframe>. Hence this should probably be permitted:

1 <iframe src="_"><script src="_" /></iframe>
2 <div><script src="_" /></div></body>
3 <p><script src="_" /></p></body>

But not this:

4 <iframe src="_"><script src="_" /><p></p></iframe>
5 <div><script src="_" /></div><p></p></body>
6 <p><script src="_" /></p><p></p></body>
-- 
leif halvard silli
Received on Thursday, 22 April 2010 04:26:27 UTC