Re: Request for Volunteers: Polyglot spec

Sam Ruby, Wed, 21 Apr 2010 19:14:16 -0400:
> On 04/21/2010 06:15 PM, Eliot Graff wrote:
>> If a polyglot document uses an encoding other than UTF8 or UTF16
> UTF-16 is not valid for HTML5.  I would recommend being more 
> prescriptive: simply recommend (or even require) utf-8 as it is the 
> only encoding guaranteed to be supported by all HTML and XML parsers.

+1  That would probably make many use it only for that reason!

One of the technical nails in the coffin could be the very fact that 
the draft currently says that one SHOULD use the XML declaration 
whenever one uses an encoding *other* than UTF-8/UTF-16. Whereas we 
know that the XML declaration triggers Quirks Mode in IE6. The XML 
declaration will also trigger Quirks Mode in IE7 and IE8, *provided* 
that you use a formatting of the XML declaration like this:

<?xml
version="1.0" encoding="utf-8"?>

That is: If the first character after the initial '<?xml' happens to be 
a line break, then even IE7 and IE8 enter Quirks Mode. Thus: HTML 
cannot live up to the "SHOULD" which XHTML and/or XML has w.r.t. use of 
the XML declaration whenever non-UTF-8/UTF-16 encodings are used. And 
hence we have a technical reason to forbid it.

PS: I hope that technical limitations rather than "this is simpler for 
authors" will guide the speccing of this spec. It should define a 
common denominator for HTML5 and XHTML5. But not anything more strict 
than that. E.g. I would like to know when I can use a minimized '<p />' 
*and* get the same DOM in both XHTML and HTML, rather than having a 
"simple" rule which requires me to *always* avoid the minimized <p />.

>> You must specify attribute values as lowercase.
> This needs to be made more specific.  A few lines after this, you
> provide a counter-example: <img src="karen.jpg" alt="Karen" />

And please say "letters" instead of "values". ;-)

Also, for those attributes where the lowercase issue matters, it only 
matters for ASCII letters, and not for e.g. Greek, Cyrillic or 
non-ASCII Latin letters.

Another issue is attribute *names*, and especially the data-* 
attribute. In text/html, data-FOO="" and data-foo="" will be treated 
as the same attribute (the names will be ASCII-lowercased). Hence, 
uppercase ASCII characters must be forbidden in the "-foo" part of 
the data-* attribute. However, for 
all non-ASCII "-foo" names, there is no need to forbid uppercase 
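
To illustrate the data-* point, here is a minimal Python sketch of ASCII-only lowercasing. It is not taken from any spec or parser; the function names `ascii_lower` and `collide_in_text_html` are made up for this example, and it only models the name folding described above, not a full text/html parser.

```python
def ascii_lower(name: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave every other character as-is."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in name
    )


def collide_in_text_html(a: str, b: str) -> bool:
    """True if two attribute names become identical after ASCII lowercasing."""
    return ascii_lower(a) == ascii_lower(b)


# ASCII case differences are folded away, so these two names collide.
print(collide_in_text_html("data-FOO", "data-foo"))  # True
# Non-ASCII letters are left untouched, so these two stay distinct.
print(collide_in_text_html("data-ÅØ", "data-åø"))  # False
```

In XML the two names in each pair are always distinct, so only the first pair behaves differently between the two syntaxes.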

>> You should use only the following named entity references
> This should either become a MUST, or this document needs to cover 
> what DOCTYPES are acceptable.  I would recommend going with MUST.

If we go for a MUST w.r.t. the set of named entity references, are you 
then saying that DOCTYPEs other than <!DOCTYPE html> are acceptable 
as well? (Before saying "yes!", I would like to know if I 
understood ...)

PS: a note to Eliot: the text about the DOCTYPE says:

For a polyglot document, you must use the HTML DOCTYPE.

And then you point to the HTML5 spec. However, you need to make clear 
that the 'html' part of the DOCTYPE must be lowercased. It must be 
<!DOCTYPE html> rather than <!DOCTYPE HTML>.

Also, a note about what you say about PIs and the XML declaration:

You must not use processing instructions in a polyglot document. [ … ] 
If a polyglot document uses an encoding other than UTF8 or UTF16, you 
should include the XML declaration; [ … ]

The XML declaration is a PI - at least according to its syntax. Thus it 
is a bit strange to say that PIs MUST NOT occur, and thereafter to say 
that they SHOULD occur if the document is not in a UTF-8/UTF-16 
encoding. Again: This is a reason to a) say that UTF-8 is the only 
encoding and b) forbid the XML declaration. Then you can also drop 
saying anything about PIs.

If you want to say anything about PIs, then you should do so in a 
context where you speak about content *inside* the HTML element, rather 
than when you speak about the content before or inside the DOCTYPE. (In 
addition to the XML declaration itself being a PI, PIs are also 
permitted inside the very DOCTYPE declaration of XHTML documents - but 
before I eventually express my opinion on that issue, I would at least 
like to know if we accept other DOCTYPEs than <!DOCTYPE html>.)

>> You should include a space before the trailing / and > of empty
>> elements, e.g. <br />, <hr />
> I haven't found this to be necessary.

+1 Me neither.

>> Also, you should use the minimized tag syntax for empty elements,
>> e.g. <br />. The alternative syntax <br></br> allowed by XML gives
>> uncertain results in many existing user agents.
> I would recommend that this be a MUST.  The specific example you cite 
> will produce different DOMs with HTML5 and XML1 parsers.

Justification? Can I assume that the MUST was meant only for "this 
specific example"? 

Based on the facts on the ground, it is possible to establish a very 
detailed set of rules for when to allow and when to not allow - and 
*how* to allow - the use of minimized tag syntax - and that is what I 
would like to have. For instance:

* for some legacy elements, like the 'br' element (which others?), the 
syntax MUST, of course, be minimized. But these are the exceptions. 
If we make it a MUST to use minimized syntax for elements other than 
'br' (and the other exceptions), then this must be justified by 
pointing to issues other than technical ones ...

* for other legacy elements, like 'meta', 'img', 'embed' and 'param', 
then both minimized and full syntax work equally well, in both HTML and 
XHTML. The only requirement should be the XHTML1 requirement that the 
end tag follows immediately after the opening tag. (Hence, <img></img> 
is OK. But not <img><!----></img>. And not <img>   </img>.)

* For completely new elements, then <newEmptyElement></newEmptyElement> 
has better text/html compatibility than <newEmptyElement /> 

* At the same time, it can be acceptable to use minimized syntax even 
for <newEmptyElement />, provided that the end tag of a supported (aka 
'legacy') parent element follows immediately after the new, minimized 
element (as this will close the new minimized element even in 
text/html). I suggest the same rule as for what is permitted between 
'<img>' and '</img>' in XHTML, namely: There must not be any 
white-space or any HTML comments or anything else between the minimized 
element and the closing tag of the parent element. This rule is even 
good for the minimized <p /> element - and, to a very limited degree - 
for the minimized <script /> element. (See below.)
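
As a rough sketch of the "end tag follows immediately" rule for the legacy empty elements named above ('meta', 'img', 'embed', 'param'), the following Python check flags any such element that has something between its start tag and its end tag. The function name and the regex approach are assumptions made for illustration only; the sketch ignores edge cases such as a '>' inside an attribute value.

```python
import re

# The legacy empty elements listed in the bullet above.
LEGACY_EMPTY = ("meta", "img", "embed", "param")


def end_tags_follow_immediately(markup: str) -> bool:
    """True if every <name ...>...</name> pair for a legacy empty element is truly empty."""
    for name in LEGACY_EMPTY:
        pattern = re.compile(rf"<{name}\b[^>]*>(.*?)</{name}>", re.DOTALL)
        for match in pattern.finditer(markup):
            # Anything between the tags (white-space, a comment, ...) breaks the rule.
            if match.group(1) != "":
                return False
    return True


print(end_tags_follow_immediately("<img></img>"))          # True
print(end_tags_follow_immediately("<img><!----></img>"))   # False
print(end_tags_follow_immediately("<img>   </img>"))       # False
```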

>> Given an empty instance of an element whose content model is not
>> EMPTY (for example, an empty title or paragraph) do not use the
>> minimized form (e.g. use <p> </p> and not <p />).
> Would suggest the use of RFC 2119 language (MUST not), and I suggest 
> that the example be changed to <script src="..."> as this is an 
> example that is particularly problematic.


When it comes to <p />, then it can be permitted, provided that one 
operates with requirements that are similar to those which are 
necessary for new, empty elements written with the non-minimized 
syntax: a minimized <p /> must be immediately followed by the end tag 
of its parent tag. Though, for <p />, it should also be permitted to 
write a minimized <p /> whenever the <p /> is immediately followed by 
another 'p' element.

Thus, for <p />, then this can be allowed:

 1 <div><p /><p>foo</p></div> 
 2 <div><p /></div>

But not this: 

 3 <div><p /><!----><p>foo</p></div>
 4 <div><p /><!----></div>
 5 <div><p />    </div> 

(Since examples 3-5 create a different DOM in XHTML vs in HTML.)
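
The proposed <p /> rule can be sketched as a small Python check. This is only an illustration under simplifying assumptions: the name `minimized_p_is_safe` is hypothetical, attributes on the minimized element are ignored, and "any end tag" is accepted as the follower, on the grounds that in well-formed XML the end tag directly after an empty element can only be the parent's.

```python
import re

# A minimized <p /> without attributes (a simplification for this sketch).
MINIMIZED_P = re.compile(r"<p\s*/>")
# What may follow directly: an end tag, or the start tag of another 'p' element.
ALLOWED_NEXT = re.compile(r"</|<p[\s/>]")


def minimized_p_is_safe(markup: str) -> bool:
    """True if every minimized <p /> is immediately followed by an allowed tag."""
    for match in MINIMIZED_P.finditer(markup):
        if not ALLOWED_NEXT.match(markup, match.end()):
            return False
    return True


examples = [
    "<div><p /><p>foo</p></div>",          # 1: allowed
    "<div><p /></div>",                    # 2: allowed
    "<div><p /><!----><p>foo</p></div>",   # 3: not allowed
    "<div><p /><!----></div>",             # 4: not allowed
    "<div><p />    </div>",                # 5: not allowed
]
for number, markup in enumerate(examples, start=1):
    print(number, minimized_p_is_safe(markup))
```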

When it comes to minimized <script />, then, again, the same rules as 
for new, empty elements, can be followed, except that the parent 
element cannot be *any* parent, but only certain specific ones, such as 

Hence this should probably be permitted:

 1 <iframe src="_"><script src="_" /></iframe>
 2 <div><script src="_" /></div></body>
 3 <p><script src="_" /></p></body>

But not this:

 4 <iframe src="_"><script src="_" /><p></p></iframe>
 5 <div><script src="_" /></div><p></p></body>
 6 <p><script src="_" /></p><p></p></body>

leif halvard silli

Received on Thursday, 22 April 2010 04:26:27 UTC