Re: Polyglot Markup/XML encoding declaration from Leif Halvard Silli on 2010-07-28 (public-html@w3.org from July 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 28 Jul 2010 20:28:27 +0300
To: Henri Sivonen <hsivonen@iki.fi>
Cc: HTMLwg <public-html@w3.org>, Eliot Graff <eliotgra@microsoft.com>, public-i18n-core@w3.org
Message-ID: <20100728202827205286.4c0283a6@xn--mlform-iua.no>

Henri Sivonen, Mon, 26 Jul 2010 11:33:38 +0300:
> On Jul 23, 2010, at 17:26, Leif Halvard Silli wrote:
> 
>> Proposal: Polyglot Markup should allow the document encoding to be set 
>> via the encoding attribute of the XML declaration.
> 
> I strongly object to proposals that either make syntax looking like 
> an XML declaration conforming in HTML

It must be a literal XML declaration, not a lookalike. So what you are 
really after is to _change_ the current situation, in which text/html 
permits the XML declaration via XHTML 1.0, Appendix C. For example, the 
W3C-sponsored editor Amaya by default both inserts the XML declaration 
*and* uses the .html file suffix. For Appendix C polyglots, the XML 
declaration is currently permitted. The cat has been out of the bag for 
11 years.

> or that extend the HTML charset 
> sniffing in any way that uses polyglotness as the rationale.

I agree that UAs should not have to sniff. And if both meta@charset and 
the XML declaration are present, then there is no extension of the 
encoding sniffing. It would be fully in the tradition of polyglot 
documents to require both an HTML-compatible method and an 
XML-compatible method for setting the encoding. Just consider xml:lang 
and lang. Thus a simple rule: If you use the XML encoding declaration, 
then an equivalent meta@charset element is a MUST. If this rules out 
<?xml version="1.0" encoding="UTF-16" ?> since <meta charset="UTF-16"/> 
is forbidden, then that's OK.
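
To make the rule concrete, here is a minimal Python sketch of a check 
for it. The function name, the regular expressions and the 
FORBIDDEN_IN_META set are my own illustration, not anything taken from 
a spec; the patterns are deliberately simplistic and only cover the 
straightforward spellings of the two declarations.

```python
import re

# Simplified pattern for an XML declaration at the very start of the
# document; group 1 captures the encoding pseudo-attribute if present.
XML_DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    r'(?:\s+encoding\s*=\s*["\']([^"\']+)["\'])?[^?]*\?>'
)

# Simplified pattern for <meta charset="..."> (HTML or XML style).
META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\']([^"\']+)["\']\s*/?>', re.IGNORECASE
)

# Assumption for illustration: encodings that may not be named in
# meta@charset, following the UTF-16 remark above.
FORBIDDEN_IN_META = {"utf-16"}


def encoding_rule_ok(doc: str) -> bool:
    """Check the proposed rule: an XML encoding declaration requires
    an equivalent meta@charset element."""
    decl = XML_DECL.match(doc)
    xml_enc = decl.group(1) if decl else None
    if xml_enc is None:
        # No XML encoding declaration, so the rule is not triggered.
        return True
    meta = META_CHARSET.search(doc)
    if meta is None:
        return False
    meta_enc = meta.group(1).lower()
    return meta_enc == xml_enc.lower() and meta_enc not in FORBIDDEN_IN_META
```

Under these assumptions, a document with encoding="UTF-16" in the XML 
declaration can never pass, since the matching meta@charset is itself 
ruled out.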

> This stuff is complex enough as it is.

Complexity evaluation is outside the pure spec inference task.

One of the complexities is how to tell an HTML tool or an XML tool to 
produce polyglot syntax instead of native-only syntax.

Here is my idea: Currently, meta@charset is meaningless in XHTML. But 
what if XHTML tools interpreted it as a signal to produce polyglot 
syntax? The presence of the XML declaration could play the same role 
for HTML tools. Though, really, it is the presence of both artifacts 
that should be the polyglot indicator: meta@charset together with the 
XML declaration should be a fairly certain signal to both XHTML and 
HTML tools and authors.
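
A tool could test for that double signal roughly as follows. This is 
only a Python sketch under my own assumptions: the name looks_polyglot 
and both regular expressions are hypothetical, and a real tool would 
use a proper parser rather than pattern matching.

```python
import re

# An XML declaration must be the very first thing in the document.
XML_DECL = re.compile(r'^<\?xml\s[^>]*\?>')

# Simplified match for <meta charset="..."> in either syntax.
META_CHARSET = re.compile(
    r'<meta\s+charset\s*=\s*["\'][^"\']+["\']\s*/?>', re.IGNORECASE
)


def looks_polyglot(doc: str) -> bool:
    """Return True only if both artifacts are present: the XML
    declaration and a meta@charset element."""
    has_xml_decl = XML_DECL.match(doc) is not None
    has_meta_charset = META_CHARSET.search(doc) is not None
    return has_xml_decl and has_meta_charset
```

Either artifact alone is not treated as a signal here, which matches 
the idea that only the combination marks a document as polyglot.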

E.g. a typical polyglot - UTF-8 encoded, that is - could start like 
this:

]]
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta charset="UTF-8"/>
[[

This is why we should discuss the XML declaration and the XML encoding 
declaration separately. And, btw, if you would like to suggest another 
way to distinguish polyglot documents from HTML and XML documents, then 
I am all ears!
-- 
leif halvard silli

Received on Thursday, 29 July 2010 12:51:52 UTC