Re: Request for Volunteers: Polyglot spec from Leif Halvard Silli on 2010-04-24 (public-html@w3.org from April 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Sat, 24 Apr 2010 21:09:58 +0200
To: Sam Ruby <rubys@intertwingly.net>
Cc: Eliot Graff <eliotgra@microsoft.com>, Adrian Bateman <adrianba@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "tag@w3.org" <tag@w3.org>, Tony Ross <tross@microsoft.com>, Paul Cotton <Paul.Cotton@microsoft.com>, "mjs@apple.com" <mjs@apple.com>, "plh@w3.org" <plh@w3.org>
Message-ID: <20100424210958417411.eb99aefb@xn--mlform-iua.no>

Sam Ruby, Wed, 21 Apr 2010 19:14:16 -0400:
> On 04/21/2010 06:15 PM, Eliot Graff wrote:
>> Today, I uploaded an EARLY draft version of a polyglot spec,
>> "HTML/XHTML Compatibility Authoring Guidelines." [1]
> 
> A few QUICK comments:
> 
>> If a polyglot document uses an encoding other than UTF8 or UTF16
> 
> UTF-16 is not valid for HTML5.  I would recommend being more 
> prescriptive: simply recommend (or even require) utf-8 as it is the 
> only encoding guaranteed to be supported by all HTML and XML parsers.

Regarding the META element. Draft says:

[[ You SHOULD use the HTML meta tag to specify 
character and coding in the document. ]]

Depending on what kind of specification this is going to be, the 
META should be a MUST, for round-tripping reasons.

E.g. take a random page from your blog, Sam:
http://intertwingly.net/blog/2010/04/22/Restoring-floatflt-sty

It doesn't use META. And, alas, despite the correct MIME type, some 
UAs/tools - even XHTML compatible ones (such as the WebKit based iCab, 
but not Safari) - save your page to disk with '.html' as suffix. (Opera 
goes "overboard" and saves it as .xml.) Regardless of how it happens, 
once the page is on disk as .html, tools/UAs *may* default to the 
locale encoding, as specced in HTML5. The problem is seen e.g. in 
SeaMonkey's Composer and - now and then - in its cousin, KompoZer. If 
both UTF-8/UTF-16 *and* the META charset element were required, then 
there would seldom be encoding problems.
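
To illustrate the round-tripping problem, here is a minimal Python 
sketch. It is not taken from any of the tools mentioned; the helper 
name decode_saved_page, the 1024-byte window and the windows-1252 
locale default are all assumptions made for the example. It mimics a 
tool that opens a saved .html file with no HTTP headers to consult:

```python
import re

# Hypothetical helper, not any real tool's code: look for a META charset
# declaration near the top of the file, and otherwise fall back to an
# assumed locale default encoding.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def decode_saved_page(data: bytes, locale_default: str = "windows-1252") -> str:
    """Decode a page read from disk, where the MIME type is no longer known."""
    match = META_CHARSET.search(data[:1024])
    encoding = match.group(1).decode("ascii") if match else locale_default
    return data.decode(encoding, errors="replace")

body = "<p>Blåbærsyltetøy</p>"
head_plain = "<html><head><title>t</title></head><body>"
head_meta = '<html><head><meta charset="utf-8"/><title>t</title></head><body>'

# Both pages are stored as UTF-8 bytes; only one declares it in the markup.
without_meta = (head_plain + body + "</body></html>").encode("utf-8")
with_meta = (head_meta + body + "</body></html>").encode("utf-8")

print(body in decode_saved_page(with_meta))     # True
print(body in decode_saved_page(without_meta))  # False: text is garbled
```

Without the META element, the UTF-8 bytes are read in the locale 
encoding and the non-ASCII letters come out garbled.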

But on the other hand:

[[
4. Namespaces
The following guidelines apply to namespaces used in polyglot documents.
  * The <html> element must have the namespace declaration 
xmlns="http://www.w3.org/1999/xhtml". [etc]
]]

Firstly, a nit: Why say "guidelines", and then subsequently say "MUST 
have namespace declaration"?

Secondly: Can we expect *benefits* inside text/html from using these 
namespaces? *If* a particular namespace, or something else, counted as 
a "polyglot document identifier", then failing to treat the document 
as a UTF-8 or UTF-16 file could be considered an error.

Such a flag could also be used to prevent another common problem today: 
Tools and UAs which "normalize" XHTML syntax into HTML4 compatible 
syntax. 

Real world experience:

Gecko based SeaMonkey's Composer reads Sam's blog as XHTML, but 
converts the syntax to HTML4 syntax *and* inserts a META charset - 
without "/>" - with the correct encoding into the document. The same 
thing happens when opening a saved version of Sam's page, regardless 
of .xhtml or .html suffix.

Another tool, the Gecko based KompoZer, opens the online version of 
Sam's page fine. And saves it as .xhtml (well, honestly, it is quite 
ad lib with what it does w.r.t. suffix). Subsequently it refuses to 
re-open the document, because it converted it to HTML4 syntax - it 
simply prompts an alert saying "This is not a HTML document". This 
happens even though its preferences are set to retain the source code.

DOCTYPE gotcha in KompoZer: For a 'file.html' with an XHTML1 doctype, 
KompoZer does NOT "normalize" "/>" to ">". But if the DOCTYPE is 
<!DOCTYPE html>, then it *does* do that. (SeaMonkey's Composer does it 
regardless.) 

Such silent conversion from XHTML syntax to HTML4 syntax is a common 
problem. I have also had it in W3C's own Amaya, occasionally. Though 
Amaya has tools for converting between syntaxes, so it is much less of 
a problem there. 
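
Why the conversion matters can be shown with a small Python sketch. 
The normalize_to_html4 function is only a crude stand-in for what the 
editors do (an assumption, not their actual code); the point is that 
once "/>" becomes ">", the file is no longer well-formed XML:

```python
import re
import xml.etree.ElementTree as ET

def is_well_formed_xml(text: str) -> bool:
    """True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

def normalize_to_html4(text: str) -> str:
    """Crude stand-in for the tool behaviour: rewrite '/>' as '>'."""
    return re.sub(r"\s*/>", ">", text)

polyglot = ('<html xmlns="http://www.w3.org/1999/xhtml"><head>'
            '<meta charset="utf-8"/><title>t</title></head>'
            '<body><p>line<br/>break</p></body></html>')

print(is_well_formed_xml(polyglot))                      # True
print(is_well_formed_xml(normalize_to_html4(polyglot)))  # False
```

The second result is the situation KompoZer ends up in: it has written 
a file that it can no longer open as XHTML.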

To put the above another way: We are looking to create a spec which 
requires XHTML tools to produce "Appendix 5" compatible XHTML. 
Effectively, XHTML tools must learn a new dialect of XHTML. But could 
we also flag these files in such a way that even text/html tools are 
*required* to not "normalize" the code of such files to HTML4 
compatible syntax? I.e. could we require text/html tools to know two 
dialects of HTML syntax?

So, what do we need? A new DOCTYPE which requires text/html user 
agents, not to save well formed XHTML, but to not "normalize" the 
syntax into HTML4-ish HTML? Or can the XHTML namespace talisman 
serve this purpose? Or must we simply give up?

The "Appendix 5" spec could emphasize that it is an error to save 
application/xhtml+xml served pages with the file suffix .xhtml, no? But 
on the other hand, if it is a polyglot spec, why should it require that 
pages are saved one way or the other?
-- 
leif halvard silli

Received on Saturday, 24 April 2010 19:10:36 UTC