Re: Is the P-word? (Was: TAG Decision on Rescinding the request to the HTML WG to develop a polyglot guide) from David Sheets on 2013-01-23 (public-html@w3.org from January 2013)

From: David Sheets <kosmo.zb@gmail.com>
Date: Wed, 23 Jan 2013 15:13:23 -0800
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Cc: Henri Sivonen <hsivonen@iki.fi>, Daniel Glazman <daniel@glazman.org>, Sam Ruby <rubys@intertwingly.net>, Noah Mendelsohn <nrm@arcanedomain.com>, "www-tag@w3.org List" <www-tag@w3.org>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <CAAWM5Twbip0EHsUX78DEQbaVd1UTekACtFDM6zjyh9=DjgZ1TQ@mail.gmail.com>

On Wed, Jan 23, 2013 at 1:11 AM, Leif Halvard Silli
<xn--mlform-iua@xn--mlform-iua.no> wrote:
> David Sheets, Tue, 22 Jan 2013 21:18:00 -0800:
>
>> What is the reason that
>> <http://dev.w3.org/html5/html-xhtml-author-guide/#content-type> says
>>
>> <blockquote>
>> The HTTP Content-Type: header has no extra rules or restrictions,
>> whereas polyglot markup does not use the http-equiv="Content-Type"
>> declaration on the meta element.
>> </blockquote>
>
> The Polyglot Markup spec limits itself to defining a subset of the
> HTML5 spec, which permits meta@charset=UTF-8 in both XHTML code and
> HTML code, whereas the HTML5 spec only permits meta@http-equiv in HTML
> code.

Are you referring to
<http://www.w3.org/html/wg/drafts/html/master/document-metadata.html#attr-meta-http-equiv-content-type>?

See below for the operational details that make these prescriptive
statements pointless.

>> This suggests to me that putting something like
>>
>> <meta http-equiv="Content-Type" content="application/xhtml+xml" />
>
> A case could be made for allowing 'text/html;charset=UTF-8' in XHTML5
> since meta@charset has somewhat limited support outside the GUI browser
> world. For instance, Microsoft Word and Open Office don't support
> <meta charset="UTF-8"/>. Which, I have to admit, feels like a pain in
> polyglot’s robustness principle ass. ;-) But then again: If you
> export/download a Google Docs document (from Google Drive) as HTML, you
> will find that it contains no encoding declaration (and no DOCTYPE for
> that matter) - all the non-ASCII is converted to numerical character
> entities.
>
>> is a potential way to indicate to text/html consumers that this
>> representation is also parseable by an XML parser and interpretable by
>> an XHTML renderer.
>>
>> Is this ill-advised for some reason? Is there a pitfall here of which
>> I am ignorant?
>>
>> It would be nice to embed useful metadata indicating that the present
>> representation is intended to have identical semantics under different
>> media types' interpretations. This would give multi-modal consumers a
>> means to leverage both HTML and XML processing on the document if so
>> instructed.
>
> If you meant that one could include two meta-based encoding
> declaration elements in the same document, then HTML5 forbids that as
> well.
> http://www.w3.org/html/wg/drafts/html/master/document-metadata.html#charset

This would not be an encoding or charset declaration. This would be a
piece of embedded metadata stating that the author's intent is that
the containing representation can be interpreted identically under
text/html and application/xhtml+xml.

The HTML5 spec says
<http://www.w3.org/html/wg/drafts/html/master/document-metadata.html#attr-meta-http-equiv-content-type>:

<blockquote>
The Encoding declaration state is just an alternative form of setting
the charset attribute: it is a character encoding declaration. This
state's user agent requirements are all handled by the parsing section
of the specification.
</blockquote>

which, I believe, refers to
<http://www.w3.org/html/wg/drafts/html/master/infrastructure.html#algorithm-for-extracting-a-character-encoding-from-a-meta-element>:

<blockquote>
Loop: Find the first seven characters in s after position that are an
ASCII case-insensitive match for the word "charset". If no such match
is found, return nothing and abort these steps.
</blockquote>

which indicates to me that <meta http-equiv="Content-Type"
content="application/xhtml+xml" /> would be put in the DOM under both
HTML5 and XHTML5, would not interfere with charset detection, and
would be benign. Non-HTTP HTML consumers can interpret the
representation as text/html and non-HTTP XHTML consumers can interpret
the representation as XHTML. When this representation is served, the
server may extract this embedded metadata to decide how to serve the
document.
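
As a rough illustration of why the tag looks benign for charset
detection, here is a minimal Python sketch of the quoted "Loop" step and
the steps that follow it. It is a simplified reading of the spec
algorithm, not a conformant implementation: the function name
extract_charset is made up for this example, and it returns the raw
label instead of mapping it to an encoding.

```python
import re

# ASCII-only, case-insensitive match for the word "charset".
_CHARSET = re.compile(r"charset", re.IGNORECASE | re.ASCII)
_SPACE = " \t\n\f\r"


def extract_charset(content):
    """Return the raw charset label found in a meta content value, or None.

    Simplified sketch: the real algorithm also resolves the label to an
    encoding, which is skipped here.
    """
    position = 0
    while True:
        match = _CHARSET.search(content, position)
        if match is None:
            return None  # no "charset" found: return nothing
        position = match.end()
        # Skip whitespace between "charset" and "=".
        while position < len(content) and content[position] in _SPACE:
            position += 1
        if position >= len(content) or content[position] != "=":
            continue  # not a parameter; look for the next "charset"
        position += 1
        # Skip whitespace between "=" and the value.
        while position < len(content) and content[position] in _SPACE:
            position += 1
        if position >= len(content):
            return None
        quote = content[position]
        if quote in "\"'":
            end = content.find(quote, position + 1)
            if end == -1:
                return None  # unmatched quote: return nothing
            return content[position + 1:end]
        # Unquoted value runs up to whitespace or a semicolon.
        end = position
        while end < len(content) and content[end] not in _SPACE + ";":
            end += 1
        return content[position:end]
```

Under this reading, a content value of "application/xhtml+xml" yields
nothing, so the element would not take part in encoding detection, while
"text/html; charset=UTF-8" yields the label "UTF-8".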

Do you know of any specific subsystems that fail if this is done? Do
the HTML and XML DOMs diverge? Despite what the "normative" prose in
HTML5 says, the algorithms contained in the spec don't appear to care
about a meta/@http-equiv whose content does not specify a 'charset'
media type parameter.
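
One cheap way to probe the divergence question is to feed the same bytes
to an HTML-ish parser and an XML parser and compare what each reports
for the meta element. The sketch below uses only the Python standard
library. Note the assumptions: html.parser is a tokenizer, not an HTML5
tree builder, so this says nothing about real browser DOMs, and the
sample document and helper names are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical polyglot-style sample document.
DOC = (
    '<!DOCTYPE html>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta http-equiv="Content-Type" content="application/xhtml+xml" />'
    '<title>t</title></head><body><p>x</p></body></html>'
)


class MetaCollector(HTMLParser):
    """Collect the attributes of every meta start tag."""

    def __init__(self):
        super().__init__()
        self.metas = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<meta ... />" via handle_startendtag.
        if tag == "meta":
            self.metas.append(dict(attrs))


def metas_via_html(doc):
    collector = MetaCollector()
    collector.feed(doc)
    collector.close()
    return collector.metas


def metas_via_xml(doc):
    root = ET.fromstring(doc)
    return [dict(el.attrib) for el in root.iter(XHTML_NS + "meta")]
```

For the sample document, both helpers report a single meta element with
the same two attributes, which is at least consistent with the element
being carried through unchanged on both paths.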

This tag seems to be the most appropriate for expressing the
polyglot-ness of an (X)HTML document. Maybe there is another way to
declare this authorial intent, however.

<!DOCTYPE html> implies text/html conformance
<meta http-equiv="Content-Type" content="application/xhtml+xml" />
implies application/xhtml+xml conformance
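
To make the "server may extract this embedded metadata" idea concrete,
here is a hedged Python sketch of how a server might pick a media type
from the embedded declaration. Everything in it is an assumption made
for illustration: the choose_media_type name, the naive substring check
on the Accept header (no q-value handling), and the text/html fallback.

```python
from html.parser import HTMLParser


class _ContentTypeFinder(HTMLParser):
    """Record the media type of the first meta http-equiv=Content-Type."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.declared is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            content = attributes.get("content") or ""
            # Keep only the media type, dropping any parameters.
            self.declared = content.split(";")[0].strip().lower()


def choose_media_type(document, accept):
    """Return the media type a server might use for this document.

    Serves application/xhtml+xml only when the document declares it and
    the client's Accept header mentions it; otherwise text/html.
    """
    finder = _ContentTypeFinder()
    finder.feed(document)
    finder.close()
    wants_xhtml = "application/xhtml+xml" in accept.lower()
    if finder.declared == "application/xhtml+xml" and wants_xhtml:
        return "application/xhtml+xml"
    return "text/html"
```

A document without the declaration, or a client that only accepts
text/html, falls back to text/html under this sketch.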

Thoughts?

David

Received on Wednesday, 23 January 2013 23:18:10 UTC