Re: Of encodings and HTML from Alex Milowski on 2010-06-17 (public-xml-processing-model-wg@w3.org from June 2010)

From: Alex Milowski <alex@milowski.org>
Date: Thu, 17 Jun 2010 14:54:06 +0100
To: public-xml-processing-model-wg@w3.org
Message-ID: <AANLkTinLGlrccLiIaROK-4tiS9yCwjQXSGfc7f3quIXZ@mail.gmail.com>

On Thu, Jun 17, 2010 at 2:38 PM, Norman Walsh <ndw@nwalsh.com> wrote:
> Consider the following document, served with a content type of
> "text/html" with no encoding declaration[1]:
>
>  <html xmlns="http://www.w3.org/1999/xhtml">
>    <head>
>      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>      <title>Some Title</title>
>    </head>
>    <body>Some text with UTF-8 encoded, non-ASCII characters.</body>
>  </html>
>
> Recall that the HTTP specification says that a "text/*" document
> served without an encoding is by default served as US-ASCII.

Actually, it defaults to "ISO-8859-1".
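
To make that default concrete, here is a minimal Python sketch of reading the
charset parameter from a Content-Type header value. The function name and the
"default" parameter are mine, for illustration only; the ISO-8859-1 fallback
for "text/*" is the HTTP/1.1 (RFC 2616) rule referred to above.

```python
from email.message import Message


def charset_from_content_type(header_value, default="ISO-8859-1"):
    """Return the charset named by a Content-Type header value.

    Falls back to `default` for text/* types that carry no charset
    parameter; returns None for other types without one.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    if charset is None and msg.get_content_maintype() == "text":
        return default
    return charset
```

With this, "text/html" yields "ISO-8859-1" and "text/html; charset=US-ASCII"
yields "US-ASCII".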

>
> 1. Is an XProc processor {forbidden|allowed|required} to notice the
>   content-type meta header and correct the encoding?

No.  We're processing XML. ;)

For the text/html media type, it would be a quality of implementation
feature to detect encodings set in a "meta" element.
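
As a rough idea of what such a feature might look like, the sketch below scans
the start of the byte stream for a "meta" element that names an encoding. It
is only an illustration, not the HTML5 prescan algorithm: the names
sniff_meta_charset and _MetaCharsetFinder are made up, and the 1024-byte limit
is an assumption. The bytes are decoded as Latin-1 purely so that
ASCII-compatible markup can be parsed without failing on non-ASCII bytes.

```python
from email.message import Message
from html.parser import HTMLParser


class _MetaCharsetFinder(HTMLParser):
    """Remembers the first encoding declared by a <meta> element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        a = dict(attrs)
        if a.get("charset"):
            # <meta charset="...">
            self.charset = a["charset"].strip()
        elif (a.get("http-equiv") or "").lower() == "content-type" and a.get("content"):
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            msg = Message()
            msg["Content-Type"] = a["content"]
            self.charset = msg.get_param("charset")


def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared in a <meta> element, or None."""
    finder = _MetaCharsetFinder()
    finder.feed(raw[:limit].decode("latin-1"))
    return finder.charset
```

Fed the example document above, this would report "utf-8".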

> 2. Suppose it was sent by the server as "text/html; charset=US-ASCII",
>   is the processor {forbidden|allowed|required} to notice the
>   content-type meta header and correct the encoding in that case?

No.  If you have explicitly given the character encoding on the
Content-Type header, then you should use that.

Here are the bits from the current HTML5 draft about this:

"If an HTML document does not start with a BOM, and if its encoding is
not explicitly given by Content-Type metadata, and the document is not
an iframe srcdoc document, then the character encoding used must be an
ASCII-compatible character encoding, and, in addition, if that
encoding isn't US-ASCII itself, then the encoding must be specified
using a meta element with a charset attribute or a meta element with
an http-equiv attribute in the Encoding declaration state."
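
One way to read that paragraph as a precedence order is sketched below: BOM
first, then the Content-Type charset, then the meta declaration, then a
fallback. This is my reading for illustration, not the draft's normative
algorithm; the function name, the (encoding, source) return shape, and the
"US-ASCII" fallback are assumptions, and the header and meta values are
expected to have been extracted by the caller.

```python
import codecs

# Byte order marks checked, with the encoding each one implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def choose_html_encoding(raw, header_charset=None, meta_charset=None,
                         fallback="US-ASCII"):
    """Return (encoding, source) for a text/html byte stream."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, "bom"
    if header_charset:
        # An explicit Content-Type charset beats any meta declaration.
        return header_charset, "content-type"
    if meta_charset:
        return meta_charset, "meta"
    return fallback, "fallback"
```

For question 2 above, choose_html_encoding(body, "US-ASCII", "utf-8") returns
("US-ASCII", "content-type"): the header value is used and the meta element
is ignored.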


> 3. Suppose the document is an HTML document, not an XHTML one. Are the
>   rules different then? (I think all bets are off in that case,
>   because the document must be passing through some tidy/tagsoup
>   process; but what would we recommend?)

All of the above rules are for text/html.

XHTML follows XML rules.  If you don't send the proper charset
parameter on the Content-Type header, then you need to sniff for
the encoding via auto-detection and the XML declaration.
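
A simplified sketch of that sniffing, loosely following the auto-detection
approach described in Appendix F of the XML 1.0 Recommendation: look at the
leading bytes for a BOM or a UTF-16 "<?", then read the encoding named in the
XML declaration, and otherwise assume UTF-8. The function name is made up, and
UTF-32 and EBCDIC detection are deliberately left out, so treat this as an
illustration rather than a conforming implementation.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration in an
# ASCII-compatible byte stream.
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(raw):
    """Guess the encoding of an XML byte stream with no external charset."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # UTF-16, with a BOM or recognisable from the bytes of "<?".
    if raw.startswith(codecs.BOM_UTF16_BE) or raw.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    match = _XML_DECL_ENCODING.match(raw)
    if match:
        return match.group(1).decode("ascii")
    # No BOM and no encoding declaration: XML defaults to UTF-8.
    return "utf-8"
```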

-- 
--Alex Milowski
"The excellence of grammar as a guide is proportional to the paucity of the
inflexions, i.e. to the degree of analysis effected by the language
considered."

Bertrand Russell in a footnote of Principles of Mathematics

Received on Thursday, 17 June 2010 13:54:40 UTC