Of encodings and HTML

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 17 Jun 2010 09:38:31 -0400
To: public-xml-processing-model-wg@w3.org
Message-ID: <m21vc5wzhk.fsf@nwalsh.com>

Consider the following document, served with a content type of
"text/html" and no charset parameter in the HTTP header[1]:

  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>Some Title</title>
    </head>
    <body>Some text with UTF-8 encoded, non-ASCII characters.</body>
  </html>

Recall that a "text/*" document served without a charset parameter
gets a default encoding: HTTP/1.1 (RFC 2616, section 3.7.1) says
ISO-8859-1, while the MIME default is US-ASCII. Either way, it is not
UTF-8.
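
To make the consequence concrete, here is a small Python sketch of
what happens when UTF-8 bytes are decoded with a legacy default. The
sample string and the helper name decode_or_none are assumptions made
up for illustration; they are not taken from the document in [1].

```python
# Hypothetical sample text; any non-ASCII character shows the same effect.
body = "caf\u00e9".encode("utf-8")  # the bytes as they travel on the wire


def decode_or_none(data, encoding):
    """Decode strictly; return None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_ascii = decode_or_none(body, "us-ascii")     # strict US-ASCII rejects the bytes
as_latin1 = decode_or_none(body, "iso-8859-1")  # decodes, but to the wrong characters
as_utf8 = decode_or_none(body, "utf-8")         # the encoding the meta element names
```

A strict US-ASCII decoder fails outright, and an ISO-8859-1 decoder
silently turns each non-ASCII character into two wrong ones.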

1. Is an XProc processor {forbidden|allowed|required} to notice the
   content-type meta header and correct the encoding?

2. Suppose it was sent by the server as "text/html; charset=US-ASCII",
   is the processor {forbidden|allowed|required} to notice the
   content-type meta header and correct the encoding in that case?

3. Suppose the document is an HTML document, not an XHTML one. Are the
   rules different then? (I think all bets are off in that case,
   because the document must be passing through some tidy/tagsoup
   process; but what would we recommend?)
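
For discussion, the "allowed" reading of questions 1 and 2 could be
sketched roughly as follows. This is only an illustration in Python,
not anything the XProc specification says: the function names, the
1024-byte sniffing window, the meta_overrides_header switch, and the
default value are all assumptions.

```python
import re
from email.message import Message

# Finds charset=... inside a <meta> element (assumed, simplified pattern).
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param("charset")


def meta_charset(body, limit=1024):
    """Return the charset named in a <meta> element near the start, or None."""
    match = META_CHARSET.search(body[:limit])
    return match.group(1).decode("ascii") if match else None


def choose_encoding(content_type, body, meta_overrides_header=False,
                    default="us-ascii"):
    """Pick an encoding from the HTTP header, the meta element, or a default.

    meta_overrides_header=False models question 1: the meta element is
    consulted only when the server sent no charset parameter.
    meta_overrides_header=True models question 2: the meta element wins
    even over an explicit charset from the server.
    """
    declared = header_charset(content_type)
    sniffed = meta_charset(body)
    if declared and not (meta_overrides_header and sniffed):
        return declared
    return sniffed or declared or default
```

With the sample document above, choose_encoding("text/html", doc)
yields the charset from the meta element, while an explicit
"charset=US-ASCII" from the server is kept unless
meta_overrides_header is set. The default is a parameter because
which default applies is itself arguable.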

                                        Be seeing you,
                                          norm

[1] For a concrete example of this problem, see, *cough*,
    http://docs.oasis-open.org/docbook/specs/docbook-5.0-spec-os.html

-- 
Norman Walsh <ndw@nwalsh.com> | Man is the only animal who causes pain
http://nwalsh.com/            | to others with no other object than
                              | wanting to do so.-- Schopenhauer

Received on Thursday, 17 June 2010 13:39:08 GMT