Of encodings and HTML

Consider the following document, served with a Content-Type header of
"text/html" and no charset parameter[1]:

  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
      <title>Some Title</title>
    </head>
    <body>Some text with UTF-8 encoded, non-ASCII characters.</body>
  </html>

Recall that HTTP/1.1 (RFC 2616, section 3.7.1) says that a "text/*"
document served without a charset parameter defaults to ISO-8859-1
(US-ASCII is the MIME default for "text/plain"). Either way, the
default is not UTF-8.
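
To make the mismatch concrete, here is a minimal sketch (Python,
standard library only) of what a consumer sees when UTF-8 bytes are
decoded under a single-byte default. The sample string and the helper
name decode_or_none are made up for illustration; nothing here is
XProc-specific.

```python
# UTF-8 bytes for a short string with one non-ASCII character (e-acute).
raw = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'


def decode_or_none(data: bytes, charset: str):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# US-ASCII cannot represent bytes above 0x7F, so decoding fails outright.
as_ascii = decode_or_none(raw, "us-ascii")

# ISO-8859-1 accepts every byte, so decoding "succeeds" but yields mojibake:
# the two UTF-8 bytes become two separate Latin-1 characters.
as_latin1 = decode_or_none(raw, "iso-8859-1")

# Only the charset named in the meta element round-trips the text.
as_utf8 = decode_or_none(raw, "utf-8")
```

The ISO-8859-1 case is arguably the worse one, since nothing signals
that the text has been damaged.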

1. Is an XProc processor {forbidden|allowed|required} to notice the
   content-type meta header and correct the encoding?

2. Suppose the server sent it as "text/html; charset=US-ASCII". Is
   the processor {forbidden|allowed|required} to notice the
   content-type meta header and correct the encoding in that case?

3. Suppose the document is an HTML document, not an XHTML one. Are the
   rules different then? (I think all bets are off in that case,
   because the document must be passing through some tidy/tagsoup
   process; but what would we recommend?)
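
For discussion, the choices in questions 1 and 2 can be sketched as a
policy switch. This is a rough Python illustration, not a proposal for
spec text: the function names, the three policy labels ("ignore",
"fill", "override"), the regex-based meta sniffing, and the 1024-byte
prescan window are all assumptions of mine. The default charset is
passed in by the caller, and "us-ascii" below is only a placeholder.

```python
import re
from email.message import Message

# Naive prescan for a charset inside a <meta> tag; a real processor would
# need something more careful than a regular expression.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def header_charset(content_type):
    """Charset parameter of a Content-Type header value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body):
    """Charset declared in a <meta> tag near the start of body, or None."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type, body, default, policy):
    """policy: 'ignore'   = never look at the meta element (forbidden),
               'fill'     = use meta only if the header names no charset,
               'override' = let meta win over the header."""
    from_header = header_charset(content_type)
    from_meta = None if policy == "ignore" else meta_charset(body)
    if policy == "override" and from_meta:
        return from_meta
    return from_header or from_meta or default


DOC = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    b"<title>Some Title</title></head><body>Some text</body></html>"
)

# Question 1: the header names no charset.
q1_ignore = pick_charset("text/html", DOC, "us-ascii", "ignore")
q1_fill = pick_charset("text/html", DOC, "us-ascii", "fill")

# Question 2: the header explicitly says US-ASCII.
q2_fill = pick_charset("text/html; charset=US-ASCII", DOC, "us-ascii", "fill")
q2_override = pick_charset("text/html; charset=US-ASCII", DOC, "us-ascii", "override")
```

Under "fill" the meta element only matters in the question 1 case;
only "override" changes the outcome for question 2.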

                                        Be seeing you,
                                          norm

[1] For a concrete example of this problem, see, *cough*,
    http://docs.oasis-open.org/docbook/specs/docbook-5.0-spec-os.html

-- 
Norman Walsh <ndw@nwalsh.com> | Man is the only animal who causes pain
http://nwalsh.com/            | to others with no other object than
                              | wanting to do so.-- Schopenhauer

Received on Thursday, 17 June 2010 13:39:08 UTC