
Re: Of encodings and HTML

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Thu, 17 Jun 2010 16:07:45 +0100
To: Norman Walsh <ndw@nwalsh.com>
Cc: public-xml-processing-model-wg@w3.org
Message-ID: <f5baaqtlmta.fsf@calexico.inf.ed.ac.uk>

Norman Walsh writes:

> Consider the following document, served with a content type of
> "text/html" with no encoding declaration[1]:
>
>   <html xmlns="http://www.w3.org/1999/xhtml">
>     <head>
>       <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
>       <title>Some Title</title>
>     </head>
>     <body>Some text with UTF-8 encoded, non-ASCII characters.</body>
>   </html>
>
> Recall that the HTTP specification says that a "text/*" document
> served without an encoding is by default served as US-ASCII.
>
> 1. Is an XProc processor {forbidden|allowed|required} to notice the
>    content-type meta header and correct the encoding?

In what situation?  

  1) When passed the URI for the document on the command line?  It can
     do anything it likes, as the spec. explicitly doesn't constrain
     how the primary pipeline input is acquired;

  2) When addressed via <p:document href="..."/> or <p:load
     href="..."/>?  Hmmm.  We sort of blew that, in that the spec. is
     silent as to how the Content-Type header plays wrt the
     requirement that the retrieved representation be "a well-formed
     XML document" [1].  What if the Content-Type were image/jpeg?
     Should you go ahead and try to parse it as XML anyway?

     Assuming the answer is 'yes', then I think the situation is clear
     -- RFC3023 [2] says explicitly that in the case of text/xml, if
     there is no Charset, then you _must_ assume US-ASCII:

     "This example shows text/xml with the charset parameter omitted.
      In this case, MIME and XML processors MUST assume the charset is
      "us-ascii", the default charset value for text media types
      specified in [RFC2046].  The default of "us-ascii" holds even if
      the text/xml entity is transported using HTTP.

     "Omitting the charset parameter is NOT RECOMMENDED for text/xml.
      For example, even if the contents of the XML MIME entity are
      UTF-16 or UTF-8, or the XML MIME entity has an explicit encoding
      declaration, XML and MIME processors MUST assume the charset is
      "us-ascii"."

     I think we have to take this as covering _all_ text/... media types.
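
     That rule can be sketched in a few lines -- this is my
     illustration of the RFC 3023 reading above, not anything from the
     XProc spec., and the function name is mine:

     ```python
     # Sketch of the RFC 3023 default: any text/* media type with no
     # charset parameter must be read as US-ASCII, even if the entity
     # itself carries an XML encoding declaration.
     from email.message import Message

     def effective_charset(content_type: str) -> str | None:
         msg = Message()
         msg["Content-Type"] = content_type
         charset = msg.get_param("charset")
         if charset:
             return str(charset)          # explicit charset parameter wins
         if msg.get_content_maintype() == "text":
             return "us-ascii"            # RFC 3023 / RFC 2046 text/* default
         return None                      # e.g. application/xml: sniff per XML rules

     print(effective_charset("text/xml"))                  # us-ascii
     print(effective_charset("text/html; charset=utf-8"))  # utf-8
     print(effective_charset("application/xml"))           # None
     ```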

     In which case, the result will be a mess -- some XML processors
     (e.g. Saxon) treat UTF-8 bytes as errors, others (rxp) accept
     them as random 8-bit chars.
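
     A quick demonstration of the two behaviours (my sketch, not
     output from either parser):

     ```python
     # What a parser sees when UTF-8 bytes arrive but the transport
     # default forces US-ASCII.
     data = "café".encode("utf-8")   # b'caf\xc3\xa9' -- bytes outside ASCII

     # Strict decoding (the Saxon-like behaviour): the 0xC3 byte is a
     # fatal error.
     try:
         data.decode("ascii")
     except UnicodeDecodeError as e:
         print("strict:", e.reason)   # ordinal not in range(128)

     # Permissive 8-bit decoding (the rxp-like behaviour): each byte
     # becomes one character, yielding mojibake rather than an error.
     print("lenient:", data.decode("latin-1"))   # cafÃ©
     ```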

Out of time, sending this for the call.

> [1] For a concrete example of this problem, see, *cough*,
>     http://docs.oasis-open.org/docbook/specs/docbook-5.0-spec-os.html

I get an XML declaration from that URI, which is a case not covered by
your question?

ht

[1] http://www.w3.org/TR/xproc/#p.document
[2] http://tools.ietf.org/html/rfc3023#section-8.5
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Thursday, 17 June 2010 15:08:20 GMT