[outdated] XHTML Print: "invalid" charset specifications from Bjoern Hoehrmann on 2004-04-08 (www-archive@w3.org from April 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 08 Apr 2004 03:42:35 +0200
To: www-archive@w3.org
Message-ID: <407fad85.21285647@smtp.bjoern.hoehrmann.de>
Hi,

  <http://www.w3.org/TR/2003/WD-xhtml-print-20030729/>:

[...]
  An OPTIONAL "charset" parameter MAY be provided with the MIME type.
  The only valid value for the "charset" parameter is "utf-8". Invalid
  values MUST be ignored and the result be as if the value were "utf-8".
[...]

I think this is most inappropriate.

  * XML 1.0 says authors SHOULD use "UTF-8", not "utf-8"

  * IANA charset names are case-insensitive; whether I use "UTF-8" or
    "uTf-8" should not matter

  * RFC 3236 defines legal values for the charset parameter and it
    allows other values; if you think this is inappropriate, you
    should update RFC 3236. XHTML Print cannot update RFC 3236.

  * A document

      Content-Type: application/xhtml+xml;charset=iso-8859-1

      <html ...

    would be considered UTF-8 while a document

      Content-Type: application/xhtml+xml

      <?xml version='1.0' encoding='iso-8859-1'?>
      <html ...

    would be considered ISO-8859-1 encoded. There is no rule in the
    specification for the XML declaration that corresponds to the
    "charset" behavior. This is inconsistent with basically all
    relevant specifications. It thus becomes a security risk: if you
    process the document with various processors, an implementation
    that just knows the rules for +xml documents (which apply to all
    application/xhtml+xml documents) would treat the document
    differently. This becomes critical if that software is expected
    to look for malicious content, for example. It might even become
    possible to do something like

      Content-Type: application/xhtml+xml;charset=shift_jis

      <?xml version='1.0' encoding='shift_jis'?>
      ...
      <p>We do <span style='display:n\one'>not</span> confirm...

    The browser, unaware of XHTML Print's special rules, might render

      We do confirm...

    since it might decode the \ as the yen currency sign, while the
    printer will print

      We do not confirm...

    since it will treat the document as UTF-8 encoded and consider the
    \ a backslash character.

    A more typical case would be that the document is ISO-8859-1
    encoded and has

      Content-Type: application/xhtml+xml;charset=iso-8859-1

      <?xml version='1.0' encoding='iso-8859-1'?>
      <html ... <p>Björn</p>...

    which would work in the browser but yield a well-formedness
    error when printing.

  * it is unexpected since XHTML Print processors are required to
    implement UTF-16 (or they would not be conforming XML processors,
    which should not be encouraged by W3C TRs), hence using this
    encoding seems perfectly fine to content developers

  * not-so-low-end printers might also implement XHTML Print plus more
    character encodings, most likely US-ASCII and ISO-8859-1; since
    the requirement is unexpected, they are likely to produce
    non-conforming implementations (by doing what XML & Co. require...)

  * This is very difficult to implement; for example, the W3C MarkUp
    Validator would first need to decode the document to determine the
    character encoding of the document (since it needs to know whether
    the document is XHTML 1.1, or XHTML Basic, ... or XHTML Print) and
    then determine the encoding again, decode again, etc.

  * ...
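
The mismatch between the two readings in the list above can be
reproduced with a small sketch. Everything named here is hypothetical
and only for illustration: generic_xml_encoding approximates what a
generic +xml processor does (charset parameter first, then the XML
declaration, then UTF-8; byte order marks are ignored for brevity),
and xhtml_print_encoding follows the quoted draft text as I read it
(any charset parameter ends up meaning "utf-8").

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start.
XML_DECL = re.compile(rb"""<\?xml[^>]*?encoding=["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def declared_encoding(body):
    """Return the encoding named in the XML declaration, or None."""
    m = XML_DECL.match(body)
    return m.group(1).decode("ascii").lower() if m else None

def generic_xml_encoding(charset, body):
    """Assumed generic +xml behavior: the charset parameter wins."""
    if charset:
        return charset.lower()  # charset names are case-insensitive
    return declared_encoding(body) or "utf-8"

def xhtml_print_encoding(charset, body):
    """Draft rule as read above: any charset parameter acts as utf-8."""
    if charset:
        return "utf-8"
    return declared_encoding(body) or "utf-8"

# The "Björn" example, served with charset=iso-8859-1.
body = "<?xml version='1.0' encoding='iso-8859-1'?><p>Björn</p>".encode("iso-8859-1")

# A browser following the generic rules decodes the document fine.
browser_text = body.decode(generic_xml_encoding("iso-8859-1", body))

# A printer following the draft rule tries UTF-8 and fails on the 0xF6 byte.
try:
    body.decode(xhtml_print_encoding("iso-8859-1", body))
    printer_ok = True
except UnicodeDecodeError:
    printer_ok = False
```

The same bytes decode cleanly under one rule and fail under the other,
which is the well-formedness error described for the printing case.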

I do not mind if you want to say that XHTML Print documents must be
UTF-8 encoded and that processors must ignore documents in other
encodings (though I consider it unlikely that this will get
implemented), but the text quoted above is not acceptable.
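
As a rough illustration of that alternative (not something the draft
specifies), a processor could refuse anything that is not UTF-8
instead of silently reinterpreting it. The function name
accept_for_printing is made up for this sketch, and it deliberately
does not inspect the XML declaration.

```python
def accept_for_printing(charset, body):
    """Return the decoded text if the document is UTF-8, else None.

    Hypothetical policy: a charset parameter other than utf-8
    (compared case-insensitively) or bytes that are not valid UTF-8
    cause the document to be ignored rather than reinterpreted.
    """
    if charset is not None and charset.strip().lower() != "utf-8":
        return None
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

With this policy the charset=iso-8859-1 document is ignored outright,
so the printer never produces output that differs from what other
processors show.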

regards.
Received on Wednesday, 7 April 2004 21:42:55 UTC