- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 08 Apr 2004 03:42:35 +0200
- To: www-archive@w3.org
Hi,
<http://www.w3.org/TR/2003/WD-xhtml-print-20030729/>:
[...]
An OPTIONAL "charset" parameter MAY be provided with the MIME type.
The only valid value for the "charset" parameter is "utf-8". Invalid
values MUST be ignored and the result be as if the value were "utf-8".
[...]
I think this is most inappropriate.
* XML 1.0 says authors SHOULD use "UTF-8", not "utf-8"
* IANA charset names are case-insensitive; whether I use "UTF-8" or
"uTf-8" should not matter
* RFC 3236 defines legal values for the charset parameter and it
allows other values; if you think this is inappropriate, you
should update RFC 3236. XHTML Print cannot update RFC 3236.
* A document
Content-Type: application/xhtml+xml;charset=iso-8859-1
<html ...
would be considered UTF-8 while a document
Content-Type: application/xhtml+xml
<?xml version='1.0' encoding='iso-8859-1'?>
<html ...
would be considered ISO-8859-1 encoded. There is no rule in the
specification that corresponds to the "charset" behavior for the
XML declaration. This is inconsistent with basically all relevant
specifications. It thus becomes a security risk if you process
the document with various processors: an implementation that just
knows the rules for +xml documents (which cover all
application/xhtml+xml documents) would treat the document
differently. This becomes critical if that software is expected to
look for malicious content, for example. It might even become
possible to do something like
Content-Type: application/xhtml+xml;charset=shift_jis
<?xml version='1.0' encoding='shift_jis'?>
...
<p>We do <span style='display:n\one'>not</span> confirm...
The browser, unaware of XHTML Print's special rules, might render
We do confirm...
since it might decode the \ as the yen currency sign, while the
printer will print
We do not confirm...
since it will treat the document as UTF-8 encoded and consider the
\ a backslash character.
A more typical case would be that the document is ISO-8859-1
encoded and has
Content-Type: application/xhtml+xml;charset=iso-8859-1
<?xml version='1.0' encoding='iso-8859-1'?>
<html ... <p>Björn</p>...
which would work in the browser but result in a well-formedness
error when printing.
* It is unexpected, since XHTML Print processors are required to
implement UTF-16 (or they would not be conforming XML processors,
which should not be encouraged by W3C TRs); hence using this
encoding seems perfectly fine to content developers
* Not-so-low-end printers might also implement XHTML Print plus more
character encodings, most likely US-ASCII and ISO-8859-1; since
the requirement is unexpected, they are likely to produce
non-conforming implementations (by doing what XML & Co. require...)
* This is very difficult to implement; for example, the W3C MarkUp
Validator would first need to decode the document to determine the
character encoding of the document (since it needs to know whether
the document is XHTML 1.1, or XHTML Basic, ... or XHTML Print) and
then determine the encoding again, decode again, etc.
* ...
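To make the inconsistency concrete, here is a minimal Python sketch of
the two readings, not an implementation of either specification. The
function names are hypothetical. The "generic" rule is a simplified
assumption about +xml processing (charset parameter first, then the XML
declaration, then UTF-8); it ignores BOMs and UTF-16 detection. The
"XHTML Print" rule follows the draft text quoted above as I read it: any
charset parameter ends up meaning utf-8, while the XML declaration is
left alone.

```python
import re


def charset_param(content_type):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    m = re.search(r';\s*charset\s*=\s*"?([^";\s]+)"?', content_type, re.IGNORECASE)
    return m.group(1).lower() if m else None


def xml_decl_encoding(body):
    """Return the encoding named in an ASCII-compatible XML declaration, or None."""
    m = re.match(
        rb"<\?xml[^>]*?encoding\s*=\s*['\"]([A-Za-z][A-Za-z0-9._-]*)['\"]", body
    )
    return m.group(1).decode("ascii").lower() if m else None


def generic_xml_encoding(content_type, body):
    """Simplified generic +xml rule (an assumption, see lead-in):
    charset parameter, else XML declaration, else UTF-8."""
    return charset_param(content_type) or xml_decl_encoding(body) or "utf-8"


def xhtml_print_encoding(content_type, body):
    """Quoted draft rule as read in this message: a charset parameter,
    valid or not, results in utf-8; otherwise the XML declaration applies."""
    if charset_param(content_type) is not None:
        return "utf-8"
    return xml_decl_encoding(body) or "utf-8"


# The "more typical case" from the message: ISO-8859-1 bytes, labelled twice.
content_type = "application/xhtml+xml;charset=iso-8859-1"
body = (
    "<?xml version='1.0' encoding='iso-8859-1'?>\n"
    "<html><p>Björn</p></html>"
).encode("iso-8859-1")

for name, rule in (("generic", generic_xml_encoding),
                   ("xhtml-print", xhtml_print_encoding)):
    encoding = rule(content_type, body)
    try:
        body.decode(encoding)
        outcome = "decodes"
    except UnicodeDecodeError:
        outcome = "fails to decode"
    print(name, encoding, outcome)
```

Under these assumptions the two rules pick different encodings for the
same bytes, and the byte for "ö" is not valid UTF-8, which is the
well-formedness error described above. Lowercasing the parameter before
comparing it also covers the "UTF-8" versus "uTf-8" point.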
I do not mind if you want to say that XHTML Print documents must be
UTF-8 encoded and that processors must ignore documents in other
encodings (though I consider it unlikely that this will get
implemented), but the text quoted above is not acceptable.
regards.
Received on Wednesday, 7 April 2004 21:42:55 UTC