- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Thu, 08 Apr 2004 03:42:35 +0200
- To: www-archive@w3.org
Hi,

<http://www.w3.org/TR/2003/WD-xhtml-print-20030729/>:

[...]
  An OPTIONAL "charset" parameter MAY be provided with the MIME type.
  The only valid value for the "charset" parameter is "utf-8". Invalid
  values MUST be ignored and the result be as if the value were "utf-8".
[...]

I think this is most inappropriate.

  * XML 1.0 says authors SHOULD use "UTF-8", not "utf-8"

  * IANA charset names are case-insensitive; whether I use "UTF-8" or
    "uTf-8" should not matter

  * RFC 3236 defines legal values for the charset parameter and it
    allows other values; if you think this is inappropriate, you
    should update RFC 3236. XHTML Print cannot update RFC 3236.

  * A document

      Content-Type: application/xhtml+xml;charset=iso-8859-1

      <html ...

    would be considered UTF-8, while a document

      Content-Type: application/xhtml+xml

      <?xml version='1.0' encoding='iso-8859-1'?>
      <html ...

    would be considered ISO-8859-1 encoded. There is no rule in the
    specification for the XML declaration that corresponds to the
    "charset" behavior. This is inconsistent with basically all
    relevant specifications. It thus becomes a security risk: if you
    process the document with various processors, an implementation
    that just knows the rules for +xml documents (all
    application/xhtml+xml documents) would treat the document
    differently. This becomes critical if that software is expected
    to look for malicious content, for example. It might even become
    possible to do something like

      Content-Type: application/xhtml+xml;charset=shift_jis

      <?xml version='1.0' encoding='shift_jis'?>
      ...
      <p>We do <span style='display:n\one'>not</span> confirm...

    The browser, unaware of XHTML Print's special rules, might render

      We do confirm...

    since it might decode the \ as the yen currency sign, and the
    printer will print

      We do not confirm...

    since it will treat the document as UTF-8 encoded and consider
    the \ a backslash character.
    A more typical case would be that the document is ISO-8859-1
    encoded and has

      Content-Type: application/xhtml+xml;charset=iso-8859-1

      <?xml version='1.0' encoding='iso-8859-1'?>
      <html ...
      <p>Björn</p>...

    which would work in the browser but yield a well-formedness
    error when printing.

  * it is unexpected, since XHTML Print processors are required to
    implement UTF-16 (or they would not be conforming XML processors,
    which should not be encouraged by W3C TRs); hence using this
    encoding seems perfectly fine to content developers

  * not-so-low-end printers might also implement XHTML Print plus
    more character encodings, most likely US-ASCII and ISO-8859-1;
    since the requirement is unexpected, they are likely to produce
    non-conforming implementations (by doing what XML & Co. require...)

  * This is very difficult to implement; for example, the W3C MarkUp
    Validator would first need to decode the document to determine
    the document type (since it needs to know whether the document
    is XHTML 1.1, or XHTML Basic, ... or XHTML Print) and then
    determine the encoding again, decode again, etc.

  * ...

I do not mind if you want to say that XHTML Print documents must be
UTF-8 encoded and that processors must ignore documents in other
encodings (though I consider it unlikely that this will get
implemented), but the text quoted above is not acceptable.

regards.
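
As a rough illustration of the ISO-8859-1 case above, here is a minimal Python sketch of two processors disagreeing about the same bytes. The function names charset_per_generic_xml and charset_per_xhtml_print are hypothetical and do not come from any specification or implementation; the second one only mimics the quoted rule that any charset value other than "utf-8" is treated as "utf-8". Python's own codec lookup stands in for case-insensitive charset name matching.

```python
import codecs


def charset_per_generic_xml(declared):
    """Hypothetical: honor the declared charset, matching its name case-insensitively."""
    return codecs.lookup(declared).name


def charset_per_xhtml_print(declared):
    """Hypothetical: mimic the quoted draft rule, where every value ends up as utf-8."""
    return "utf-8"


def try_decode(data, charset):
    """Return the decoded text, or None if the bytes are not valid in that charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# "Bj\u00f6rn" is "Björn"; the document body is really ISO-8859-1 encoded.
body = "<p>Bj\u00f6rn</p>".encode("iso-8859-1")
declared = "iso-8859-1"

browser_view = try_decode(body, charset_per_generic_xml(declared))
printer_view = try_decode(body, charset_per_xhtml_print(declared))

print("generic +xml processor:", browser_view)
print("XHTML Print processor: ", printer_view)

# Charset names differing only in case resolve to the same codec.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("uTf-8").name
print("UTF-8 and uTf-8 name the same codec:", same_codec)
```

The generic processor recovers the text, while the processor following the quoted rule cannot decode the byte for "ö" as UTF-8 at all, which is the well-formedness error described above.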
Received on Wednesday, 7 April 2004 21:42:55 UTC