- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Fri, 05 Oct 2007 19:56:32 +0900
- To: ietf-http-wg@w3.org
- Cc: "Richard Ishida" <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>
Dear HTTP experts,

Here is another issue that apparently hasn't yet been listed.

The HTTP spec, in section 3.7.1, currently claims that for subtypes of the media type "text", there is a default of iso-8859-1.

In actual practice, this is, at best, wishful thinking. It may also pretty much look like it's actually true if you are in Western Europe or in the Americas, but it doesn't apply world-wide. There are tons of Web sites in Asia (and Asia is home to more than half of the World's population) that have no charset, and that are not in iso-8859-1. And browsers in these regions don't expect pages to be iso-8859-1.

In addition, there is the case of text/xml, and text/*+xml, where the default is US-ASCII.

So the text below should be changed to say that data in all character sets SHOULD be labeled, and move the default to historic. Some adequate adjustments should also be made to Section 3.4.1. I'll gladly help with word-smithing.

For reference, the current draft says:
http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/draft-lafon-rfc2616bis-latest.html#canonicalization.and.text.defaults

>>>>
The "charset" parameter is used with some media types to define the character set (Section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See Section 3.4.1 for compatibility problems.
>>>>

I got alerted to this problem again because I was just trying to use the 'charset' method of the Ruby OpenURI::Meta class (http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/, unfortunately with frames not up to Web Architecture standards, so you have to click a few times). It says:

>>>>
charset() {|| ...}

returns a charset parameter in Content-Type field. It is downcased for canonicalization.

If charset parameter is not given but a block is given, the block is called and its result is returned. It can be used to guess charset.

If charset parameter and block is not given, nil is returned except text type in HTTP. In that case, "iso-8859-1" is returned as defined by RFC2616 3.7.1.
>>>>

(a block is a ruby-specific construct, which can be ignored in this context)

The fact that the implementers read the HTTP spec and implemented it as written means that for any practical use, I can just forget about the 'charset' method and have to write my own workaround.

The best conclusion from this is that the spec is wrong and should be fixed to match better with reality.

Many thanks in advance,

Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
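
For illustration only, the kind of workaround mentioned above might look roughly like the sketch below. It assumes the caller already has the raw Content-Type value as a string. The helper name explicit_charset is hypothetical and not part of OpenURI. It returns the charset only when the sender actually labeled the data, and nil otherwise, instead of falling back to "iso-8859-1". The parameter parsing is deliberately naive (it does not handle a ';' inside a quoted value).

```ruby
# Hypothetical helper (not part of OpenURI): return the charset that the
# sender explicitly declared in a Content-Type value, or nil if none.
def explicit_charset(content_type)
  return nil if content_type.nil?

  # Parameters follow the media type, separated by semicolons.
  content_type.split(';').drop(1).each do |param|
    name, value = param.split('=', 2)
    next if name.nil? || value.nil?
    next unless name.strip.casecmp('charset').zero?

    # Drop optional quotes and downcase, mirroring the canonicalization
    # described for OpenURI::Meta#charset.
    return value.strip.delete('"').downcase
  end

  # No charset parameter: report "unknown" rather than assuming a default.
  nil
end
```

A caller can then treat nil as "unlabeled" and apply whatever guessing strategy suits the region or content, which is what the block form of charset described in the quoted documentation appears to be intended for.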
Received on Friday, 5 October 2007 11:14:31 UTC