NEW ISSUE: NO default charset from Martin Duerst on 2007-10-05 (ietf-http-wg@w3.org from October to December 2007)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Fri, 05 Oct 2007 19:56:32 +0900
To: ietf-http-wg@w3.org
Cc: "Richard Ishida" <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>
Message-Id: <6.0.0.20.2.20071005192948.032fda90@localhost>

Dear HTTP experts,

Here is another issue that apparently hasn't yet been listed.
The HTTP spec, in section 3.7.1, currently claims that for
subtypes of the media type "text", there is a default of iso-8859-1.

In actual practice, this is, at best, wishful thinking. It may look
true if you are in Western Europe or in the Americas, but it doesn't
hold worldwide. There are tons of Web sites in Asia (and Asia is home
to more than half of the world's population) that send no charset
parameter and that are not in iso-8859-1. And browsers in these regions
don't expect unlabeled pages to be iso-8859-1.

In addition, there is the case of text/xml and text/*+xml, where
the default is US-ASCII (RFC 3023), not iso-8859-1.

So the text below should be changed to say that data in all character
sets SHOULD be labeled, and the default should be moved to historic
status. Some corresponding adjustments should also be made to
Section 3.4.1. I'll gladly help with wordsmithing.

For reference, the current draft says:
http://www.w3.org/Protocols/HTTP/1.1/rfc2616bis/draft-lafon-rfc2616bis-latest.html#canonicalization.and.text.defaults

>>>>
The "charset" parameter is used with some media types to define the character set (Section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. See Section 3.4.1 for compatibility problems.
>>>>

I got alerted to this problem again because I was just trying to use
the 'charset' method of the Ruby OpenURI::Meta class
(http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/, unfortunately
with frames not up to Web Architecture standards, so you have to click
a few times). It says:

>>>>
charset() {|| ...}

returns a charset parameter in Content-Type field. It is downcased for canonicalization.

If charset parameter is not given but a block is given, the block is called and its result is returned. It can be used to guess charset.

If charset parameter and block is not given, nil is returned except text type in HTTP. In that case, "iso-8859-1" is returned as defined by RFC2616 3.7.1.
>>>>
(A block is a Ruby-specific construct, which can be ignored in this
context.)

Because the implementers read the HTTP spec and implemented it as
written, the 'charset' method cannot tell an unlabeled page from one
that is really labeled iso-8859-1. For any practical use, I can just
forget about it and have to write my own workaround. The best conclusion
from this is that the spec is wrong and should be fixed to better match
reality.
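
A workaround of this kind might look roughly like the following sketch.
It reads the charset only when the Content-Type value explicitly labels
one, and otherwise hands the decision to the caller instead of assuming
iso-8859-1. The helper names explicit_charset and charset_or_guess are
made up for illustration and are not part of open-uri.

```ruby
# Hypothetical helper (not part of open-uri): return the charset that is
# explicitly labeled in a Content-Type header value, or nil if none is given.
def explicit_charset(content_type)
  return nil if content_type.nil?

  # Parameters follow the media type and are separated by semicolons.
  content_type.split(';').drop(1).each do |param|
    name, value = param.strip.split('=', 2)
    next unless name && value
    next unless name.strip.casecmp?('charset')

    # Drop optional quotes and downcase, as OpenURI::Meta#charset does.
    return value.strip.delete('"').downcase
  end

  nil # no charset parameter: deliberately no iso-8859-1 default
end

# Hypothetical helper: use the explicit label if there is one, otherwise
# call the block so that the caller can guess (e.g. by sniffing the body).
def charset_or_guess(content_type)
  explicit_charset(content_type) || yield
end

puts explicit_charset('text/html; charset=UTF-8').inspect
puts explicit_charset('text/html').inspect
puts charset_or_guess('text/html') { 'euc-jp' }
```

With open-uri itself, the documentation quoted above suggests that
passing a block to charset (for example one that returns nil) should
also avoid the built-in default, since the block's result is returned
when no charset parameter is given.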

Many thanks in advance, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Friday, 5 October 2007 11:14:31 UTC