- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Tue, 07 Sep 2010 09:27:51 -0400
On 9/7/10 9:16 AM, Philip Jägenstedt wrote:
> UTF-8, Big5 and GBK are all (as far as I know) ASCII supersets. Do
> real-world text documents include \0 bytes?

Yes. Real-world text documents include all sorts of gunk. Just rarely.

>> As long as "indicates an encoding" doesn't include UTF-8 or ISO-8859-1
>> (thanks, Apache!), that should be reasonable, I think.
>
> Are you saying that Apache has, at various times, set the default
> character encoding to UTF-8 or ISO-8859-1?

Yes, precisely. Though the UTF-8 stuff was Linux distros, I think, not Apache itself (in that Apache just sent the thing passed to AddDefaultCharset and they changed the value of that from ISO-8859-1 to UTF-8 in their distro packages).

Here's the relevant comment from the Gecko source where we do our text-or-binary sniffing for toplevel contexts:

    Make sure to do a case-sensitive exact match comparison here.
    Apache 1.x just sends text/plain for "unknown", while Apache 2.x
    sends text/plain with a ISO-8859-1 charset. Debian's Apache
    version, just to be different, sends text/plain with iso-8859-1
    charset. For extra fun, FC7, RHEL4, and Ubuntu Feisty send
    charset=UTF-8. Don't do general case-insensitive comparison,
    since we really want to apply this crap as rarely as we can.

> I was hoping that no encoding parameter at all would be sent :/

Heh. I've long since given up all hope of reason on this stuff; I just try to keep it as sane and predictable and simple as possible. :(

-Boris
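
For illustration only, the exact-match rule that the Gecko comment describes could be sketched in Python roughly like this. The function name and the precise header strings (including the "; charset=" spelling and spacing) are assumptions reconstructed from the comment, not copied from the Gecko source.

```python
# Hypothetical sketch: the exact header strings below are an assumption
# based on the comment quoted in the message, not the real Gecko list.
APACHE_DEFAULT_CONTENT_TYPES = frozenset({
    "text/plain",                      # Apache 1.x, "unknown" type
    "text/plain; charset=ISO-8859-1",  # Apache 2.x default
    "text/plain; charset=iso-8859-1",  # Debian's Apache packages
    "text/plain; charset=UTF-8",       # FC7, RHEL4, Ubuntu Feisty
})


def should_sniff_text_or_binary(content_type_header: str) -> bool:
    """Return True only when the header is an exact, case-sensitive match.

    Anything else, including different casing or spacing, is treated as a
    deliberate choice by the server and is left alone, so the sniffing
    applies as rarely as possible.
    """
    return content_type_header in APACHE_DEFAULT_CONTENT_TYPES
```

With this sketch, "text/plain; charset=UTF-8" would trigger sniffing, while "text/plain; charset=utf-8" or "Text/Plain" would not, which is the point of avoiding a general case-insensitive comparison.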
Received on Tuesday, 7 September 2010 06:27:51 UTC