- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Thu, 22 May 2008 13:32:32 +0100
- To: Ian Hickson <ian@hixie.ch>
- CC: HTML WG <public-html@w3.org>
Ian Hickson wrote:
> Do browsers support charset="" being anything but the first parameter?

http://philip.html5.org/demos/html/charset-parsing/

All I tested (IE6/7, FF2/3, Opera 9.2/9.5, Safari 3) support
content="text/html; foo=bar; charset=...; baz=quux" (i.e. act differently
depending on the value of "...")

>> Some charsets have a quite high level of invalid content. gb2312 is
>> mentioned already in
>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>.
>> euc-jp and tis-620 seem high too - it'd be interesting if someone could
>> see whether that was due to mislabelling, or using a superset encoding,
>> or just broken content.
>
> I've added the TIS-620 to Windows-874 mapping. Does that mapping help your
> numbers?

It helps my numbers, but that's because all byte sequences are valid in
Windows-874 (or at least in ICU4J's implementation of it), so I can't
tell whether it's actually correct or not.
http://philip.html5.org/data/charsets.html#charset-tis-620 lists the
invalid TIS-620 pages I found, in case anybody wants to check how they
ought to be interpreted.

>> The encoding sniffing algorithm works significantly better with 1024
>> bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone
>> cares, I could try a more detailed comparison to see if there's a 'good'
>> value that could be suggested to UA developers, since the 512 bytes used
>> as an example in the spec is not great.
>
> As far as I can tell, 512 bytes is the sweet spot after which you get
> diminishing returns (you got 80% with 512, but doubling it only got you an
> extra 10%).

But on the other hand, doubling it got a huge 50% decrease in false
negatives :-)

(Seems like it's just a tradeoff that can be interpreted however you
want, and I've got no idea what would be best in practice, and 512
doesn't sound less reasonable than anything else.)

-- 
Philip Taylor
pjt47@cam.ac.uk
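
For illustration only, the behaviour observed in the charset-parsing test above (a charset parameter being honoured even when it is not the first parameter) could be sketched roughly as follows. This is a simplified sketch, not the spec's algorithm and not any browser's actual code; the function name extract_charset is hypothetical, and the sketch assumes parameters are separated by ";" and ignores the quoting and whitespace edge cases that real parsers handle.

```python
def extract_charset(content):
    """Return the lowercased charset parameter of a meta content value, or None.

    Hypothetical helper: scans every parameter after the media type, so the
    position of charset among the parameters does not matter.
    """
    # The first ";"-separated piece is the media type (e.g. "text/html").
    for param in content.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop surrounding whitespace and simple quotes around the value.
            value = value.strip().strip("\"'")
            return value.lower() or None
    return None


if __name__ == "__main__":
    print(extract_charset("text/html; foo=bar; charset=utf-8; baz=quux"))
```

Splitting on ";" keeps the sketch short; it would misparse a quoted value that itself contains ";", which is one reason it should not be read as a description of what the tested browsers do.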
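
The two readings of the 512 versus 1024 byte numbers above can be put side by side with a little arithmetic. The sketch below uses only the 82% and 92% figures quoted in the thread; the function name relative_miss_reduction is made up for this example.

```python
def relative_miss_reduction(found_before, found_after):
    """Fraction by which the miss rate shrinks, given two hit rates in percent."""
    missed_before = 100 - found_before
    missed_after = 100 - found_after
    return (missed_before - missed_after) / missed_before


if __name__ == "__main__":
    found_512, found_1024 = 82, 92  # percentages quoted in the thread
    # Reading 1: absolute gain in charsets found (percentage points).
    print("extra found:", found_1024 - found_512, "points")
    # Reading 2: relative drop in charsets missed (false negatives).
    print("miss reduction: {:.1%}".format(relative_miss_reduction(found_512, found_1024)))
```

With these inputs the miss rate goes from 18% to 8%, so the same measurement reads either as a gain of 10 points or as a bit more than half of the false negatives removed, which is the tradeoff described above.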
Received on Thursday, 22 May 2008 12:33:13 UTC