
Re: Charset usage data

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Thu, 22 May 2008 13:32:32 +0100
Message-ID: <483567E0.9040808@cam.ac.uk>
To: Ian Hickson <ian@hixie.ch>
CC: HTML WG <public-html@w3.org>

Ian Hickson wrote:
> Do browsers support charset="" being anything but the first parameter?


All the browsers I tested (IE6/7, FF2/3, Opera 9.2/9.5, Safari 3) support 
content="text/html; foo=bar; charset=...; baz=quux" (i.e. they act 
differently depending on the value of "...").
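
To illustrate what that behaviour amounts to, here is a minimal sketch (not any browser's actual parser; extract_charset is a hypothetical helper name) that finds the charset parameter wherever it sits in the parameter list, rather than only in first position:

```python
def extract_charset(content):
    """Return the charset parameter of a meta content value, or None.

    Hypothetical helper: checks every ';'-separated parameter, not just
    the first one after the media type.
    """
    # Everything before the first ';' is the media type, so skip it.
    for param in content.split(";")[1:]:
        name, sep, value = param.partition("=")
        if sep and name.strip().lower() == "charset":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'") or None
    return None


print(extract_charset("text/html; foo=bar; charset=utf-8; baz=quux"))  # utf-8
```

Real parsers deal with more edge cases (duplicate parameters, odd quoting, whitespace around "="), so treat this only as a picture of the position-independent lookup.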

>> Some charsets have a quite high level of invalid content. gb2312 is 
>> mentioned already in 
>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>. 
>> euc-jp and tis-620 seem high too - it'd be interesting if someone could 
>> see whether that was due to mislabelling, or using a superset encoding, 
>> or just broken content.
> I've added the TIS-620 to Windows-874 mapping. Does that mapping help your 
> numbers?

It helps my numbers, but that's because all byte sequences are valid in 
Windows-874 (or at least in ICU4J's implementation of it), so I can't 
tell whether it's actually correct or not.

http://philip.html5.org/data/charsets.html#charset-tis-620 lists the 
invalid TIS-620 pages I found, in case anybody wants to check how they 
ought to be interpreted.

>> The encoding sniffing algorithm works significantly better with 1024 
>> bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone 
>> cares, I could try a more detailed comparison to see if there's a 'good' 
>> value that could be suggested to UA developers, since the 512 bytes used 
>> as an example in the spec is not great.
> As far as I can tell, 512 bytes is the sweet spot after which you get 
> diminishing returns (you got 80% with 512, but doubling it only got you an 
> extra 10%).

But on the other hand, doubling it got a huge 50% decrease in false 
negatives :-)
(Seems like it's just a tradeoff that can be interpreted however you 
want, and I've got no idea what would be best in practice, and 512 
doesn't sound less reasonable than anything else.)
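
For anyone who wants to check the two readings of the same numbers, a small sketch of the arithmetic follows (miss_reduction is a hypothetical name; the inputs are the detection rates quoted in this thread, as whole percentages):

```python
def miss_reduction(found_before, found_after):
    """Relative decrease in pages whose charset was NOT found.

    Arguments are detection rates in percent (e.g. 82 for 82%).
    """
    missed_before = 100 - found_before
    missed_after = 100 - found_after
    return (missed_before - missed_after) / missed_before


# Rounded figures (80% -> 90%): misses halve.
print(round(miss_reduction(80, 90), 2))  # 0.5
# Measured figures (82% -> 92%): misses drop from 18% to 8%.
print(round(miss_reduction(82, 92), 2))  # 0.56
```

The same ten-point gain looks small as an absolute increase in hits and large as a relative decrease in misses, which is the tradeoff described above.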

Philip Taylor
Received on Thursday, 22 May 2008 12:33:13 UTC
