
Re: Charset usage data

From: Alexey Proskuryakov <ap@webkit.org>
Date: Tue, 22 Jul 2008 15:44:18 +0400
Cc: Ian Hickson <ian@hixie.ch>, HTML WG <public-html@w3.org>
Message-Id: <5DE90963-8D2B-43F0-91D1-9AABCF8C1D4B@webkit.org>
To: Philip Taylor <pjt47@cam.ac.uk>

On May 22, 2008, at 4:32 PM, Philip Taylor wrote:

>>> The encoding sniffing algorithm works significantly better with
>>> 1024 bytes (finds 92% of charsets) than with 512 (finds 82%). If
>>> anyone cares, I could try a more detailed comparison to see if
>>> there's a 'good' value that could be suggested to UA developers,
>>> since the 512 bytes used as an example in the spec is not great.
>> As far as I can tell, 512 bytes is the sweet spot after which you  
>> get diminishing returns (you got 80% with 512, but doubling it only  
>> got you an extra 10%).
> But on the other hand, doubling it got a huge 50% decrease in false  
> negatives :-)
> (Seems like it's just a tradeoff that can be interpreted however you  
> want, and I've got no idea what would be best in practice, and 512  
> doesn't sound less reasonable than anything else.)
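
(Spelling out the arithmetic in the figures quoted above: an 82% detection rate at 512 bytes means 18% of declared charsets are missed, versus 8% missed at 1024 bytes, so doubling the window cuts misses by (18 - 8) / 18, roughly 55%.)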

FWIW, WebKit has just switched to checking 1024 bytes instead of 512  
(and we ignore charset declarations outside of HEAD past that boundary).
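
For illustration only (this is not WebKit's code), here is a minimal Python sketch of a bounded prescan of this kind: look for a <meta ... charset=...> declaration within the first 1024 bytes and ignore anything past that boundary. The function name, constant, and regex are mine, and the regex is far simpler than the real prescan algorithm.

import re
from typing import Optional

PRESCAN_BYTES = 1024  # the boundary discussed above; 512 was the old value

# Simplified pattern: matches charset=... inside a <meta> tag, whether it
# appears as <meta charset="..."> or inside a Content-Type "content" value.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE)

def prescan_for_charset(data: bytes) -> Optional[str]:
    """Return a charset declared within the first PRESCAN_BYTES bytes, else None."""
    match = META_CHARSET.search(data[:PRESCAN_BYTES])
    return match.group(1).decode('ascii', 'replace') if match else None

# e.g. prescan_for_charset(b'<html><head><meta charset=koi8-r>') == 'koi8-r'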

- WBR, Alexey Proskuryakov
Received on Tuesday, 22 July 2008 11:45:29 UTC
