
Re: Charset usage data

From: Alexey Proskuryakov <ap@webkit.org>
Date: Tue, 22 Jul 2008 15:44:18 +0400
Cc: Ian Hickson <ian@hixie.ch>, HTML WG <public-html@w3.org>
Message-Id: <5DE90963-8D2B-43F0-91D1-9AABCF8C1D4B@webkit.org>
To: Philip Taylor <pjt47@cam.ac.uk>

On May 22, 2008, at 4:32 PM, Philip Taylor wrote:

>>> The encoding sniffing algorithm works significantly better with
>>> 1024 bytes (finds 92% of charsets) than with 512 (finds 82%). If
>>> anyone cares, I could try a more detailed comparison to see if
>>> there's a 'good' value that could be suggested to UA developers,
>>> since the 512 bytes used as an example in the spec is not great.
>> As far as I can tell, 512 bytes is the sweet spot after which you  
>> get diminishing returns (you got 80% with 512, but doubling it only  
>> got you an extra 10%).
> But on the other hand, doubling it got a huge 50% decrease in false  
> negatives :-)
> (Seems like it's just a tradeoff that can be interpreted however you  
> want, and I've got no idea what would be best in practice, and 512  
> doesn't sound less reasonable than anything else.)
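
(Spelling out the arithmetic in the figures quoted above: an 82% detection rate at 512 bytes means 18% of declared charsets are missed, versus 8% missed at 1024 bytes, so doubling the window cuts misses by (18 - 8) / 18, roughly 55%.)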

FWIW, WebKit has just switched to checking 1024 bytes instead of 512  
(and we ignore charset declarations outside of HEAD past that boundary).
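
For illustration only (this is not WebKit's code), here is a minimal Python sketch of a bounded prescan of this kind: look for a <meta ... charset=...> declaration within the first 1024 bytes and ignore anything past that boundary. The function name, constant, and regex are mine, and the regex is far simpler than the real prescan algorithm.

import re
from typing import Optional

PRESCAN_BYTES = 1024  # the boundary discussed above; 512 was the old value

# Simplified pattern: matches charset=... inside a <meta> tag, whether it
# appears as <meta charset="..."> or inside a Content-Type "content" value.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE)

def prescan_for_charset(data: bytes) -> Optional[str]:
    """Return a charset declared within the first PRESCAN_BYTES bytes, else None."""
    match = META_CHARSET.search(data[:PRESCAN_BYTES])
    return match.group(1).decode('ascii', 'replace') if match else None

# e.g. prescan_for_charset(b'<html><head><meta charset=koi8-r>') == 'koi8-r'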

- WBR, Alexey Proskuryakov
Received on Tuesday, 22 July 2008 11:45:29 UTC
