Re: Charset usage data

On May 22, 2008, at 4:32 PM, Philip Taylor wrote:

>>> The encoding sniffing algorithm works significantly better with  
>>> 1024 bytes (finds 92% of charsets) than with 512 (finds 82%). If  
>>> anyone cares, I could try a more detailed comparison to see if  
>>> there's a 'good' value that could be suggested to UA developers,  
>>> since the 512 bytes used as an example in the spec is not great.
>> As far as I can tell, 512 bytes is the sweet spot after which you  
>> get diminishing returns (you got 82% with 512, and doubling it only  
>> got you an extra 10 points).
>
> But on the other hand, doubling it got a huge 50% decrease in false  
> negatives :-)
> (Seems like it's just a tradeoff that can be interpreted however you  
> want; I've got no idea what would be best in practice, and 512  
> doesn't sound any less reasonable than anything else.)


FWIW, WebKit has just switched to checking 1024 bytes instead of 512  
(and we ignore charset declarations outside of HEAD past that boundary).
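Loosely, the kind of bounded prescan being discussed can be sketched as follows. This is a hypothetical simplification in Python, not WebKit's actual code; the function name and the regular expression are assumptions for illustration only:

```python
import re

# Hypothetical sketch of a bounded charset prescan, in the spirit of the
# discussion above -- NOT WebKit's actual implementation. Only the first
# `limit` bytes are examined, so a declaration past that boundary is ignored.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)

def sniff_charset(data, limit=1024):
    """Return the charset declared in the first `limit` bytes, or None."""
    match = META_CHARSET.search(data[:limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Under this sketch, raising `limit` from 512 to 1024 is exactly the tradeoff in the thread: a declaration sitting between byte 512 and byte 1024 is found with the larger limit and missed with the smaller one.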

- WBR, Alexey Proskuryakov

Received on Tuesday, 22 July 2008 11:45:29 UTC