Re: Charset usage data

Ian Hickson wrote:
> Do browsers support charset="" being anything but the first parameter?

http://philip.html5.org/demos/html/charset-parsing/

Every browser I tested (IE6/7, FF2/3, Opera 9.2/9.5, Safari 3) supports 
content="text/html; foo=bar; charset=...; baz=quux" (i.e. they act 
differently depending on the value of "...").
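
As a rough illustration of what that test exercises, here is a minimal 
Python sketch that pulls the charset parameter out of a content value 
wherever it appears. The helper name extract_charset and the regular 
expression are assumptions made for this example; they are much simpler 
than what browsers or the spec actually do.

```python
import re

# Hypothetical pattern: matches "charset=value" at any position in the string.
# It is deliberately naive (it would also match inside "xcharset=...").
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([^\s;"\']+)', re.IGNORECASE)


def extract_charset(content):
    """Return the lowercased charset value, or None if no charset is present."""
    match = _CHARSET_RE.search(content)
    return match.group(1).lower() if match else None


# charset is the second of three parameters here, not the first.
print(extract_charset("text/html; foo=bar; charset=UTF-8; baz=quux"))
print(extract_charset("text/html; foo=bar"))
```

The point is only that the parameter's position should not matter; real 
parsers also have to deal with quoting and malformed values.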

>> Some charsets have a quite high level of invalid content. gb2312 is 
>> mentioned already in 
>> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>. 
>> euc-jp and tis-620 seem high too - it'd be interesting if someone could 
>> see whether that was due to mislabelling, or using a superset encoding, 
>> or just broken content.
> 
> I've added the TIS-620 to Windows-874 mapping. Does that mapping help your 
> numbers?

It helps my numbers, but that's because all byte sequences are valid in 
Windows-874 (or at least in ICU4J's implementation of it), so I can't 
tell whether it's actually correct or not.
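
To illustrate why such a check is uninformative: a strict decode can only 
flag errors when the encoding leaves some byte sequences undefined. The 
sketch below uses Python, with Latin-1 as a stand-in for an encoding that 
defines every byte. Python's codec tables are not ICU4J's, so this says 
nothing about Windows-874 itself, and is_valid is a made-up helper name.

```python
def is_valid(data, encoding):
    """Return True if data decodes in the given encoding without errors."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


every_byte = bytes(range(256))

# Latin-1 assigns a character to all 256 byte values, so nothing is ever
# reported as invalid, whether or not the label is right.
print(is_valid(every_byte, "latin-1"))  # True

# UTF-8 rejects many byte sequences, so a failure here carries information.
print(is_valid(every_byte, "utf-8"))  # False
```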

http://philip.html5.org/data/charsets.html#charset-tis-620 lists the 
invalid TIS-620 pages I found, in case anybody wants to check how they 
ought to be interpreted.

>> The encoding sniffing algorithm works significantly better with 1024 
>> bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone 
>> cares, I could try a more detailed comparison to see if there's a 'good' 
>> value that could be suggested to UA developers, since the 512 bytes used 
>> as an example in the spec is not great.
> 
> As far as I can tell, 512 bytes is the sweet spot after which you get 
> diminishing returns (you got 80% with 512, but doubling it only got you an 
> extra 10%).

But on the other hand, doubling it roughly halved the false negatives 
(from 18% of charsets missed down to 8%) :-)
(It seems like a tradeoff that can be interpreted however you 
want. I've got no idea what would be best in practice, and 512 
doesn't sound less reasonable than anything else.)
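
For reference, the arithmetic behind both readings of the same numbers, as 
a quick Python sketch. The percentages are the ones quoted in this thread; 
the function name compare is only for illustration.

```python
def compare(found_before, found_after):
    """Return (percentage-point gain in hits, relative drop in misses)."""
    missed_before = 100 - found_before
    missed_after = 100 - found_after
    gain_points = found_after - found_before
    miss_drop = (missed_before - missed_after) / missed_before
    return gain_points, miss_drop


# Rounded figures from the reply above: 80% at 512 bytes, 90% at 1024.
print(compare(80, 90))  # (10, 0.5)

# Measured figures: 82% at 512 bytes, 92% at 1024.
gain, drop = compare(82, 92)
print(gain, round(drop, 2))  # 10 0.56
```

Both views describe the same data: ten more percentage points of charsets 
found, or about half as many missed.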

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Thursday, 22 May 2008 12:33:13 UTC