Re: Charset usage data

On Wed, 5 Mar 2008, Philip Taylor wrote:
> 
> The meta content charset-extraction mechanism gets confused by some 
> legitimate code like:
> 
> http://www.modellbausieghard.de/ - <meta http-equiv="content-style-type" 
> content="text/css; charset=iso-8859-1" />

That's not legitimate. You can't set the encoding for the value of the 
style="" attribute.


> The algorithm for extracting encodings from content-types should be 
> changed to stop before a semicolon, since "windows-1252;" and "utf-8;" 
> show up occasionally. Ideally it would support any valid HTTP 
> Content-Type header, in particular where charset is not the first 
> parameter, though I didn't find any examples of pages like that.

Do browsers support charset="" being anything but the first parameter?

I've added ; to the list of characters that get skipped.
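
For illustration only, here is a rough Python sketch of an extraction step 
that stops at a semicolon. It is not the spec's algorithm text: the function 
name extract_charset and the regular expression are assumptions made up for 
this example, and unlike the change described above it also accepts charset 
when it isn't the first parameter.

```python
import re

# Hypothetical pattern: "charset", optional whitespace, "=", optional
# whitespace, then a double-quoted, single-quoted, or bare value.
# A bare value ends at whitespace or at a semicolon.
_CHARSET_RE = re.compile(
    r"""charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;]+))""",
    re.IGNORECASE,
)


def extract_charset(content):
    """Return the lowercased charset value from a content string, or None."""
    match = _CHARSET_RE.search(content)
    if match is None:
        return None
    # Exactly one of the three groups participated in the match.
    value = next(g for g in match.groups() if g is not None)
    return value.strip().lower() or None
```

With this sketch, "text/html; charset=utf-8;" yields "utf-8" rather than 
"utf-8;".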


> Some charsets have a quite high level of invalid content. gb2312 is 
> mentioned already in 
> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>. 
> euc-jp and tis-620 seem high too - it'd be interesting if someone could 
> see whether that was due to mislabelling, or using a superset encoding, 
> or just broken content.

I've added the TIS-620 to Windows-874 mapping. Does that mapping help your 
numbers?
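
As a sketch of what such a mapping amounts to (assuming Python, where cp874 
is the codec name for Windows-874; the table and function names here are 
hypothetical):

```python
# Assumed override table: a declared label on the left is decoded with the
# Python codec on the right. "cp874" is Python's name for Windows-874.
LABEL_OVERRIDES = {
    "tis-620": "cp874",
}


def effective_encoding(label):
    """Normalise a declared label and apply any override."""
    key = label.strip().lower()
    return LABEL_OVERRIDES.get(key, key)


def decode_with_label(data, label):
    """Decode bytes using the overridden encoding, replacing bad bytes."""
    return data.decode(effective_encoding(label), errors="replace")
```

Thai letters decode the same either way; the difference shows up in the 
0x80-0x9F range, where Windows-874 assigns printable characters such as the 
euro sign at 0x80.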

Should we convert iso-8859-11 (ISO superset of TIS-620) to Win874 too?

I haven't done anything with EUC-JP; I don't know what it could usefully be 
mapped to.


> The encoding sniffing algorithm works significantly better with 1024 
> bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone 
> cares, I could try a more detailed comparison to see if there's a 'good' 
> value that could be suggested to UA developers, since the 512 bytes used 
> as an example in the spec is not great.

As far as I can tell, 512 bytes is the sweet spot after which you get 
diminishing returns (you got 82% with 512, but doubling it only got you an 
extra 10 percentage points).
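
A more detailed comparison could be run along the following lines. This is 
only a sketch: the regular expression is a crude stand-in for the spec's 
sniffing algorithm, and sniff, hit_rate, and the list of page bytes are all 
hypothetical names introduced for this example.

```python
import re

# Crude stand-in for the real prescan: look for charset=... inside a <meta>
# tag. The actual algorithm in the spec is more involved than this.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([^\s\"';>]+)",
    re.IGNORECASE,
)


def sniff(data, limit):
    """Return the charset declared within the first `limit` bytes, or None."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()


def hit_rate(pages, limit):
    """Fraction of pages whose charset is found within `limit` bytes."""
    if not pages:
        return 0.0
    found = sum(1 for page in pages if sniff(page, limit) is not None)
    return found / len(pages)
```

Calling hit_rate over the same set of pages with limits such as 256, 512, 
768, 1024, and 2048 would show where the curve flattens out.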

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 22 May 2008 11:18:33 UTC