
Charset usage data

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Wed, 05 Mar 2008 17:52:53 +0000
Message-ID: <47CEDDF5.6090100@cam.ac.uk>
To: HTML WG <public-html@w3.org>

I've got some data about charsets at 
http://philip.html5.org/data/charsets.html (based on 125K pages from 
dmoz.org).

Some random comments:


The meta content charset-extraction mechanism gets confused by some 
legitimate code like:

http://www.modellbausieghard.de/ - <meta http-equiv="content-style-type" 
content="text/css; charset=iso-8859-1" />

But this is pretty rare, and it's much more common for 
http-equiv="content-type" to suffer from typos, so it seems more 
reliable to ignore the value of http-equiv (which is what HTML5 
currently does). People can always send the charset in the HTTP 
Content-Type header if they don't want UAs to risk erroneously parsing 
their meta tags. Maybe it's worth warning authors about this.
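To illustrate that precedence (a rough sketch, not the spec's algorithm 
- the function name and the use of Python's stdlib MIME parser are my 
own choices here), a UA that gets a charset from the real HTTP 
Content-Type header never needs to consult the meta prescan at all:

```python
# Sketch: pull the charset parameter out of an HTTP Content-Type header
# value. If this returns a charset, the <meta> prescan is skipped.
from email.message import Message

def charset_from_http_header(value):
    # Reuse the stdlib MIME header parser rather than splitting by hand.
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # None if no charset parameter
```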


The algorithm for extracting encodings from content-types should be 
changed to stop before a semicolon, since "windows-1252;" and "utf-8;" 
show up occasionally. Ideally it would support any valid HTTP 
Content-Type header, in particular where charset is not the first 
parameter, though I didn't find any examples of pages like that.
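Something like the following sketch (my own rough approximation, not 
the spec text) captures the proposed behaviour - stop before a 
semicolon, and accept charset even when it isn't the first parameter:

```python
# Sketch of extracting a charset from a content="..." value, tolerating
# a trailing semicolon and a charset that is not the first parameter,
# e.g. "text/html; foo=bar; charset=utf-8;".
def extract_charset(content):
    pos = content.lower().find("charset")
    if pos == -1:
        return None
    rest = content[pos + len("charset"):].lstrip()
    if not rest.startswith("="):
        return None
    rest = rest[1:].lstrip()
    if rest[:1] in ("'", '"'):
        # Quoted value: take everything up to the matching quote.
        end = rest.find(rest[0], 1)
        return rest[1:end] if end != -1 else None
    # Unquoted value: stop before a semicolon, so "utf-8;" yields "utf-8".
    charset = rest.split(";", 1)[0].strip()
    return charset or None
```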


Some charsets have quite a high level of invalid content. gb2312 is 
already mentioned in 
<http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>. 
euc-jp and tis-620 seem high too - it'd be interesting if someone could 
see whether that's due to mislabelling, or to use of a superset 
encoding, or just broken content.


The encoding sniffing algorithm works significantly better with 1024 
bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone
cares, I could try a more detailed comparison to see if there's a 'good' 
value that could be suggested to UA developers, since the 512 bytes used 
as an example in the spec is not great.
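For concreteness, the comparison above amounts to varying the byte 
limit in something like this toy prescan (heavily simplified from the 
spec's algorithm; the regex is only a loose approximation):

```python
# Toy prescan: look for a charset declaration only within the first
# `nbytes` of the raw document, mirroring the 512- vs 1024-byte
# comparison. Real UAs follow the spec's full prescan algorithm.
import re

META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)

def prescan(data, nbytes=1024):
    match = META_CHARSET.search(data[:nbytes])
    return match.group(1).decode("ascii", "replace").lower() if match else None
```

A declaration sitting between byte 512 and byte 1024 - e.g. after a 
long comment or a pile of <link> elements - is exactly the case where 
the larger limit wins.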


The encoding sniffing algorithm doesn't get many false positives. I've 
not looked in any detail, but the first several seem to have a charset 
declaration inside a <style> or inside an unclosed attribute string, so 
the pages are already pretty broken and the encoding sniffer doesn't 
stand much of a chance.


http://jellybelly.com/International/Japanese/home.html is interesting as 
an example that fails (gets interpreted as something like iso-8859-1) in 
Opera and WebKit. I think HTML5 interprets it correctly.


The charset "none" is reasonably popular, and came solely from Apache 
servers (whereas only 67% of all 125K pages were Apache). I guess people 
have been doing "AddDefaultCharset none" instead of "AddDefaultCharset Off".
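The directive's actual syntax is AddDefaultCharset On|Off|charset, so 
"none" isn't a keyword - it's taken literally as a charset name and 
sent on the wire:

```apache
# What people apparently wrote - sends "Content-Type: text/html; charset=none":
AddDefaultCharset none

# What they presumably meant - sends no default charset parameter at all:
AddDefaultCharset Off
```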


For some encodings, like shift_jis and windows-1250, people much prefer 
to use HTML (<meta>) over HTTP (Content-Type). For some others, like 
windows-1251 and iso-8859-15, HTTP is used much more often. I have no
idea why there's such a difference.


-- 
Philip Taylor
pjt47@cam.ac.uk
Received on Wednesday, 5 March 2008 17:53:17 UTC
