- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Wed, 05 Mar 2008 17:52:53 +0000
- To: HTML WG <public-html@w3.org>
I've got some data about charsets at http://philip.html5.org/data/charsets.html (based on 125K pages from dmoz.org). Some random comments:

The meta content charset-extraction mechanism gets confused by some legitimate code like:

  http://www.modellbausieghard.de/ -
  <meta http-equiv="content-style-type" content="text/css; charset=iso-8859-1" />

But this is pretty rare, and it's much more common for http-equiv="content-type" to suffer from typos, so it seems more reliable to ignore the value of http-equiv (which is what HTML5 currently does). People can always send the charset in HTTP Content-Type if they don't want UAs to possibly-erroneously parse their meta tags. Maybe it's worth warning authors about this.

The algorithm for extracting encodings from content-types should be changed to stop before a semicolon, since "windows-1252;" and "utf-8;" show up occasionally. Ideally it would support any valid HTTP Content-Type header, in particular where charset is not the first parameter, though I didn't find any examples of pages like that. (A rough sketch of what I mean is at the end of this message.)

Some charsets have a quite high level of invalid content. gb2312 is mentioned already in <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>. euc-jp and tis-620 seem high too - it'd be interesting if someone could see whether that was due to mislabelling, or using a superset encoding, or just broken content.

The encoding sniffing algorithm works significantly better with 1024 bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone cares, I could try a more detailed comparison to see if there's a 'good' value that could be suggested to UA developers, since the 512 bytes used as an example in the spec is not great. (A sketch of how the two limits could be compared is also at the end of this message.)

The encoding sniffing algorithm doesn't get many false positives. I've not looked in any detail, but the first several seem to have a charset declaration inside a <style> or inside an unclosed attribute string, so the pages are already pretty broken and the encoding sniffer doesn't stand much of a chance. http://jellybelly.com/International/Japanese/home.html is interesting as an example that fails (gets interpreted as something like iso-8859-1) in Opera and WebKit. I think HTML5 interprets it correctly.

The charset "none" is reasonably popular, and came solely from Apache servers (whereas only 67% of all 125K pages were Apache). I guess people have been doing "AddDefaultCharset none" instead of "AddDefaultCharset Off".

For some encodings, like shift_jis and windows-1250, people much prefer to use HTML (<meta>) rather than HTTP (Content-Type). For some others, like windows-1251 and iso-8859-15, HTTP is used much more often. I have no idea why there's such a difference.

-- 
Philip Taylor
pjt47@cam.ac.uk
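To make the semicolon point concrete, here's a minimal Python sketch. It is only an illustration of the intended behaviour (stop the charset value before a semicolon, quote or whitespace), not the spec's actual extraction algorithm, and the function name is made up:

  import re

  def extract_charset(content):
      # Rough sketch: pull a charset name out of a value like
      # "text/html; charset=windows-1252;". The captured value ends at the
      # first quote, whitespace or semicolon, so trailing junk like
      # "windows-1252;" still yields "windows-1252".
      match = re.search(r'''charset\s*=\s*["']?([^"'\s;]+)''',
                        content, re.IGNORECASE)
      return match.group(1) if match else None

  # e.g. extract_charset('text/css; charset=iso-8859-1') == 'iso-8859-1'
  #      extract_charset('text/html; charset=utf-8;')    == 'utf-8'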
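And a sketch of how the 512-byte vs 1024-byte comparison could be run - again just an illustration of the methodology, reusing the made-up helper above and a hypothetical `pages` list standing in for the raw page bodies; a real UA prescan operates on bytes, so decoding as iso-8859-1 here is only a convenience for searching:

  def charset_from_prefix(raw_bytes, limit):
      # Look for a meta charset declaration in only the first `limit` bytes,
      # roughly what a UA prescanning the start of the byte stream would see.
      head = raw_bytes[:limit].decode('iso-8859-1', 'replace')
      return extract_charset(head)

  # Hypothetical usage over a list of raw page bodies:
  #   found_512  = sum(charset_from_prefix(p, 512)  is not None for p in pages)
  #   found_1024 = sum(charset_from_prefix(p, 1024) is not None for p in pages)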
Received on Wednesday, 5 March 2008 17:53:17 UTC