- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 22 May 2008 11:17:54 +0000 (UTC)
- To: Philip Taylor <pjt47@cam.ac.uk>
- Cc: HTML WG <public-html@w3.org>
On Wed, 5 Mar 2008, Philip Taylor wrote:
>
> The meta content charset-extraction mechanism gets confused by some
> legitimate code like:
>
> http://www.modellbausieghard.de/ - <meta http-equiv="content-style-type"
> content="text/css; charset=iso-8859-1" />

That's not legitimate. You can't set the encoding for the value of the
style="" attribute.

> The algorithm for extracting encodings from content-types should be
> changed to stop before a semicolon, since "windows-1252;" and "utf-8;"
> show up occasionally. Ideally it would support any valid HTTP
> Content-Type header, in particular where charset is not the first
> parameter, though I didn't find any examples of pages like that.

Do browsers support charset="" being anything but the first parameter?

I've added ; to the list of characters that get skipped.

> Some charsets have a quite high level of invalid content. gb2312 is
> mentioned already in
> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2008-March/014127.html>.
> euc-jp and tis-620 seem high too - it'd be interesting if someone could
> see whether that was due to mislabelling, or using a superset encoding,
> or just broken content.

I've added the TIS-620 to Windows-874 mapping. Does that mapping help your
numbers? Should we convert iso-8859-11 (ISO superset of TIS-620) to Win874
too?

I haven't done anything with EUC-JP; I don't know what it could usefully
be mapped to.

> The encoding sniffing algorithm works significantly better with 1024
> bytes (finds 92% of charsets) than with 512 (finds 82%). If anyone
> cares, I could try a more detailed comparison to see if there's a 'good'
> value that could be suggested to UA developers, since the 512 bytes used
> as an example in the spec is not great.

As far as I can tell, 512 bytes is the sweet spot after which you get
diminishing returns (you got 80% with 512, but doubling it only got you an
extra 10%).

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
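
For illustration only, here is a rough Python sketch of the kind of
extraction discussed above: find "charset" anywhere in a meta content
value, stop the unquoted label at whitespace or a semicolon, and apply
the TIS-620 to Windows-874 mapping. It is not the spec's algorithm text.
The names extract_charset and LABEL_OVERRIDES are made up for this
example, and the caller is assumed to have already checked that the
http-equiv value is content-type.

```python
from typing import Optional

# Hypothetical override table; only the mapping mentioned in this message.
LABEL_OVERRIDES = {"tis-620": "windows-874"}

WHITESPACE = " \t\n\f\r"


def extract_charset(content: str) -> Optional[str]:
    """Return a lowercased charset label from a meta content value, or None."""
    lower = content.lower()
    pos = 0
    while True:
        idx = lower.find("charset", pos)
        if idx == -1:
            return None
        i = idx + len("charset")
        # Allow whitespace between "charset" and "=".
        while i < len(content) and content[i] in WHITESPACE:
            i += 1
        if i < len(content) and content[i] == "=":
            i += 1
            break
        # Not followed by "=": keep looking further along the string.
        pos = idx + 1

    # Allow whitespace after "=".
    while i < len(content) and content[i] in WHITESPACE:
        i += 1
    if i >= len(content):
        return None

    if content[i] in "\"'":
        # Quoted label: take everything up to the matching quote.
        end = content.find(content[i], i + 1)
        if end == -1:
            return None
        label = content[i + 1:end]
    else:
        # Unquoted label: stop at whitespace or a semicolon.
        end = i
        while end < len(content) and content[end] not in WHITESPACE + ";":
            end += 1
        label = content[i:end]

    label = label.strip().lower()
    if not label:
        return None
    return LABEL_OVERRIDES.get(label, label)


if __name__ == "__main__":
    print(extract_charset("text/html; charset=utf-8;"))
    print(extract_charset("text/html; charset=TIS-620"))
```

Because the search scans the whole string, this sketch also handles
charset appearing after other parameters, whether or not browsers do.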
Received on Thursday, 22 May 2008 11:18:33 UTC