- From: Philip Taylor <excors+whatwg@gmail.com>
- Date: Wed, 5 Mar 2008 14:36:47 +0000
On 03/03/2008, Jjgod Jiang <gzjjgod at gmail.com> wrote:
> During the development of CJK information processing, many
> text encodings is just a strict subset of another one, for
> example, GB2312 is a subset of GBK, GBK is a subset of
> GB18030. For compatibility purpose, a lot of web pages used
> character encoding declaration like this:
>
> <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
>
> in their header, yet they might use characters in GBK but
> not in GB2312. So, I think we can suggest clients to simply
> treat encodings like these as their biggest superset, for
> instance, treat GB2312 as GB18030.

Out of 130K pages from dmoz.org, I see 760 which are declared as
gb2312 (by HTTP Content-Type, <meta content>, etc). Of those 760, 120
cause decoding errors in ICU4J when treated as gb2312. 8 cause errors
when treated as gbk, and the same 8 cause errors as gb18030. Those 8
are:

http://www.bigm.com.cn/dinosaur/anecdote/
http://www.ccpc.edu.cn
http://www.gdoverseaschn.com.cn/
http://www.jgbr.com.cn
http://www.liechebuluo.com
http://www.netbro.com.cn
http://www.tkdts.com
http://www.wuxi-accp.com/

and I haven't tried working out why they are causing errors.

The 120 are listed at <http://philip.html5.org/data/gb2312-errors.txt>.
I don't know how many are really using gb18030, and how many are not
actually gb* but happen to be decoded without errors because they use
compatible byte sequences; but it does look like gb2312 is a fairly
significant problem if it's not treated as gbk/gb18030, so it would be
helpful to suggest/require it to be processed specially.

-- 
Philip Taylor
excors at gmail.com
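
For illustration only, here is a minimal Python sketch of the "treat gb2312 as its superset" idea discussed above. It is not the ICU4J setup behind the numbers in this message; it uses Python's built-in codecs, and the names SUPERSET, effective_encoding and decodes_cleanly are hypothetical helpers made up for this example. The sample character U+9555 is assumed to be a typical case of a character present in GBK but absent from GB2312.

```python
# Hypothetical mapping from a declared charset label to the decoder
# a client would actually use (an assumption for this sketch).
SUPERSET = {"gb2312": "gb18030", "gbk": "gb18030"}


def effective_encoding(declared: str) -> str:
    """Return the encoding to decode with, given the declared label."""
    label = declared.strip().lower()
    return SUPERSET.get(label, label)


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """True if the bytes decode under strict error handling."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# U+9555 is in GBK but not in GB2312, so these bytes mimic a page
# that declares gb2312 while actually using GBK characters.
sample = "\u9555".encode("gbk")

for enc in ("gb2312", "gbk", effective_encoding("gb2312")):
    print(enc, decodes_cleanly(sample, enc))
```

A strict gb2312 decoder rejects the sample bytes, while gbk and gb18030 accept them, which is the kind of failure the 120 pages above would trigger.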
Received on Wednesday, 5 March 2008 06:36:47 UTC