[whatwg] A comment to character encoding declaration

From: Philip Taylor <excors+whatwg@gmail.com>
Date: Wed, 5 Mar 2008 14:36:47 +0000
Message-ID: <ea09c0d10803050636x5552d6bqdbdf5c057d998a75@mail.gmail.com>

On 03/03/2008, Jjgod Jiang <gzjjgod at gmail.com> wrote:
>  During the development of CJK information processing, many
>  text encodings are just strict subsets of others; for
>  example, GB2312 is a subset of GBK, and GBK is a subset of
>  GB18030. For compatibility purposes, a lot of web pages use
>  a character encoding declaration like this:
>  <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
>  in their header, yet they might use characters that are in GBK
>  but not in GB2312. So I think we can suggest that clients simply
>  treat encodings like these as their largest superset, for
>  instance, treat GB2312 as GB18030.

Out of 130K pages from dmoz.org, I see 760 that are declared as
gb2312 (by HTTP Content-Type, <meta content>, etc.).

Of those 760, 120 cause decoding errors in ICU4J when treated as
gb2312. 8 cause errors when treated as gbk, and the same 8 cause
errors as gb18030.
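
For anyone who wants to try a similar check, here is a minimal sketch of
the idea in Python. It is not the ICU4J setup used for the numbers above:
it relies on Python's built-in codecs, whose tables may differ from ICU's,
and the function name and sample string are only illustrative assumptions.

```python
# Candidate codecs, ordered from the smallest repertoire to the largest.
CANDIDATES = ("gb2312", "gbk", "gb18030")


def strict_decode_results(data: bytes) -> dict:
    """Map each candidate codec to True if `data` decodes without errors."""
    results = {}
    for name in CANDIDATES:
        try:
            # errors="strict" raises instead of silently replacing bad bytes.
            data.decode(name, errors="strict")
            results[name] = True
        except UnicodeDecodeError:
            results[name] = False
    return results


# Hypothetical page content: U+9555 (the middle character) is in GBK
# but not in GB2312, so a page "declared as gb2312" containing it
# fails a strict gb2312 decode while decoding fine as gbk or gb18030.
sample = "朱镕基".encode("gbk")
print(strict_decode_results(sample))
```

Running this over the raw bytes of each page, keyed by its declared
charset, would give per-encoding error counts comparable in spirit to
the ones above.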

Those 8 are:
and I haven't tried working out why they are causing errors.

The 120 are listed at
<http://philip.html5.org/data/gb2312-errors.txt>. I don't know how
many are really using gb18030, and how many are not actually gb* but
happen to be decoded without errors because they use compatible byte
sequences; but it does look like gb2312 is a fairly significant
problem if it's not treated as gbk/gb18030, so it would be helpful to
suggest/require it to be processed specially.
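
As a rough illustration of what "processed specially" could mean, the
sketch below maps a declared label to its largest superset before
decoding. The table and the function name are hypothetical, not taken
from any spec or browser, and Python's codecs stand in for whatever
decoder a client actually uses.

```python
# Hypothetical override table: declared label -> codec actually used.
SUPERSET_OVERRIDES = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
}


def decode_with_superset(data: bytes, declared: str) -> str:
    """Decode `data`, widening the declared charset to its superset."""
    label = declared.strip().lower()
    codec = SUPERSET_OVERRIDES.get(label, label)
    # errors="replace" keeps going on bad bytes, as a browser would.
    return data.decode(codec, errors="replace")


# A page declared as gb2312 that actually contains a GBK-only character.
page_bytes = "朱镕基".encode("gbk")
print(decode_with_superset(page_bytes, "GB2312"))
```

With the override, the GBK-only character survives even though the page
claims to be gb2312; without it, a strict gb2312 decoder would reject
those bytes.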

Philip Taylor
excors at gmail.com
Received on Wednesday, 5 March 2008 06:36:47 UTC
