[Bug 28156] New: Separate GBK and GB18030 even for decoding (toUnicode) from bugzilla@jessica.w3.org on 2015-03-06 (www-international@w3.org from January to March 2015)

From: <bugzilla@jessica.w3.org>
Date: Fri, 06 Mar 2015 18:48:33 +0000
To: www-international@w3.org
Message-ID: <bug-28156-4285@http.www.w3.org/Bugs/Public/>

https://www.w3.org/Bugs/Public/show_bug.cgi?id=28156

            Bug ID: 28156
           Summary: Separate GBK and GB18030 even for decoding (toUnicode)
           Product: WHATWG
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Encoding
          Assignee: annevk@annevk.nl
          Reporter: jshin@chromium.org
        QA Contact: sideshowbarker+encodingspec@gmail.com
                CC: mike@w3.org, www-international@w3.org

After bug 27235, GBK and GB18030 are distinct when encoding (fromUnicode). 

I guess the rationale for treating GBK and GB18030 identically when decoding
(toUnicode) is that there is a (significant) number of pages that are actually
in GB18030 but are mislabelled as GBK. 
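
To make the distinction concrete, here is a minimal sketch of what a strictly
separate GBK decoder does with a page that is really GB18030. It uses Python's
built-in "gbk" and "gb18030" codecs purely as stand-ins (an assumption for
illustration; they are not the decoders defined by the Encoding Standard or
shipped in any browser), and try_decode is a hypothetical helper name.

```python
from typing import Optional

# Illustration only: Python's "gbk" and "gb18030" codecs stand in for two
# separately implemented decoders. They are not the Encoding Standard's.

def try_decode(data: bytes, label: str) -> Optional[str]:
    """Return the decoded text, or None if `data` is invalid for `label`."""
    try:
        return data.decode(label)
    except UnicodeDecodeError:
        return None

# A two-byte sequence that both encodings share.
two_byte = "中".encode("gbk")

# A four-byte sequence that exists only in GB18030. Its second byte is in the
# range 0x30-0x39, which is never a valid GBK trail byte.
four_byte = "\U0001F600".encode("gb18030")

for name, data in (("two-byte", two_byte), ("four-byte", four_byte)):
    for label in ("gbk", "gb18030"):
        # ascii() keeps the output printable on any terminal encoding.
        print(name, label, ascii(try_decode(data, label)))
```

With separate decoders, the mislabelled four-byte content fails under the GBK
label, while a shared (GB18030-based) decoder would accept it. How often that
happens on real pages is exactly the open question here.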

I wonder if there are any statistics collected for that. I'm curious to know what
percentage of documents labelled as GBK are actually in GB18030. My suspicion
is that it's pretty low especially compared with 'ISO-8859-1 vs windows-1252',
'EUC-KR vs windows-949' (because it's so prevalent that the spec's EUC-KR is
actually windows-949, which I fully support), 'TIS 620 : ISO-8859-11 :
windows-874', and so forth. 

I'm raising this issue because 1) Blink, WebKit, Firefox (and I guess, IE, too)
have treated the two encodings separately, and 2) Blink needs to add extra code
to treat GBK/GB18030 as specified in the current spec. 

I believe that it's doable (I thought about how to do that yesterday), but I'm
not convinced that it's worth the effort / extra code.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

Received on Friday, 6 March 2015 18:48:35 UTC