[whatwg] Encoding: big5 and big5-hkscs from Philip Jägenstedt on 2012-04-12 (public-whatwg-archive@w3.org from April 2012)

From: Philip Jägenstedt <philipj@opera.com>
Date: Thu, 12 Apr 2012 09:26:51 +0200
Message-ID: <op.wcm5m1t5sr6mfa@kirk>
On Mon, 09 Apr 2012 03:08:20 +0200, Øistein E. Andersen <liszt at coq.no>  
wrote:

> On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
>
>> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no>  
>> wrote:

>>> Suggested change:  map C6CD to U+5E7A.
>>
>> These are the existing mappings:
>>
>> C6CD =>
>> opera-hk: U+2F33 ⼳
>> firefox: U+5E7A 幺
>> chrome: U+F6DD ?
>> firefox-hk: U+5E7A 幺
>> opera: U+2F33 ⼳
>> chrome-hk: U+2F33 ⼳
>> internetexplorer: U+F6DD ?
>> hkscs-2008: <U+2F33> ⼳
>>
>> At least on the Web, this isn't a question of HK vs non-HK mappings.  
>> Other than Firefox, which (de-facto) specs or implementations use  
>> U+5E7A?
>
> I have now had a closer look at my notes  
> (<http://coq.no/character-tables/chinese-traditional/en>). My argument  
> for U+5E7A goes as follows:
>
> Of the 214 Kangxi radicals, 186 appear (as normal Han character) in CNS  
> 11643 Planes 1 or 2, whereas 25 appear in Plane 3 and 3 are missing  
> altogether.  Big5 only covers Planes 1 and 2, which means that 28 Kangxi  
> radicals (which may be rare in running text, but are nevertheless  
> important) are missing.  The E-Ten 1 extension encodes 25 of the missing  
> radicals in the range C6BF--C6D7.  Unlike CNS 11643 and Unicode, Big5  
> does not encode radicals twice (as radicals and normal characters).   
> This means that Big5 with the E-Ten 1 extension contains 211 of the 214  
> Kangxi radicals, all mapped to normal Han characters, and no codepoints  
> mapped to Unicode Kangxi Radicals in the range U+2F00--U+2FD5.
>
> In summary:  although E-Ten 1 was not defined in terms of Unicode, it is  
> clear that the 25 radicals were all meant to map to normal Han  
> characters, not to the special radical characters found in CNS 11643 and  
> Unicode.
>
> Enter HKSCS.  20 of the E-Ten 1 Kangxi radical mappings (along with the  
> rest of E-Ten 1 and E-Ten 2, or almost) are adopted, but the remaining 5  
> are instead given new codepoints elsewhere.  Whatever the reason be, 4  
> of the 5 unused E-Ten positions are simply left undefined in the HKSCS  
> standard, which is not much of a problem for a unified HK/non-HK Big5  
> encoding.  Unfortunately, the position C6CD was not left undefined, but  
> instead mapped to U+2F33 (⼳), the Unicode Kangxi Radical version of  
> U+5E7A (幺), thus introducing not only the only Unicode Kangxi Radical  
> into the HKSCS standard, but also a Unicode mapping that is incompatible  
> with previous Big5 versions.  I wish I knew why.
>
>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but  
>> it's not the only hanzi in HKSCS-2008 that normalizes into something  
>> else:
>>
>> 8BC3 => <U+2F878> ? => <U+5C6E> ?
>> 8BF8 => <U+F907> ? => <U+9F9C> ?
>> 8EFD => <U+2F994> ? => <U+82B3> ?
>> 8FA8 => <U+2F9B2> ? => <U+456B> ?
>> 8FF0 => <U+2F9D4> ? => <U+8CAB> ?
>> C6CD => <U+2F33> ? => <U+5E7A> ?
>> 957A => <U+2F9BC> ? => <U+8728> ?
>> 9874 => <U+2F825> ? => <U+52C7> ?
>> 9AC8 => <U+2F83B> ? => <U+5406> ?
>> 9C52 => <U+2F8CD> ? => <U+6649> ?
>> A047 => <U+2F840> ? => <U+54A2> ?
>> FC48 => <U+2F894> ? => <U+5F22> ?
>> FC77 => <U+2F8A6> ? => <U+6148> ?
>
> The other pairs all contain characters that look slightly different,  
> whereas U+5E7A and U+2F33 look the same (and, I believe, are supposed to  
> look the same), the only difference being that the former is a normal  
> Han character whereas the latter carries the additional semantics of  
> being a Kangxi radical.

That the characters in the above list look slightly different is really a  
font issue; they are canonically equivalent in Unicode and therefore the  
same, AFAICT.
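
For anyone who wants to check the canonical vs. compatibility distinction  
for the code points in the list above, here is a minimal sketch using  
Python's standard unicodedata module. It is not part of the original  
exchange, the helper name `equivalence` is made up for illustration, and  
the result depends on the Unicode data shipped with the Python version  
used.

```python
import unicodedata


def equivalence(ch: str) -> str:
    """Classify how a single character normalizes (hypothetical helper).

    "canonical"     -> the character changes already under NFC
    "compatibility" -> it is stable under NFC but changes under NFKC
    "none"          -> it is stable under both forms
    """
    if unicodedata.normalize("NFC", ch) != ch:
        return "canonical"
    if unicodedata.normalize("NFKC", ch) != ch:
        return "compatibility"
    return "none"


# A few of the code points discussed in this thread.
for cp in (0xF907, 0x2F878, 0x2F33, 0x5E7A):
    ch = chr(cp)
    target = unicodedata.normalize("NFKC", ch)
    print(f"U+{cp:04X} -> U+{ord(target):04X} ({equivalence(ch)})")
```

With current Unicode data this should report the CJK compatibility  
ideographs as canonical equivalents of their targets, and U+2F33 as a  
compatibility (NFKC-only) equivalent of U+5E7A.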

>> I'm not sure what the conclusion is...
>
> I am not entirely sure either.  It seems clear that the mapping from  
> C6CD to U+2F33 makes no sense for non-HKSCS Big5 (which does not encode  
> U+5E7A anywhere else), but it does not seem to make much sense for  
> Big5-HKSCS either, which suggests that I might be missing something.

U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008  
and I agree that it's weird. However, unless U+2F33 causes problems on  
real-world pages, I'm not really comfortable with fixing bugs in  
HKSCS-2008, at least not based only on agreement by two Northern Europeans  
like us... If users or implementors from Hong Kong or Taiwan also speak up  
for U+5E7A, then I will not object. I posted  
<http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0001.html>  
a few days ago seeking such feedback, but so far no one has commented on  
this specific issue.
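
A rough way to repeat the "only Kangxi Radical" check against a codec  
rather than the HKSCS-2008 tables is sketched below. It assumes Python's  
bundled big5hkscs codec, which is not guaranteed to match HKSCS-2008 or  
any browser, so the output only describes that codec. The function name  
`kangxi_radical_mappings` is made up for illustration.

```python
def kangxi_radical_mappings(codec: str = "big5hkscs"):
    """Return (bytes_hex, code_point) pairs that decode into U+2F00-U+2FDF.

    Hypothetical helper: brute-forces every two-byte sequence in the
    usual Big5 lead/trail byte ranges and keeps the Kangxi Radicals.
    """
    hits = []
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0x81, 0xFF):
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                text = raw.decode(codec)
            except UnicodeDecodeError:
                continue  # no mapping for this sequence in this codec
            # A few HKSCS sequences decode to two code points, so
            # inspect every character of the result.
            for ch in text:
                if 0x2F00 <= ord(ch) <= 0x2FDF:
                    hits.append((raw.hex().upper(), ord(ch)))
    return hits


for raw_hex, cp in kangxi_radical_mappings():
    print(f"{raw_hex} => U+{cp:04X}")
```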

>>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at  
>>> opera.com> wrote:
>>>
>>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>>>
>>>> F9FE =>
>>>> opera-hk: U+FFED ￭
>>>> firefox: U+2593 ▓
>>>> chrome: U+2593 ▓
>>>> firefox-hk: U+2593 ▓
>>>> opera: U+2593 ▓
>>>> chrome-hk: U+FFED ￭
>>>> internetexplorer: U+2593 ▓
>>>> hkscs-2008: <U+FFED> ￭
>>>>
>>>> I'd say that we should go with U+FFED here, since that's what the  
>>>> [HKSCS-2008] spec
>>>> says and it's visually close anyway.
>>>
>>> Given that the goal is to define a unified Big5 (non-HK) and  
>>> Big5-HKSCS encoding and that this seems to be a case of the HK  
>>> standard going against everything and everyone else, perhaps more  
>>> weight should be given to existing specifications and  
>>> (non-HK-specific) implementations.
>>>
>>> Suggested change:  map F9FE to U+2593
>>
>> This is the only mapping where IE maps to something other than PUA or  
>> "?" and my mapping disagrees, so I don't object to changing it.  
>> Still, it would be very interesting to know why HKSCS-2008 changed it,  
>> do you know?
>
> No, I am afraid not.  I have been wondering as well, but I have not been  
> able to find an explanation.
>
> Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing  
> International Software' (1st Edn, 1995) both show something like U+2593,  
> but it could of course be that popular non-Unicode (HK) Big5 fonts had  
> glyphs more like U+FFED, which would make the HKSCS-2008 mapping less  
> surprising.  Do let me know if you discover any information on this.

On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:

> I was misremembering:  Lunde actually shows a solid black square, so it  
> looks like Microsoft may have changed this in its CP950 and HKSCS-2008  
> restored the original meaning.  [U+FFED does not seem quite right  
> (half-width looks implausible), but let us not start discussing all the  
> different black solid squares in Unicode.]
>
> Given the above, following HKSCS-2008 appears to be the best solution,  
> which brings the number of problematic forward mappings down to one.

U+FFED decomposes to U+25A0 which could perhaps be more appropriate, but I  
suggest sticking with U+FFED and recommending that people use UTF-8 if  
they want some particular square shape.
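
The decomposition can be inspected with Python's standard unicodedata  
module. This is only a small illustrative sketch, and the variable name  
`folded` is made up.

```python
import unicodedata

# The three "square" candidates that came up for F9FE.
for cp in (0xFFED, 0x25A0, 0x2593):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} "
          f"decomposition={unicodedata.decomposition(ch)!r}")

# NFKC folds the halfwidth form onto its full-size counterpart.
folded = unicodedata.normalize("NFKC", "\uFFED")
print(f"NFKC(U+FFED) = U+{ord(folded):04X}")
```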

>>> Duplicates and reverse mappings:
>>>
>>> [...]
>>
>> [...] it clearly needs to be defined what to do for these 100 code  
>> points that have multiple mappings to Big5. I extended my Python script  
>> to find these 100 duplicates and to check what Python did for 'big5',  
>> falling back to 'big5-hkscs'. This is what it produced:
>>
>> [...]
>>
>> These are the ones where you (Øistein) disagree:
>>
>>> C6CF <= U+5EF4
>>> C6D3 <= U+65E0
>>> C6D5 <= U+7676
>>> C6D7 <= U+96B6
>>
>> AFAICT this has nothing to do with compatibility mappings, so what's  
>> the reason for this?
>
> As I wrote, '[o]nly these mappings will work for non-HK Big5  
> implementations.'  My reasoning was that a random Big5 implementation  
> would be more likely to include the E-Ten 1 extension than the HKSCS  
> extension.  On the other hand, these codepoints could be less than ideal  
> if major Big5-HKSCS implementations follow the standard strictly and map  
> to nothing.


>>> F9E9 <= U+255E
>>> F9EA <= U+256A
>>> F9EB <= U+2561
>>> F9F9 <= U+2550
>>
>> Python's big5-hkscs agrees, but Python's big5 does this instead:
>>
>> A2A5 <= U+255E
>> A2A6 <= U+256A
>> A2A7 <= U+2561
>> A2A4 <= U+2550
>>
>> It seems safer to go with the big5 mappings, but checking what browsers  
>> do would be helpful.
>
> Does this imply that Python's big5 (non-HK) implementation does not  
> include the corresponding E-Ten 2 (forward) mappings for decoding either?

So says python3:

>>> b'\xf9\xe9'.decode('big5')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1:  
illegal multibyte sequence
>>> b'\xf9\xe9'.decode('big5-hkscs')
'╞'

A2A4-A2A7 are fine in both big5 and big5-hkscs, however.
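
The same comparison as a small script, for anyone who wants to rerun it.  
This is a sketch against Python's own big5 and big5hkscs codecs only. The  
helper name `try_decode` is made up, and other Python versions or other  
implementations may of course differ.

```python
def try_decode(raw: bytes, codec: str):
    """Decode raw with codec; return None if the codec has no mapping."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


# F9E9 is the E-Ten 2 position, A2A5 the plain Big5 position.
for raw in (b"\xf9\xe9", b"\xa2\xa5"):
    for codec in ("big5", "big5hkscs"):
        result = try_decode(raw, codec)
        if result is None:
            shown = "unmapped"
        else:
            shown = " ".join(f"U+{ord(c):04X}" for c in result)
        print(f"{raw.hex().upper()} {codec}: {shown}")

# Reverse direction: which bytes does plain big5 pick for U+255E?
print("U+255E big5:", "\u255e".encode("big5").hex().upper())
```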

On Tue, 10 Apr 2012 17:00:03 +0200, Øistein E. Andersen <liszt at coq.no>  
wrote:

> Getting the double-stroked circle segments at F9FB..F9FD added to  
> Unicode would make it possible to provide Unicode mappings in accordance  
> with the original intent and remove four duplicate mappings.  This might  
> be worthwhile if the characters have not been proposed and rejected  
> already.

Are there any sites that use these line drawing characters that would be  
fixed by this? If not, I'm quite willing to accept the historical  
accidents and move on :)

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Thursday, 12 April 2012 00:26:51 UTC