[whatwg] Encoding: big5 and big5-hkscs

On 12 Apr 2012, at 08:26, Philip Jägenstedt wrote:

>>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:
> 
> That the characters in the above list look slightly different is really a font issue, they are canonically equivalent in Unicode and therefore the same, AFAICT.

Sorry, you are right about that, of course.  U+2F33 and U+5E7A are not canonically equivalent, and I just assumed that was the case for the others as well without thinking.
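For anyone who wants to check the distinction locally, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the result depends on the Unicode data bundled with the interpreter. The idea is that a canonical normalization form (NFC) leaves U+2F33 alone, while a compatibility form (NFKC) folds it into U+5E7A:

```python
import unicodedata

# Illustrative names: the Kangxi radical mapped by HKSCS-2008 and the
# unified ideograph it is compatibility-equivalent to.
radical = "\u2f33"
unified = "\u5e7a"

# Canonical normalization (NFC) should leave the radical untouched.
nfc_keeps_radical = unicodedata.normalize("NFC", radical) == radical

# Compatibility normalization (NFKC) should fold it into the ideograph.
nfkc_folds_radical = unicodedata.normalize("NFKC", radical) == unified

print(unicodedata.name(radical))
print("NFC keeps U+2F33:", nfc_keeps_radical)
print("NFKC maps U+2F33 to U+5E7A:", nfkc_folds_radical)
```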

> U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008 and I agree that it's weird. However [...], I'm not really comfortable with fixing bugs in HKSCS-2008, at least not based only on agreement by two Northern Europeans like us... If users or implementors from Hong Kong or Taiwan also speak up for U+5E7A, then I will not object.

I certainly agree with that sentiment.

>>>>> F9FE =>
> [...]
> U+FFED decomposes to U+25A0 which could perhaps be more appropriate,

Yes, except that A1BD maps to U+25A0.

> but I suggest sticking with U+FFED and recommending people to use UTF-8 if they want some particular square shape.

That makes sense.  Cf. Python again for a less web-centric point of view:

>>> b'\xf9\xfe'.decode('big5-hkscs')
u'\uffed'
>>> b'\xf9\xfe'.decode('cp950')
u'\u2593'
>>> b'\xf9\xfe'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence

>> Does this imply that Python's big5 (non-HK) implementation does not include the corresponding E-Ten 2 (forward) mappings for decoding either?
> 
> So says python3:
> 
>>>> b'\xf9\xe9'.decode('big5')
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>>> b'\xf9\xe9'.decode('big5-hkscs')
> '?'

Python also says:

>>> b'\xf9\xe9'.decode('cp950')
u'\u255e'
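The comparisons above can be repeated in one go with something like the following sketch. The helper name decode_with_each is made up for illustration, and the results depend on the codec tables shipped with the Python version at hand, so treat this as a way to reproduce the checks rather than as a reference for the mappings:

```python
def decode_with_each(raw, codecs=("big5", "big5-hkscs", "cp950")):
    """Decode raw bytes with each codec; None marks an undecodable sequence."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The two byte sequences discussed in this thread.
for raw in (b"\xf9\xfe", b"\xf9\xe9"):
    for name, text in decode_with_each(raw).items():
        if text is None:
            shown = "undecodable"
        else:
            shown = " ".join("U+%04X" % ord(ch) for ch in text)
        print(raw.hex(), name, shown)
```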

> Are there any sites that use these line drawing characters that would be fixed by this? If not, I'm quite willing to accept the historical accidents and move on :)

Probably not many.  Still, it seems safe to fix these four mappings if the characters are ever added to Unicode.

Øistein E. Andersen

Received on Thursday, 12 April 2012 02:52:20 UTC