- From: Øistein E. Andersen <liszt@coq.no>
- Date: Thu, 12 Apr 2012 10:52:20 +0100
On 12 Apr 2012, at 08:26, Philip Jägenstedt wrote:

>>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but it's not the only hanzi in HKSCS-2008 that normalizes into something else:
>
> That the characters in the above list look slightly different is really a font issue, they are canonically equivalent in Unicode and therefore the same, AFAICT.

Sorry, you are right about that, of course. U+2F33 and U+5E7A are not canonically equivalent, and I just assumed that was the case for the others as well without thinking.

> U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by HKSCS-2008 and I agree that it's weird. However [...], I'm not really comfortable with fixing bugs in HKSCS-2008, at least not based only on agreement by two Northern Europeans like us... If users or implementors from Hong Kong or Taiwan also speak up for U+5E7A, then I will not object.

I certainly agree with that sentiment.

>>>>> F9FE =>
>
> [...]
> U+FFED decomposes to U+25A0 which could perhaps be more appropriate,

Yes, except that A1BD maps to U+25A0.

> but I suggest sticking with U+FFED and recommending people to use UTF-8 if they want some particular square shape.

That makes sense. Cf. python again for a less web-centric point of view:

>>> b'\xf9\xfe'.decode('big5-hkscs')
u'\uffed'
>>> b'\xf9\xfe'.decode('cp950')
u'\u2593'
>>> b'\xf9\xfe'.decode('big5')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence

>> Does this imply that Python's big5 (non-HK) implementation does not include the corresponding E-Ten 2 (forward) mappings for decoding either?
>
> So says python3:
>
>>>> b'\xf9\xe9'.decode('big5')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
>>>> b'\xf9\xe9'.decode('big5-hkscs')
> '?'

Python also says:

>>> b'\xf9\xe9'.decode('cp950')
u'\u255e'

> Are there any sites that use these line drawing characters that would be fixed by this? If not, I'm quite willing to accept the historical accidents and move on :)

Probably not many. Still, it seems safe to fix these four mappings if the characters are ever added to Unicode.

Øistein E. Andersen
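The normalization and mapping relationships discussed above can be checked directly; a minimal sketch, assuming Python 3 with the standard unicodedata module and its bundled big5 codec (code points printed as hex for clarity):

>>> import unicodedata
>>> '%04X' % ord(unicodedata.normalize('NFKC', '\u2f33'))   # Kangxi radical compatibility-maps to the unified ideograph
'5E7A'
>>> unicodedata.normalize('NFC', '\u2f33') == '\u5e7a'      # but the two are not canonically equivalent
False
>>> '%04X' % ord(unicodedata.normalize('NFKC', '\uffed'))   # halfwidth black square decomposes to U+25A0
'25A0'
>>> '%04X' % ord(b'\xa1\xbd'.decode('big5'))                # A1BD already maps to U+25A0
'25A0'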
Received on Thursday, 12 April 2012 02:52:20 UTC