- From: Philip Jägenstedt <philipj@opera.com>
- Date: Thu, 12 Apr 2012 09:26:51 +0200
On Mon, 09 Apr 2012 03:08:20 +0200, Øistein E. Andersen <liszt at coq.no> wrote:

> On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:
>
>> On Sat, 07 Apr 2012 16:04:55 +0200, Øistein E. Andersen <liszt at coq.no>
>> wrote:
>>> Suggested change: map C6CD to U+5E7A.
>>
>> These are the existing mappings:
>>
>> C6CD =>
>>   opera-hk: U+2F33 ?
>>   firefox: U+5E7A ?
>>   chrome: U+F6DD ?
>>   firefox-hk: U+5E7A ?
>>   opera: U+2F33 ?
>>   chrome-hk: U+2F33 ?
>>   internetexplorer: U+F6DD ?
>>   hkscs-2008: <U+2F33> ?
>>
>> At least on the Web, this isn't a question of HK vs non-HK mappings.
>> Other than Firefox, which (de-facto) specs or implementations use
>> U+5E7A?
>
> I have now had a closer look at my notes
> (<http://coq.no/character-tables/chinese-traditional/en>). My argument
> for U+5E7A goes as follows:
>
> Of the 214 Kangxi radicals, 186 appear (as normal Han character) in CNS
> 11643 Planes 1 or 2, whereas 25 appear in Plane 3 and 3 are missing
> altogether. Big5 only covers Planes 1 and 2, which means that 28 Kangxi
> radicals (which may be rare in running text, but are nevertheless
> important) are missing. The E-Ten 1 extension encodes 25 of the missing
> radicals in the range C6BF--C6D7. Unlike CNS 11643 and Unicode, Big5
> does not encode radicals twice (as radicals and normal characters).
> This means that Big5 with the E-Ten 1 extension contains 211 of the 214
> Kangxi radicals, all mapped to normal Han characters, and no codepoints
> mapped to Unicode Kangxi Radicals in the range U+2F00--U+2FD5.
>
> In summary: although E-Ten 1 was not defined in terms of Unicode, it is
> clear that the 25 radicals were all meant to map to normal Han
> characters, not to the special radical characters found in CNS 11643 and
> Unicode.
>
> Enter HKSCS. 20 of the E-Ten 1 Kangxi radical mappings (along with the
> rest of E-Ten 1 and E-Ten 2, or almost) are adopted, but the remaining 5
> are instead given new codepoints elsewhere.
> Whatever the reason be, 4 of the 5 unused E-Ten positions are simply
> left undefined in the HKSCS standard, which is not much of a problem
> for a unified HK/non-HK Big5 encoding. Unfortunately, the position C6CD
> was not left undefined, but instead mapped to U+2F33 (?), the Unicode
> Kangxi Radical version of U+5E7A (?), thus introducing not only the
> only Unicode Kangxi Radical into the HKSCS standard, but also a Unicode
> mapping that is incompatible with previous Big5 versions. I wish I knew
> why.
>
>> Possibly, one could argue that U+2F33 normalizes (NFKC) to U+5E7A, but
>> it's not the only hanzi in HKSCS-2008 that normalizes into something
>> else:
>>
>> 8BC3 => <U+2F878> ? => <U+5C6E> ?
>> 8BF8 => <U+F907> ? => <U+9F9C> ?
>> 8EFD => <U+2F994> ? => <U+82B3> ?
>> 8FA8 => <U+2F9B2> ? => <U+456B> ?
>> 8FF0 => <U+2F9D4> ? => <U+8CAB> ?
>> C6CD => <U+2F33> ? => <U+5E7A> ?
>> 957A => <U+2F9BC> ? => <U+8728> ?
>> 9874 => <U+2F825> ? => <U+52C7> ?
>> 9AC8 => <U+2F83B> ? => <U+5406> ?
>> 9C52 => <U+2F8CD> ? => <U+6649> ?
>> A047 => <U+2F840> ? => <U+54A2> ?
>> FC48 => <U+2F894> ? => <U+5F22> ?
>> FC77 => <U+2F8A6> ? => <U+6148> ?
>
> The other pairs all contain characters that look slightly different,
> whereas U+5E7A and U+2F33 look the same (and, I believe, are supposed to
> look the same), the only difference being that the former is a normal
> Han character whereas the latter carries the additional semantics of
> being a Kangxi radical.

That the characters in the above list look slightly different is really
a font issue; they are canonically equivalent in Unicode and therefore
the same, AFAICT.

>> I'm not sure what the conclusion is...
>
> I am not entirely sure either. It seems clear that the mapping from
> C6CD to U+2F33 makes no sense for non-HKSCS Big5 (which does not encode
> U+5E7A anywhere else), but it does not seem to make much sense for
> Big5-HKSCS either, which suggests that I might be missing something.
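
As a side note, the two kinds of equivalence mentioned above can be told
apart with a minimal Python sketch. It assumes nothing beyond the standard
unicodedata module; the helper name normalization_kind is made up for
illustration, and the results depend on the Unicode database bundled with
the Python build at hand, so treat it as a quick check rather than a
definitive statement about HKSCS-2008.

```python
import unicodedata


def normalization_kind(ch: str) -> str:
    """Classify how a single character changes under Unicode normalization.

    "canonical"      -> already changed by NFC (canonical equivalence)
    "compatibility"  -> unchanged by NFC, but changed by NFKC
    "unchanged"      -> stable under both forms
    """
    if unicodedata.normalize("NFC", ch) != ch:
        return "canonical"
    if unicodedata.normalize("NFKC", ch) != ch:
        return "compatibility"
    return "unchanged"


# A few code points taken from the list quoted above, plus U+5E7A itself.
for cp in (0x2F33, 0xF907, 0x2F878, 0x5E7A):
    ch = chr(cp)
    nfkc = unicodedata.normalize("NFKC", ch)
    targets = " ".join(f"U+{ord(c):04X}" for c in nfkc)
    print(f"U+{cp:04X} -> {targets} ({normalization_kind(ch)})")
```

If this behaves as I expect, U+2F33 only changes under NFKC (a
compatibility mapping to U+5E7A), whereas a compatibility ideograph such
as U+F907 already changes under NFC, which is the sense in which those
pairs are canonically equivalent.
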

U+2F33 is indeed the only Kangxi Radical (2F00-2FDF) mapped by
HKSCS-2008 and I agree that it's weird. However, unless U+2F33 causes
problems on real-world pages, I'm not really comfortable with fixing
bugs in HKSCS-2008, at least not based only on agreement by two Northern
Europeans like us... If users or implementors from Hong Kong or Taiwan
also speak up for U+5E7A, then I will not object. I posted
<http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0001.html>
a few days ago seeking such feedback, but so far no one has commented on
this specific issue.

>>> On Fri Apr 6 06:42:26 PDT 2012, Philip Jägenstedt <philipj at
>>> opera.com> wrote:
>>>
>>>> Also, a single mapping fails the Big5-contra[di]ction test:
>>>>
>>>> F9FE =>
>>>>   opera-hk: U+FFED ?
>>>>   firefox: U+2593 ?
>>>>   chrome: U+2593 ?
>>>>   firefox-hk: U+2593 ?
>>>>   opera: U+2593 ?
>>>>   chrome-hk: U+FFED ?
>>>>   internetexplorer: U+2593 ?
>>>>   hkscs-2008: <U+FFED> ?
>>>>
>>>> I'd say that we should go with U+FFED here, since that's what the
>>>> [HKSCS-2008] spec says and it's visually close anyway.
>>>
>>> Given that the goal is to define a unified Big5 (non-HK) and
>>> Big5-HKSCS encoding and that this seems to be a case of the HK
>>> standard going against everything and everyone else, perhaps more
>>> weight should be given to existing specifications and
>>> (non-HK-specific) implementations.
>>>
>>> Suggested change: map F9FE to U+2593
>>
>> This is the only mapping where IE maps something other than PUA or "?"
>> that my mapping doesn't agree on, so I don't object to changing it.
>> Still, it would be very interesting to know why HKSCS-2008 changed it,
>> do you know?
>
> No, I am afraid not. I have been wondering as well, but I have not been
> able to find an explanation.
>
> Lunde (if I remember correctly, 1st Edn) and Kano's 'Developing
> International Software' (1st Edn, 1995) both show something like U+2593,
> but it could of course be that popular non-Unicode (HK) Big5 fonts had
> glyphs more like U+FFED, which would make the HKSCS-2008 mapping less
> surprising.

Do let me know if you discover any information on this.

On 8 Apr 2012, at 18:03, Philip Jägenstedt wrote:

> I was misremembering: Lunde actually shows a solid black square, so it
> looks like Microsoft may have changed this in its CP950 and HKSCS-2008
> restored the original meaning. [U+FFED does not seem quite right
> (half-width looks implausible), but let us not start discussing all the
> different black solid squares in Unicode.]
>
> Given the above, following HKSCS-2008 appears to be the best solution,
> which brings the number of problematic forward mappings down to one.

U+FFED decomposes to U+25A0, which could perhaps be more appropriate, but
I suggest sticking with U+FFED and recommending that people use UTF-8 if
they want some particular square shape.

>>> Duplicates and reverse mappings:
>>>
>>> [...]
>>
>> [...] it clearly needs to be defined what to do for these 100 code
>> points that have multiple mappings to Big5. I extended my Python script
>> to find these 100 duplicates and to check what Python did for 'big5',
>> falling back to 'big5-hkscs'. This is what it produced:
>>
>> [...]
>>
>> These are the ones where you (Øistein) disagree:
>>
>>> C6CF <= U+5EF4
>>> C6D3 <= U+65E0
>>> C6D5 <= U+7676
>>> C6D7 <= U+96B6
>>
>> AFAICT this has nothing to do with compatibility mappings, so what's
>> the reason for this?
>
> As I wrote, '[o]nly these mappings will work for non-HK Big5
> implementations.' My reasoning was that a random Big5 implementation
> would be more likely to include the E-Ten 1 extension than the HKSCS
> extension.
> On the other hand, these codepoints could be less than ideal if major
> Big5-HKSCS implementations follow the standard strictly and map to
> nothing.
>
>>> F9E9 <= U+255E
>>> F9EA <= U+256A
>>> F9EB <= U+2561
>>> F9F9 <= U+2550
>>
>> Python's big5-hkscs agrees, but Python's big5 does this instead:
>>
>> A2A5 <= U+255E
>> A2A6 <= U+256A
>> A2A7 <= U+2561
>> A2A4 <= U+2550
>>
>> It seems safer to go with the big5 mappings, but checking what browsers
>> do would be helpful.
>
> Does this imply that Python's big5 (non-HK) implementation does not
> include the corresponding E-Ten 2 (forward) mappings for decoding either?

So says python3:

    >>> b'\xf9\xe9'.decode('big5')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'big5' codec can't decode bytes in position 0-1: illegal multibyte sequence
    >>> b'\xf9\xe9'.decode('big5-hkscs')
    '╞'

A2A4-A2A7 are fine in both big5 and big5-hkscs, however.

On Tue, 10 Apr 2012 17:00:03 +0200, Øistein E. Andersen <liszt at coq.no> wrote:

> Getting the double-stroked circle segments at F9FB..F9FD added to
> Unicode would make it possible to provide Unicode mappings in accordance
> with the original intent and remove four duplicate mappings. This might
> be worthwhile if the characters have not been proposed and rejected
> already.

Are there any sites that use these line drawing characters that would be
fixed by this? If not, I'm quite willing to accept the historical
accidents and move on :)

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Thursday, 12 April 2012 00:26:51 UTC