[whatwg] Encodings and the web from Anne van Kesteren on 2012-01-08 (public-whatwg-archive@w3.org from January 2012)

From: Anne van Kesteren <annevk@opera.com>
Date: Sun, 08 Jan 2012 15:32:47 +0100
Message-ID: <op.v7rr0xcl64w2qv@annevk-macbookpro.local>
On Sun, 08 Jan 2012 01:37:14 +0100, NARUSE, Yui <naruse at airemix.jp> wrote:
> = Legacy multi-octet Chinese (traditional) encodings
>
> Mozilla supports another Big5 variant, Big5-UAO.
> http://bugs.ruby-lang.org/issues/1784

As part of the big5 encoding, right? It sounds like it's a good idea to  
adopt that. I don't think there's much concern about table size these  
days, though obviously the less complexity the better.


> = Legacy multi-octet Japanese encodings
>
>> The jis code point for a given number is: ...
>> The jis0208 index for a given octet is:
>
> I wonder about this description.
> I should explain the concept of JIS X 0208.
>
> The most important thing is that JIS X 0208 is defined in the context of
> ISO/IEC 2022.
> Its target is the ISO/IEC 2022 double-byte 94-character set.
> This means its code space is 94 x 94.
> http://en.wikipedia.org/wiki/JIS_X_0208
>
> At the top, there are kuten numbers.
> "ku" is the row, expressed by the first byte of the double-byte code.
> "ten" is the cell, expressed by the second byte of the double-byte code.
> So a kuten number expresses a code point.
> Both ku and ten are integers from 1 to 94.
> For example, the kuten number of Hiragana small A (U+3041) is 04-01.
>
> ISO-2022-JP, EUC-JP, and Shift_JIS map a kuten number to bytes.
> ISO-2022-JP's double bytes are:
>  first:  ku  + 0x20
>  second: ten + 0x20
> EUC-JP's double bytes are:
>  first:  ku  + 0xA0
>  second: ten + 0xA0
> Shift_JIS's double bytes are:
>  first:  if    1 <= ku <= 62 then (ku-1) / 2 + 0x81
>          elif 63 <= ku <= 94 then (ku-1) / 2 + 0xC1
>  second: if ku is odd
>            if    1 <= ten <= 63 then ten + 0x3F
>            elif 64 <= ten <= 94 then ten + 0x40
>          elif ku is even then ten + 0x9E
>
>
> So theoretically, we should make a conversion table between
> kuten numbers and Unicode scalar values.
>
> But as you know, "JIS X 0208" in the web context should be Windows Code
> Page 932, as extended by Microsoft.
> http://msdn.microsoft.com/en-us/goglobal/cc305152
> It is defined in terms of Shift_JIS.
>
>> The jis0212 index for a given octet is:
>
> As written in Bugzilla at Mozilla Bug 600715, IE doesn't support JIS X 0212.
> https://bugzilla.mozilla.org/show_bug.cgi?id=600715
> How to treat JIS X 0212 in this Encoding spec will be a problem.

Yeah, so currently I use Gecko's approach (roughly) to Japanese  
encodings, including how it puts both 0208 and 0212 in a single longish  
array. But maybe I should instead write it down the way Unicode.org does,  
with each double-octet sequence mapping to a Unicode character.  
Suggestions welcome.

With respect to 0212, it's not that hard to support it and given how long  
it has been deployed this way it's probably safer to keep it there I think.
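
For illustration, the kuten-to-byte mappings discussed in the quoted text can
be sketched in Python. This is a minimal sketch, not text from any spec: the
function names are hypothetical, and the Shift_JIS trail byte follows the
usual rule (odd ku: ten + 0x3F for ten up to 63, otherwise ten + 0x40; even
ku: ten + 0x9E). It covers only the plain 94 x 94 JIS X 0208 code space, not
the Windows Code Page 932 extensions.

```python
def _check(ku, ten):
    # Both coordinates of a kuten number are integers from 1 to 94.
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must both be in 1..94")


def kuten_to_iso2022jp(ku, ten):
    """Double-byte form used inside the JIS X 0208 state of ISO-2022-JP."""
    _check(ku, ten)
    return bytes([ku + 0x20, ten + 0x20])


def kuten_to_eucjp(ku, ten):
    """EUC-JP: the ISO-2022-JP bytes with the high bit set."""
    _check(ku, ten)
    return bytes([ku + 0xA0, ten + 0xA0])


def kuten_to_shift_jis(ku, ten):
    """Shift_JIS: two rows share one lead byte; the trail byte depends on parity."""
    _check(ku, ten)
    first = (ku - 1) // 2 + (0x81 if ku <= 62 else 0xC1)
    if ku % 2 == 1:
        # Odd row: lower half of the trail range, skipping 0x7F.
        second = ten + (0x3F if ten <= 63 else 0x40)
    else:
        # Even row: upper half of the trail range.
        second = ten + 0x9E
    return bytes([first, second])


if __name__ == "__main__":
    # Kuten 16-01 is the first kanji in JIS X 0208 (U+4E9C).
    print(kuten_to_iso2022jp(16, 1).hex())  # 3021
    print(kuten_to_eucjp(16, 1).hex())      # b0a1
    print(kuten_to_shift_jis(16, 1).hex())  # 889f
```

A table keyed on kuten numbers would therefore serve all three encodings,
with only this small byte-level step differing between them.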


> == iso-2022-jp
> === The to Unicode algorithm
> ==== Based on iso-2022-jp state
> ===== ASCII state
> ====== Based on octet:
> ======= Otherwise
>> If the fatal flag is set, return failure.
>> Otherwise, emit the fallback code point.
>
> Just FYI, IE and Opera show these bytes as Katakana.
> If octet is greater than 0xA0 and less than 0xE0, value is octet +  
> 0xFEC0.
>
> Moreover, IE shows any shift_jis characters here.
> It seems that IE uses the same converter for both iso-2022-jp and shift_jis.

I have filed a bug on Opera to become more strict like WebKit/Gecko. If  
there is some evidence that approach is wrong though, we can turn it  
around.
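
To make the lenient behavior described in the quoted text concrete, here is a
small Python sketch of the "octet + 0xFEC0" mapping. The function name is
hypothetical, and the sketch only mirrors the arithmetic quoted above; it is
not taken from any browser's source.

```python
def lenient_katakana(octet):
    """Map a stray high octet to halfwidth Katakana, or return None.

    Mirrors the quoted rule: for 0xA0 < octet < 0xE0 the code point is
    octet + 0xFEC0, which lands in the range U+FF61..U+FF9F.
    """
    if 0xA0 < octet < 0xE0:
        return chr(octet + 0xFEC0)
    # A strict decoder would treat these octets as errors instead.
    return None


if __name__ == "__main__":
    for octet in (0xA1, 0xB1, 0xDF, 0xE0):
        mapped = lenient_katakana(octet)
        label = "error" if mapped is None else "U+%04X" % ord(mapped)
        print("0x%02X -> %s" % (octet, label))
```

A strict decoder in the style of WebKit/Gecko would take the None branch for
every such octet and either fail or emit the fallback code point.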


-- 
Anne van Kesteren
http://annevankesteren.nl/
Received on Sunday, 8 January 2012 06:32:47 UTC