- From: 신정식 <jshin1987+w3@gmail.com>
- Date: Tue, 2 Sep 2014 11:19:09 -0700
- To: Anne van Kesteren <annevk@annevk.nl>
- Cc: Richard Ishida <ishida@w3.org>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, www International <www-international@w3.org>, Philippe Le Hegaret <plh@w3.org>
- Message-ID: <CAE1ONj-eyd2oGnXZ0OP=jLwRXAxuFGktzmq+-=Gu0XG3hsevVw@mail.gmail.com>
On Tue, Sep 2, 2014 at 2:59 AM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Mon, Sep 1, 2014 at 2:20 PM, Richard Ishida <ishida@w3.org> wrote:
> > http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
>
> This data seems to show the following:
>
> 1. Firefox has a bug in the windows-* encodings:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1058021 (It used to have
> this bug for iso-8859-* encodings too, that was fixed independently
> much longer ago.)
> 2. Internet Explorer frequently uses distinct PUA code points rather
> than U+FFFD.
> 3. For windows-1253 and windows-874 browsers used a strategy that
> deviates from their strategy for other encodings.
>
> I think only point 3 is worth looking into further, so let's do that.
>
> For windows-1253 it seems Firefox' problem is only 1. It otherwise
> fully matches Encoding (and therefore will soon by compliant). For
> Internet Explorer it is 2. Chrome and Safari are nearly identical to
> Encoding apart from 0xAA, which they map to U+00AA rather than U+FFFD
> for unclear reasons. They do have the other two U+FFFD code points and
> do not pass the byte through there. Seems like a bug.
>
> For windows-874 it seems Firefox' problem is 1 again. Internet
> Explorer's problem is 2 again. And for some weird reason Chrome and
> Safari follow Internet Explorer here rather than not emitting PUA code
> points as they do for all other windows-* encodings. That also seems
> like a bug, though if there's a particular reason that would be
> interesting to know.

I don't think you can call either of the two issues a bug per se. They are
different behaviors, and what we're dealing with is not a 'correct vs.
incorrect' issue but the question of how to reach a 'consensus' among
different implementations that carry a lot of historical baggage. I'm afraid
even the seemingly simple single-byte encodings are not as straightforward to
standardize as the current Encoding spec seems to assume.
Actually, I was surprised to see that Richard's test results have only two
'yellows' for Chrome/Opera, which use ICU's default conversion rules for most
of the single-byte encodings. Then I realized that his tests only test
'decoding'; 'encoding' is another can of worms.

For instance, windows-874-2000.ucm in ICU
<http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-874-2000.ucm>
has a number of entries tagged with '|1', meaning that they are used only for
encoding ('|3' denotes 'decoding only'):

    <U0074> \x74 |0   : round-trip
    <UFF54> \x74 |1   : encoding-only  # full-width Latin Small Letter T

It also appears that all windows-125x tables in ICU (I haven't checked them
all) encode U+FFxx (the full-width ASCII block) to the corresponding
ASCII-range code points. That means Chrome, Opera and Safari do this. IE is
likely to do the same, too.

Windows-1252 :
http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5348_P100-1997.ucm
Windows-1253 :
http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5349_P100-1998.ucm

OTOH, encoding U+2554 (box drawing) to 0xC9 seems to be a genuine bug, because
it does not make sense to encode both U+2554 and U+0E29 (a Thai character) to
0xC9 in Windows-874:

    <U2554> \xC9 |1
    <U0E29> \xC9 |0

I filed a bug against ICU on this issue:
http://bugs.icu-project.org/trac/ticket/11231

> Overall, based on these (revised) tests I still don't see a compelling
> reason to change the Encoding Standard.

Not based on Richard's test results. Nonetheless, the encoding rules for
single-byte encodings have to be revisited, IMHO, given what I wrote above.

Jungshik

> --
> http://annevankesteren.nl/
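
As a rough illustration of the '|0' / '|1' / '|3' flags discussed above, here is a minimal Python sketch that builds separate encode and decode tables from .ucm-style mapping lines. It is not ICU's actual parser: the function name `parse_ucm_mappings`, the regular expression, and the `SAMPLE` text (assembled from the four entries quoted in this message) are assumptions made for illustration, and only the three flags mentioned here are handled.

```python
import re

# Matches lines such as "<UFF54> \x74 |1" (optionally with a space after '|').
UCM_LINE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|\s*([0-4])"
)

def parse_ucm_mappings(text):
    """Return (encode, decode) dicts built from .ucm-style mapping lines.

    Flag 0 = round-trip, flag 1 = encoding only, flag 3 = decoding only.
    Other flags are ignored in this sketch.
    """
    encode, decode = {}, {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        m = UCM_LINE.search(line)
        if not m:
            continue
        code_point = int(m.group(1), 16)
        raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
        flag = int(m.group(3))
        if flag in (0, 1):                    # usable when encoding
            encode[code_point] = raw
        if flag in (0, 3):                    # usable when decoding
            decode[raw] = code_point
    return encode, decode

# Hypothetical sample made of the entries quoted in this message.
SAMPLE = r"""
<U0074> \x74 |0
<UFF54> \x74 |1  # full-width Latin Small Letter T
<U2554> \xC9 |1
<U0E29> \xC9 |0
"""

encode, decode = parse_ucm_mappings(SAMPLE)
```

With these tables, both U+0074 and U+FF54 encode to the byte 0x74, while 0x74 decodes only to U+0074. That asymmetry is what a decoding-only test cannot observe.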
Received on Tuesday, 2 September 2014 18:19:39 UTC