- From: 신정식 <jshin1987+w3@gmail.com>
- Date: Tue, 2 Sep 2014 12:05:14 -0700
- To: Anne van Kesteren <annevk@annevk.nl>
- Cc: Richard Ishida <ishida@w3.org>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, www International <www-international@w3.org>, Philippe Le Hegaret <plh@w3.org>
- Message-ID: <CAE1ONj9htLtAQG1BfuBz3JCGDVDgw3ARmKTXxLr1G682rR91fw@mail.gmail.com>
On Tue, Sep 2, 2014 at 11:19 AM, Jungshik SHIN (신정식) <jshin1987+w3@gmail.com> wrote:

> On Tue, Sep 2, 2014 at 2:59 AM, Anne van Kesteren <annevk@annevk.nl> wrote:
>
>> On Mon, Sep 1, 2014 at 2:20 PM, Richard Ishida <ishida@w3.org> wrote:
>> > http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
>>
>> This data seems to show the following:
>>
>> 1. Firefox has a bug in the windows-* encodings:
>>    https://bugzilla.mozilla.org/show_bug.cgi?id=1058021 (It used to have
>>    this bug for iso-8859-* encodings too; that was fixed independently
>>    much longer ago.)
>> 2. Internet Explorer frequently uses distinct PUA code points rather
>>    than U+FFFD.
>> 3. For windows-1253 and windows-874, browsers used a strategy that
>>    deviates from their strategy for other encodings.
>>
>> I think only point 3 is worth looking into further, so let's do that.
>>
>> For windows-1253 it seems Firefox's problem is only 1. It otherwise
>> fully matches Encoding (and therefore will soon be compliant). For
>> Internet Explorer it is 2. Chrome and Safari are nearly identical to
>> Encoding apart from 0xAA, which they map to U+00AA rather than U+FFFD
>> for unclear reasons. They do have the other two U+FFFD code points and
>> do not pass the byte through there. Seems like a bug.
>>
>> For windows-874 it seems Firefox's problem is 1 again. Internet
>> Explorer's problem is 2 again. And for some weird reason Chrome and
>> Safari follow Internet Explorer here rather than not emitting PUA code
>> points as they do for all other windows-* encodings. That also seems
>> like a bug, though if there's a particular reason that would be
>> interesting to know.
>
> I don't think that you can call either of the two issues a bug per se.
> They're different behaviors, and what we're dealing with is not a 'correct
> vs. incorrect' issue, but how to get to a 'consensus' out of different
> implementations with a lot of historical baggage.
> I'm afraid even seemingly simple single-byte encodings are not as
> straightforward to standardize as the current encoding spec seems to assume.
>
> Actually, I was surprised to see that Richard's test results have only two
> 'yellows' for Chrome/Opera, which mostly use ICU's default conversion rules
> for most of the single-byte encodings. Then I realized that his tests
> only test 'decoding', but 'encoding' is another can of worms. For
> instance, windows-874-2000.ucm in ICU
> <http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-874-2000.ucm>
> has a number of entries tagged with '|1', meaning that they're only for
> encoding. ('|3' denotes 'decoding only'.)
>
>   <U0074> \x74 |0 : round-trip
>   <UFF54> \x74 |1 : encoding-only # full-width Latin Small Letter T
>
> And it appears that all windows-125x tables in ICU (I haven't checked them
> all) do encode U+FFxx (full-width ASCII block) to the corresponding
> ASCII-range code points. That means Chrome, Opera and Safari do this. IE is
> likely to do the same, too.

I've just checked IE 10 with windows-{1252, 1253, 874} and IE does not do
this. So, it's {Firefox, IE} vs {Chrome/Opera, Safari}. I'm not sure which
would be the best for the spec. Chrome/Opera can easily change their behavior
in ToT, but it may take a while for Safari to do that. FYI, there's an ICU bug
to add conversion tables to be compliant with the encoding spec
( http://www.icu-project.org/trac/ticket/10303 ).

Jungshik

> Windows-1252 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5348_P100-1997.ucm
> Windows-1253 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5349_P100-1998.ucm
>
> OTOH, encoding U+2554 (box drawing) to 0xC9 seems to be a genuine bug
> because it does not make sense to encode both U+2554 and U+0E29 (Thai
> character) to 0xC9 in Windows-874.
>
>   <U2554> \xC9 |1
>   <U0E29> \xC9 |0
>
> I filed a bug against ICU on this issue:
> http://bugs.icu-project.org/trac/ticket/11231
>
>> Overall, based on these (revised) tests I still don't see a compelling
>> reason to change the Encoding Standard.
>
> Not based on Richard's test results. Nonetheless, the encoding rules for
> single-byte encodings have to be revisited, IMHO, given what I wrote above.
>
> Jungshik
>
>> --
>> http://annevankesteren.nl/
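
For readers who want to see why a decode-only test cannot catch the '|1' entries discussed in this message, here is a rough Python sketch. It assumes only the four mapping lines quoted in the thread and the flag meanings given there ('|0' round-trip, '|1' encoding-only, '|3' decoding-only). The names `SAMPLE_UCM`, `build_tables` and `one_way_mappings` are made up for illustration. This is not ICU code, and it does not parse the header or other flags of a real .ucm file.

```python
import re

# The four mapping lines quoted in the thread. A real .ucm file has a header
# and many more rows, which this sketch makes no attempt to handle.
SAMPLE_UCM = r"""
<U0074> \x74 |0
<UFF54> \x74 |1
<U2554> \xC9 |1
<U0E29> \xC9 |0
"""

# <Uxxxx> \xNN |flag, tolerating optional spaces around the flag
LINE_RE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s*\|\s*([0-9])"
)


def build_tables(ucm_text):
    """Return (decode, encode) dicts built from single-byte mapping lines."""
    decode = {}  # byte -> code point, as used when decoding
    encode = {}  # code point -> byte, as used when encoding
    for line in ucm_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank line or anything this sketch does not understand
        code_point = int(match.group(1), 16)
        byte = int(match.group(2), 16)
        flag = int(match.group(3))
        if flag in (0, 3):  # round-trip or decoding-only
            decode[byte] = code_point
        if flag in (0, 1):  # round-trip or encoding-only
            encode[code_point] = byte
        # Other flag values are ignored here (assumption of this sketch).
    return decode, encode


def one_way_mappings(decode, encode):
    """List (code point, byte, decoded code point) where a round trip changes the character."""
    return sorted(
        (code_point, byte, decode.get(byte))
        for code_point, byte in encode.items()
        if decode.get(byte) != code_point
    )


if __name__ == "__main__":
    decode_table, encode_table = build_tables(SAMPLE_UCM)
    for code_point, byte, back in one_way_mappings(decode_table, encode_table):
        print(f"U+{code_point:04X} -> 0x{byte:02X} -> U+{back:04X}")
```

With these four lines, the decode table has one entry per byte, so a decoding test sees nothing unusual. The encode table, though, sends both U+FF54 and U+2554 to bytes that decode back to a different character, which is the asymmetry the message describes.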
Received on Tuesday, 2 September 2014 19:05:43 UTC