Re: Encoding single-byte tests from 신정식 on 2014-09-02 (www-international@w3.org from July to September 2014)

From: 신정식 <jshin1987+w3@gmail.com>
Date: Tue, 2 Sep 2014 12:05:14 -0700
To: Anne van Kesteren <annevk@annevk.nl>
Cc: Richard Ishida <ishida@w3.org>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, www International <www-international@w3.org>, Philippe Le Hegaret <plh@w3.org>
Message-ID: <CAE1ONj9htLtAQG1BfuBz3JCGDVDgw3ARmKTXxLr1G682rR91fw@mail.gmail.com>

On Tue, Sep 2, 2014 at 11:19 AM, Jungshik SHIN (신정식) <jshin1987+w3@gmail.com> wrote:

>
>
>
> On Tue, Sep 2, 2014 at 2:59 AM, Anne van Kesteren <annevk@annevk.nl>
> wrote:
>
>> On Mon, Sep 1, 2014 at 2:20 PM, Richard Ishida <ishida@w3.org> wrote:
>> >
>> http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
>>
>> This data seems to show the following:
>>
>> 1. Firefox has a bug in the windows-* encodings:
>> https://bugzilla.mozilla.org/show_bug.cgi?id=1058021 (It used to have
>> this bug for iso-8859-* encodings too, that was fixed independently
>> much longer ago.)
>> 2. Internet Explorer frequently uses distinct PUA code points rather
>> than U+FFFD.
>> 3. For windows-1253 and windows-874 browsers used a strategy that
>> deviates from their strategy for other encodings.
>>
>> I think only point 3 is worth looking into further, so let's do that.
>>
>> For windows-1253 it seems Firefox's problem is only 1. It otherwise
>> fully matches Encoding (and therefore will soon be compliant). For
>> Internet Explorer it is 2. Chrome and Safari are nearly identical to
>> Encoding apart from 0xAA, which they map to U+00AA rather than U+FFFD
>> for unclear reasons. They do have the other two U+FFFD code points and
>> do not pass the byte through there. Seems like a bug.
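
To make the 0xAA question concrete, here is a minimal Python sketch. It is only an illustration: Python's cp1253 codec is used as a stand-in for a table that leaves 0xAA, 0xD2 and 0xFF unassigned, and latin-1 as a stand-in for "passing the byte through". Neither is any browser's actual decoder, Python's table may differ from the Encoding Standard for other bytes, and the function names are made up for this example.

```python
# Illustration only: cp1253 stands in for a table where 0xAA, 0xD2 and 0xFF
# are unassigned; latin-1 stands in for "pass the byte through unchanged".
UNASSIGNED = bytes([0xAA, 0xD2, 0xFF])


def decode_with_table(data: bytes) -> str:
    """Decode with the table; unassigned bytes become U+FFFD."""
    return data.decode("cp1253", errors="replace")


def decode_pass_through(data: bytes) -> str:
    """Map each byte to the code point with the same numeric value."""
    return data.decode("latin-1")


if __name__ == "__main__":
    for b in UNASSIGNED:
        one = bytes([b])
        table_cp = ord(decode_with_table(one))
        passed_cp = ord(decode_pass_through(one))
        print(f"0x{b:02X}: table -> U+{table_cp:04X}, "
              f"pass-through -> U+{passed_cp:04X}")
```

The behavior described for Chrome and Safari corresponds to using the second function for 0xAA but the first one for the other two bytes.
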
>>
>> For windows-874 it seems Firefox' problem is 1 again. Internet
>> Explorer's problem is 2 again. And for some weird reason Chrome and
>> Safari follow Internet Explorer here rather than not emitting PUA code
>> points as they do for all other windows-* encodings. That also seems
>> like a bug, though if there's a particular reason that would be
>> interesting to know.
>>
>>
> I don't think that you can call either of the two issues a bug per se.
> They're different behaviors, and what we're dealing with is not a 'correct vs
> incorrect' issue, but how to get to a 'consensus' out of different
> implementations with a lot of historical baggage. I'm afraid even seemingly
> simple single-byte encodings are not as straightforward to standardize as the
> current encoding spec seems to assume.
>
> Actually, I was surprised to see that Richard's test results have only two
> 'yellows' for Chrome/Opera, which mostly use ICU's default conversion rules
> for most of the single-byte encodings. Then I realized that his tests
> only test decoding, and encoding is another can of worms. For
> instance, windows-874-2000.ucm in ICU
> <http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-874-2000.ucm> has
> a number of entries tagged with '|1', meaning that they're used only for
> encoding. ('|3' denotes 'decoding only'.)
>
> <U0074> \x74 | 0    : round-trip
>
> <UFF54> \x74 |1   : encoding-only # full-width Latin Small Letter T
>
>
> And it appears that the windows-125x tables in ICU (I haven't checked them all) also encode U+FFxx (the full-width ASCII block) to the corresponding ASCII-range bytes. That means Chrome, Opera and Safari do this. IE is likely to do the same, too.
>
>
I've just checked IE 10 with windows-{1252, 1253, 874}, and IE does not do
this. So, it's {Firefox, IE} vs {Chrome/Opera, Safari}.
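
The difference between the two groups can be sketched in a few lines of Python. This is a minimal illustration, not ICU's converter: it parses only the single-byte line shape quoted in this thread, uses the two sample lines from windows-874-2000.ucm as its whole table, and the names `parse_ucm` and `encode_char` are invented for the example.

```python
import re

# Sample lines as quoted in this thread; a real .ucm file has a header and
# many more entries, which this sketch does not try to handle.
SAMPLE_UCM = r"""
<U0074> \x74 |0
<UFF54> \x74 |1
"""

LINE_RE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s*\|\s*([0-9])"
)


def parse_ucm(text):
    """Return (decode, roundtrip, fallback) tables for single-byte lines."""
    decode, roundtrip, fallback = {}, {}, {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        cp, byte, flag = int(m[1], 16), int(m[2], 16), m[3]
        if flag == "0":      # round-trip: used for decoding and encoding
            decode[byte] = cp
            roundtrip[cp] = byte
        elif flag == "1":    # encoding only
            fallback[cp] = byte
        elif flag == "3":    # decoding only
            decode[byte] = cp
        # other flags are not discussed in this thread and are ignored here
    return decode, roundtrip, fallback


def encode_char(ch, roundtrip, fallback, use_fallback):
    """Return the byte for ch, or None if it cannot be encoded."""
    cp = ord(ch)
    if cp in roundtrip:
        return roundtrip[cp]
    if use_fallback and cp in fallback:
        return fallback[cp]
    return None


if __name__ == "__main__":
    decode, roundtrip, fallback = parse_ucm(SAMPLE_UCM)
    for use_fallback in (True, False):
        result = encode_char("\uff54", roundtrip, fallback, use_fallback)
        print(f"use_fallback={use_fallback}: U+FF54 -> {result}")
```

With `use_fallback=True` the full-width letter is encoded as 0x74, which is the behavior described above for Chrome/Opera and Safari; with `use_fallback=False` it is unencodable, which matches what Firefox and IE appear to do.
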

I'm not sure which would be best for the spec. Chrome/Opera can easily
change their behavior in ToT, but it may take a while for Safari to do that.

FYI, there's an ICU bug to add conversion tables that are compliant with the
encoding spec ( http://www.icu-project.org/trac/ticket/10303 ).

Jungshik


>
> Windows-1252 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5348_P100-1997.ucm
>
> Windows-1253 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5349_P100-1998.ucm
>
> OTOH, encoding U+2554 (box drawing) to 0xC9 seems to be a genuine bug because it does not make sense to encode both U+2554 and U+0E29 (Thai character) to 0xC9 in Windows-874.
>
> <U2554> \xC9 |1
> <U0E29> \xC9 |0
>
> I filed a bug against ICU on this issue :
> http://bugs.icu-project.org/trac/ticket/11231
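
One rough way to separate the two kinds of encoding-only entries is sketched below in Python. The heuristic is an assumption made for this example and is not a criterion used by ICU: an encoding-only entry is treated as plausible when NFKC normalization of the source character yields the round-trip character for the same byte, and is flagged otherwise. The sample lines are the four quoted in this thread, and `flag_odd_fallbacks` is a hypothetical name.

```python
import re
import unicodedata
from collections import defaultdict

# The four windows-874 lines quoted in this thread.
SAMPLE_UCM = r"""
<U2554> \xC9 |1
<U0E29> \xC9 |0
<UFF54> \x74 |1
<U0074> \x74 |0
"""

LINE_RE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s*\|\s*([0-9])"
)


def flag_odd_fallbacks(text):
    """List (byte, fallback_cp, roundtrip_cp) where the encoding-only source
    does not NFKC-normalize to the round-trip character for that byte."""
    by_byte = defaultdict(lambda: {"roundtrip": None, "fallback": []})
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        cp, byte, flag = int(m[1], 16), int(m[2], 16), m[3]
        if flag == "0":
            by_byte[byte]["roundtrip"] = cp
        elif flag == "1":
            by_byte[byte]["fallback"].append(cp)

    flagged = []
    for byte, entry in sorted(by_byte.items()):
        rt = entry["roundtrip"]
        for fb in entry["fallback"]:
            folded = unicodedata.normalize("NFKC", chr(fb))
            if rt is None or folded != chr(rt):
                flagged.append((byte, fb, rt))
    return flagged


if __name__ == "__main__":
    for byte, fb, rt in flag_odd_fallbacks(SAMPLE_UCM):
        print(f"0x{byte:02X}: U+{fb:04X} ({unicodedata.name(chr(fb))}) "
              f"shares a byte with U+{rt:04X} ({unicodedata.name(chr(rt))})")
```

On these four lines only the U+2554 entry is flagged; the full-width U+FF54 entry passes because it is a compatibility variant of U+0074. Many legitimate best-fit mappings would also be flagged by this check, so it is a review aid rather than a definition of a bug.
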
>
>
>
>> Overall, based on these (revised) tests I still don't see a compelling
>> reason to change the Encoding Standard.
>>
>>
> Not based on Richard's test results. Nonetheless, the encoding rules for
> single-byte encodings have to be revisited, IMHO, given what I wrote above.
>
> Jungshik
>
>
>
>
>>
>> --
>> http://annevankesteren.nl/
>>
>>
>

Received on Tuesday, 2 September 2014 19:05:43 UTC