Re: Encoding single-byte tests

On Tue, Sep 2, 2014 at 2:59 AM, Anne van Kesteren <annevk@annevk.nl> wrote:

> On Mon, Sep 1, 2014 at 2:20 PM, Richard Ishida <ishida@w3.org> wrote:
> >
> http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases
>
> This data seems to show the following:
>
> 1. Firefox has a bug in the windows-* encodings:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1058021 (It used to have
> this bug for iso-8859-* encodings too, that was fixed independently
> much longer ago.)
> 2. Internet Explorer frequently uses distinct PUA code points rather
> than U+FFFD.
> 3. For windows-1253 and windows-874 browsers used a strategy that
> deviates from their strategy for other encodings.
>
> I think only point 3 is worth looking into further, so let's do that.
>
> For windows-1253 it seems Firefox' problem is only 1. It otherwise
> fully matches Encoding (and therefore will soon be compliant). For
> Internet Explorer it is 2. Chrome and Safari are nearly identical to
> Encoding apart from 0xAA, which they map to U+00AA rather than U+FFFD
> for unclear reasons. They do have the other two U+FFFD code points and
> do not pass the byte through there. Seems like a bug.
>
>
> For windows-874 it seems Firefox' problem is 1 again. Internet
> Explorer's problem is 2 again. And for some weird reason Chrome and
> Safari follow Internet Explorer here rather than not emitting PUA code
> points as they do for all other windows-* encodings. That also seems
> like a bug, though if there's a particular reason that would be
> interesting to know.
>
>
I don't think you can call either of the two issues a bug per se. They're
different behaviors, and what we're dealing with is not a 'correct vs.
incorrect' issue but how to reach a 'consensus' among different
implementations with a lot of historical baggage. I'm afraid even
single-byte encodings, seemingly simple as they are, are not as
straightforward to standardize as the current Encoding spec seems to assume.

Actually, I was surprised to see that Richard's test results have only two
'yellows' for Chrome/Opera, which mostly use ICU's default conversion rules
for the single-byte encodings. Then I realized that his tests only cover
'decoding'; 'encoding' is another can of worms. For instance,
windows-874-2000.ucm in ICU
<http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/windows-874-2000.ucm>
has a number of entries tagged with '|1', meaning they are used for
encoding only ('|3' denotes 'decoding only'):

<U0074> \x74 |0   # round-trip
<UFF54> \x74 |1   # encoding-only fallback (FULLWIDTH LATIN SMALL LETTER T)

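To make those precision flags concrete, here is a minimal sketch in Python
(illustrative only, not ICU code; it assumes the simplified single-byte line
format shown above) that buckets UCM entries by flag:

import re

# Matches simplified single-byte UCM mapping lines like "<U0074> \x74 |0".
# Real .ucm files also contain headers, multi-byte sequences, and more flags.
UCM_LINE = re.compile(r'<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s*\|(\d)')

def bucket_ucm(lines):
    round_trip, encode_only, decode_only = {}, {}, {}
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if not m:
            continue
        cp, byte, flag = int(m.group(1), 16), int(m.group(2), 16), m.group(3)
        if flag == '0':
            round_trip[cp] = byte     # used for both encoding and decoding
        elif flag == '1':
            encode_only[cp] = byte    # fallback: fromUnicode (encoding) only
        elif flag == '3':
            decode_only[cp] = byte    # reverse fallback: toUnicode (decoding) only
    return round_trip, encode_only, decode_only

rt, enc, dec = bucket_ucm([r'<U0074> \x74 |0', r'<UFF54> \x74 |1'])
print(rt, enc)   # {116: 116} {65364: 116}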

And it appears that the windows-125x tables in ICU (I haven't checked them
all) do encode U+FFxx (the fullwidth ASCII block) to the corresponding
ASCII-range code points. That means Chrome, Opera, and Safari do this, and
IE is likely to do the same.


Windows-1252 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5348_P100-1997.ucm

Windows-1253 : http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/ibm-5349_P100-1998.ucm
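
Just to illustrate what such a '|1' fallback means on the encoding side,
here is a toy sketch (a made-up two-entry table, not ICU's actual
implementation):

# Hypothetical table fragment: one round-trip mapping plus an
# encoding-only ("|1") fallback from fullwidth ASCII to ASCII.
ROUND_TRIP  = {0x0074: 0x74}   # <U0074> \x74 |0
ENCODE_ONLY = {0xFF54: 0x74}   # <UFF54> \x74 |1  (FULLWIDTH LATIN SMALL LETTER T)

def encode_char(cp, use_fallbacks=True):
    if cp in ROUND_TRIP:
        return bytes([ROUND_TRIP[cp]])
    if use_fallbacks and cp in ENCODE_ONLY:
        return bytes([ENCODE_ONLY[cp]])
    return b'?'   # or an error / numeric character reference, policy-dependent

print(encode_char(0xFF54))         # b't'  -- fallback applied
print(encode_char(0xFF54, False))  # b'?'  -- fallback disabled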

OTOH, encoding U+2554 (BOX DRAWINGS DOUBLE DOWN AND RIGHT) to 0xC9 seems to
be a genuine bug, because it does not make sense to encode both U+2554 and
U+0E29 (THAI CHARACTER SO RUSI) to 0xC9 in windows-874.

<U2554> \xC9 |1   # encoding-only fallback
<U0E29> \xC9 |0   # round-trip
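
To spell out why that combination looks wrong (a toy sketch, not ICU code):
decoding 0xC9 can only yield one of the two code points, so U+2554 cannot
survive a round trip:

# Toy fragment of the questionable windows-874 mappings above.
TO_BYTE   = {0x2554: 0xC9,    # |1 encoding-only fallback (box drawing)
             0x0E29: 0xC9}    # |0 round-trip (THAI CHARACTER SO RUSI)
FROM_BYTE = {0xC9: 0x0E29}    # decoding has to pick one target; the |0 entry wins

original = 0x2554                         # BOX DRAWINGS DOUBLE DOWN AND RIGHT
round_tripped = FROM_BYTE[TO_BYTE[original]]
print(hex(round_tripped))                 # 0xe29 -- the box-drawing character is lost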

I filed a bug against ICU on this issue :
http://bugs.icu-project.org/trac/ticket/11231



> Overall, based on these (revised) tests I still don't see a compelling
> reason to change the Encoding Standard.
>
>
Not based on Richard's test results. Nonetheless, the encoding rules for
single-byte encodings have to be revisited, IMHO, given what I wrote above.

Jungshik

>
> --
> http://annevankesteren.nl/
>
>

Received on Tuesday, 2 September 2014 18:19:39 UTC