Re: Encoding single-byte tests

On Tue, Sep 2, 2014 at 2:59 AM, Anne van Kesteren <> wrote:

> On Mon, Sep 1, 2014 at 2:20 PM, Richard Ishida <> wrote:
> >
> This data seems to show the following:
> 1. Firefox has a bug in the windows-* encodings:
> (It used to have
> this bug for iso-8859-* encodings too, that was fixed independently
> much longer ago.)
> 2. Internet Explorer frequently uses distinct PUA code points rather
> than U+FFFD.
> 3. For windows-1253 and windows-874 browsers used a strategy that
> deviates from their strategy for other encodings.
> I think only point 3 is worth looking into further, so let's do that.
> For windows-1253 it seems Firefox' problem is only 1. It otherwise
> fully matches Encoding (and therefore will soon be compliant). For
> Internet Explorer it is 2. Chrome and Safari are nearly identical to
> Encoding apart from 0xAA, which they map to U+00AA rather than U+FFFD
> for unclear reasons. They do have the other two U+FFFD code points and
> do not pass the byte through there. Seems like a bug.

> For windows-874 it seems Firefox' problem is 1 again. Internet
> Explorer's problem is 2 again. And for some weird reason Chrome and
> Safari follow Internet Explorer here rather than not emitting PUA code
> points as they do for all other windows-* encodings. That also seems
> like a bug, though if there's a particular reason that would be
> interesting to know.
I don't think you can call either of the two issues a bug per se.
They're different behaviors, and what we're dealing with is not a 'correct vs
incorrect' issue but how to reach a 'consensus' among different
implementations with a lot of historical baggage. I'm afraid even seemingly
simple single-byte encodings are not as straightforward to standardize as the
current Encoding spec seems to assume.

Actually, I was surprised to see that Richard's test results have only two
'yellows' for Chrome/Opera, which use ICU's default conversion rules
for most of the single-byte encodings. Then I realized that his tests
only test 'decoding', and 'encoding' is another can of worms. For
instance, windows-874-2000.ucm in ICU has
a number of entries tagged with '|1', meaning that they're only for
encoding ('|3' denotes 'decoding only').

<U0074> \x74 | 0    : round-trip

<UFF54> \x74 |1   : encoding-only # full-width Latin Small Letter T
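To make the flag notation concrete, here is a minimal Python sketch that splits a mapping line like the two above into code point, byte sequence and flag. The function name `parse_ucm_line` and the `FLAG_MEANING` table are made up for illustration and are not part of ICU. Only the flags mentioned in this message are labeled, and the regular expression assumes the simple single-line form shown here.

```python
import re

# Flags as described in this message; other flag values are not labeled here.
FLAG_MEANING = {
    0: "round-trip",
    1: "encoding only",
    3: "decoding only",
}

# Matches e.g. "<UFF54> \x74 |1", ignoring anything after the flag digit.
UCM_LINE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*\|\s*(\d)"
)


def parse_ucm_line(line):
    """Return (code_point, raw_bytes, flag), or None if the line does not match."""
    m = UCM_LINE.match(line.strip())
    if m is None:
        return None
    code_point = int(m.group(1), 16)
    # Collect every \xNN escape into one bytes object.
    raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
    return code_point, raw, int(m.group(3))


if __name__ == "__main__":
    for text in (r"<U0074> \x74 | 0", r"<UFF54> \x74 |1"):
        cp, raw, flag = parse_ucm_line(text)
        print(f"U+{cp:04X} -> {raw!r}: {FLAG_MEANING.get(flag, 'other')}")
```

A table reader built on this would treat flag 0 entries as valid in both directions and flag 1 entries as valid for encoding only.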

And it appears that the windows-125x tables in ICU (I haven't checked
them all) encode U+FFxx (the full-width ASCII block) to the
corresponding ASCII-range code points. That means Chrome, Opera and
Safari do this. IE is likely to do the same, too.

Windows-1252 :

Windows-1253 :

OTOH, encoding U+2554 (box drawing) to 0xC9 seems to be a genuine bug
because it does not make sense to encode both U+2554 and U+0E29 (Thai
character) to 0xC9 in Windows-874.

<U2554> \xC9 |1
<U0E29> \xC9 |0
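One way to find such cases across a whole table is to group the entries that take part in encoding by their target byte and list every byte that more than one code point maps to. The following is a rough Python sketch under that assumption. The helper names `encoders_by_byte` and `shared_bytes` are hypothetical, and the sample contains only the four entries quoted in this message, not a full table.

```python
import re
from collections import defaultdict

# Single-byte mapping line, e.g. "<U2554> \xC9 |1".
ENTRY = re.compile(r"<U([0-9A-Fa-f]{4,6})>\s+\\x([0-9A-Fa-f]{2})\s*\|\s*(\d)")


def encoders_by_byte(lines):
    """Map each byte to the (code point, flag) pairs that encode to it."""
    table = defaultdict(list)
    for line in lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue
        cp, byte, flag = int(m.group(1), 16), int(m.group(2), 16), int(m.group(3))
        if flag in (0, 1):  # round-trip and encoding-only entries
            table[byte].append((cp, flag))
    return dict(table)


def shared_bytes(lines):
    """Keep only the bytes that more than one code point encodes to."""
    return {b: e for b, e in encoders_by_byte(lines).items() if len(e) > 1}


if __name__ == "__main__":
    sample = [
        r"<U0074> \x74 |0",
        r"<UFF54> \x74 |1",
        r"<U2554> \xC9 |1",
        r"<U0E29> \xC9 |0",
    ]
    for byte, entries in shared_bytes(sample).items():
        listing = ", ".join(f"U+{cp:04X} (|{flag})" for cp, flag in entries)
        print(f"0x{byte:02X}: {listing}")
```

The output still needs a human to judge it: the full-width mapping to 0x74 is a deliberate fallback, while the box-drawing mapping to 0xC9 is the one questioned here.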

I filed a bug against ICU on this issue :

> Overall, based on these (revised) tests I still don't see a compelling
> reason to change the Encoding Standard.
Not based on Richard's test results. Nonetheless, the encoding rules for
single-byte encodings have to be revisited, IMHO, given what I wrote above.



Received on Tuesday, 2 September 2014 18:19:39 UTC