Re: Encoding single-byte tests from Martin J. Dürst on 2014-08-30 (www-international@w3.org from July to September 2014)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sat, 30 Aug 2014 19:20:19 +0900
To: Richard Ishida <ishida@w3.org>, www International <www-international@w3.org>, Anne van Kesteren <annevk@annevk.nl>, Philippe Le Hegaret <plh@w3.org>
Message-ID: <5401A563.409@it.aoyama.ac.jp>

Hello Richard,

Thanks for finding the reason for the differences between my tests and 
yours.

 > Your test files provided lines for those missing from the index, eg.
 >
 >          <td class='test' title='0xDB->U+FFFD'></td>

 > and indicate that you expect to get U+FFFD as a result, but other
 > characters appear as the textContent of the td, rather than U+FFFD.
 > I don't know where you got those characters from.

The textContent of the <td> is the *byte* value (0xDB in the example) 
that's being tested. If there's no entry in the index, and if my 
interpretation of the spec (point 4 of 
http://encoding.spec.whatwg.org/#single-byte-decoder, and 
http://encoding.spec.whatwg.org/#error) is correct, then the browser 
should convert this byte to U+FFFD. That's what the second half of the 
title attribute is saying.
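
To make that reading concrete, here is a minimal sketch in Python of how I 
understand the single-byte decoder steps, with errors handled in 
replacement mode. The names decode_single_byte and INDEX_EXCERPT are 
hypothetical, and the index holds only the two windows-874 entries quoted 
in this thread, not the full index.

```python
REPLACEMENT = "\uFFFD"

# Illustrative excerpt only (pointer -> code point), taken from the two
# index-windows-874 lines quoted in this thread. The real index has many
# more entries; pointers 91..94 are absent there too.
INDEX_EXCERPT = {
    90: 0x0E3A,  # THAI CHARACTER PHINTHU
    95: 0x0E3F,  # THAI CURRENCY SYMBOL BAHT
}


def decode_single_byte(data: bytes, index: dict) -> str:
    """Decode bytes with a single-byte index, emitting U+FFFD on errors."""
    out = []
    for byte in data:
        if byte < 0x80:
            # ASCII bytes map to the code point of the same value.
            out.append(chr(byte))
            continue
        # The pointer into the index is the byte minus 0x80.
        code_point = index.get(byte - 0x80)
        if code_point is None:
            # No index entry: the decoder signals an error, which in
            # replacement mode becomes U+FFFD.
            out.append(REPLACEMENT)
        else:
            out.append(chr(code_point))
    return "".join(out)
```

With this excerpt, byte 0xDA (pointer 90) decodes to U+0E3A, while byte 
0xDB (pointer 91, no entry) decodes to U+FFFD.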

 >
 > In my test files, I use &#xFFFD; for the textContent of the td, eg.
 >
 > "<td class='test' title='0xdb->U+FFFD'>&#xFFFD;</td>"+

 > and that appears to match the behaviour of the browsers.

I strongly hope that every browser we test will convert &#xFFFD; to 
U+FFFD. But that's HTML syntax, and not part of the encoding spec. So we 
shouldn't test it, and in particular shouldn't test it multiple times.

On the other hand, testing that the browser uses U+FFFD when the 
Encoding spec says so makes sense, and that's what I have done.
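
The difference between the two kinds of test cell can be sketched as 
follows. This is only an illustration in Python, not the script that 
generated either set of tests, and raw_byte_cell and char_ref_cell are 
hypothetical names. The first builds a cell whose content is the raw byte 
under test, so the browser's decoder has to produce U+FFFD. The second 
builds a pure-ASCII cell in which the HTML parser, not the decoder, 
produces U+FFFD.

```python
def _title(byte: int) -> bytes:
    # Title in the form used in this thread, e.g. 0xDB->U+FFFD.
    return f"0x{byte:02X}->U+FFFD".encode("ascii")


def raw_byte_cell(byte: int) -> bytes:
    """Cell whose content is the raw byte itself (exercises the decoder)."""
    return (b"<td class='test' title='" + _title(byte) + b"'>"
            + bytes([byte]) + b"</td>")


def char_ref_cell(byte: int) -> bytes:
    """Cell whose content is a character reference (exercises HTML syntax)."""
    return (b"<td class='test' title='" + _title(byte) + b"'>"
            + b"&#xFFFD;</td>")
```

For example, raw_byte_cell(0xDB) contains the single byte 0xDB between the 
tags, whereas char_ref_cell(0xDB) contains no non-ASCII byte at all.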

Regards,   Martin.


On 2014/08/29 20:55, Richard Ishida wrote:
> On 29/08/2014 11:51, "Martin J. Dürst" wrote:
>> On 2014/08/28 18:59, Richard Ishida wrote:
>>> On 28/08/2014 10:25, "Martin J. Dürst" wrote:
>>
>>>> On 2014/08/24 01:39, Richard Ishida wrote:
>>
>>>>> Those of you who saw that page before should note that the results are
>>>>> now slightly different. I haven't tracked down the cause, but I suspect
>>>>> that silent codepoint changes in my editor were to blame for the initial
>>>>> discrepancies.
>>
>> The differences between your earlier version of the tests and your later
>> version of the tests can be explained that way.
>>
>>>> I have tried to find such a case. I found that for windows-1253, my
>>>> tests give "expected "U+FFFD" but got "ª" (U+00AA)" for 0xAA, but Chrome
>>>> is listed green for windows-1253 (incl. aliases) at
>>>> http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases.
>>>>
>>>> My version of Chrome is "37.0.2062.94 m". I also found 8 errors for my
>>>> tests on windows-874.
>>
>> This difference (i.e. between my tests and your later version of the
>> tests) still remains unexplained.
>
> There are no correspondences listed in the Encoding index for those 8
> codepoints.  The file
> http://encoding.spec.whatwg.org/index-windows-874.txt has two adjacent
> lines:
>
>   90    0x0E3A    ฺ (THAI CHARACTER PHINTHU)
>   95    0x0E3F    ฿ (THAI CURRENCY SYMBOL BAHT)
>
> and ends at line 123.  This leaves an overall gap of 8 lines for which
> no correspondence is listed.
>
> Your test files provided lines for those missing from the index, eg.
>
>          <td class='test' title='0xDB->U+FFFD'></td>
>          <td class='test' title='0xDC->U+FFFD'></td>
>          <td class='test' title='0xDD->U+FFFD'></td>
>          <td class='test' title='0xDE->U+FFFD'></td>
>
> and
>
>          <td class='test' title='0xFC->U+FFFD'></td>
>          <td class='test' title='0xFD->U+FFFD'></td>
>          <td class='test' title='0xFE->U+FFFD'></td>
>          <td class='test' title='0xFF->U+FFFD'></td>
>
> and indicate that you expect to get U+FFFD as a result, but other
> characters appear as the textContent of the td, rather than U+FFFD.  I
> don't know where you got those characters from.
>
> In my test files, I use &#xFFFD; for the textContent of the td, eg.
>
> "<td class='test' title='0xdb->U+FFFD'>&#xFFFD;</td>"+
> "<td class='test' title='0xdc->U+FFFD'>&#xFFFD;</td>"+
> "<td class='test' title='0xdd->U+FFFD'>&#xFFFD;</td>"+
> "<td class='test' title='0xde->U+FFFD'>&#xFFFD;</td>"+
>
> and that appears to match the behaviour of the browsers.
>
>
> I think the windows-1253 problem you mention results from the same
> circumstance.  The Encoding index file has no line for pointer 42.
>
> 41    0x00A9    © (COPYRIGHT SIGN)
> 43    0x00AB    « (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)
>
> Your file says:
>
>          <td class='test' title='0xAA->U+FFFD'>ª</td>
>
> Does that solve the mystery?
>
> RI
>
Received on Saturday, 30 August 2014 10:21:01 UTC