Re: Encoding single-byte tests from Richard Ishida on 2014-08-29 (www-international@w3.org from July to September 2014)

From: Richard Ishida <ishida@w3.org>
Date: Fri, 29 Aug 2014 12:55:02 +0100
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, www International <www-international@w3.org>, Anne van Kesteren <annevk@annevk.nl>, Philippe Le Hegaret <plh@w3.org>
Message-ID: <54006A16.6030109@w3.org>

On 29/08/2014 11:51, "Martin J. Dürst" wrote:
> On 2014/08/28 18:59, Richard Ishida wrote:
>> On 28/08/2014 10:25, "Martin J. Dürst" wrote:
>
>>> On 2014/08/24 01:39, Richard Ishida wrote:
>
>>>> Those of you who saw that page before should note that the results are
>>>> now slightly different. I haven't tracked down the cause, but I suspect
>>>> that silent codepoint changes in my editor were to blame for the
>>>> initial
>>>> discrepancies.
>
> The differences between your earlier version of the tests and your later
> version of the tests can be explained that way.
>
>>> I have tried to find such a case. I found that for windows-1253, my
>>> tests give "expected "U+FFFD" but got "ª" (U+00AA)" for 0xAA, but Chrome
>>> is listed green for windows-1253 (incl. aliases) at
>>> http://www.w3.org/International/tests/repository/encoding/indexes/results-aliases.
>>>
>>>
>>> My version of Chrome is "37.0.2062.94 m". I also found 8 errors for my
>>> tests on windows-874.
>
> This difference (i.e. between my tests and your later version of the
> tests) still remains unexplained.

The Encoding index lists no mapping for those 8 byte values.  The file
http://encoding.spec.whatwg.org/index-windows-874.txt has two adjacent
lines:

  90 0x0E3A ฺ (THAI CHARACTER PHINTHU)
  95 0x0E3F ฿ (THAI CURRENCY SYMBOL BAHT)

and ends at pointer 123, so pointers 91-94 and 124-127 (bytes 0xDB-0xDE
and 0xFC-0xFF) are absent.  That leaves a total of 8 bytes for which no
mapping is listed.
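
A rough sketch of how such gaps could be found mechanically, assuming
the index file format (decimal pointer first on each line, '#' for
comments) and that pointer p stands for byte 0x80 + p.  The function
name missing_bytes is made up for illustration:

```python
def missing_bytes(index_text, size=128):
    """Return byte values (0x80-0xFF) with no entry in a single-byte index."""
    listed = set()
    for line in index_text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        # The first field is the decimal pointer; the code point and
        # character name that follow are not needed to find gaps.
        listed.add(int(line.split()[0]))
    # Assumption: pointer p corresponds to byte 0x80 + p.
    return [0x80 + p for p in range(size) if p not in listed]
```

Fed the text of an index file, this returns the bytes for which a
test would expect U+FFFD.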

Your test files include cells for the bytes missing from the index, e.g.

         <td class='test' title='0xDB->U+FFFD'></td>
         <td class='test' title='0xDC->U+FFFD'></td>
         <td class='test' title='0xDD->U+FFFD'></td>
         <td class='test' title='0xDE->U+FFFD'></td>

and

         <td class='test' title='0xFC->U+FFFD'></td>
         <td class='test' title='0xFD->U+FFFD'></td>
         <td class='test' title='0xFE->U+FFFD'></td>
         <td class='test' title='0xFF->U+FFFD'></td>

and indicate that you expect U+FFFD as the result, but characters other
than U+FFFD appear as the textContent of each td.  I don't know where
those characters came from.
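
One way to spot such cells, sketched under the assumption that every
cell follows the <td class='test' title='0xNN->U+XXXX'> pattern shown
above.  The name mismatched_cells and the regular expression are mine,
not part of either set of test files:

```python
import html
import re

# Matches cells such as <td class='test' title='0xAA->U+FFFD'>...</td>
CELL = re.compile(
    r"<td class='test' title='0x([0-9A-Fa-f]{2})->U\+([0-9A-Fa-f]{4,6})'>(.*?)</td>"
)

def mismatched_cells(markup):
    """List (byte, expected, actual) where a cell's text differs from its title."""
    found = []
    for byte_hex, cp_hex, content in CELL.findall(markup):
        expected = chr(int(cp_hex, 16))
        # Resolve references such as &#xFFFD; before comparing.
        actual = html.unescape(content)
        if actual != expected:
            found.append((int(byte_hex, 16), expected, actual))
    return found
```

An empty cell is reported as well, since its text is not the expected
character.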

In my test files, I use &#xFFFD; for the textContent of the td, e.g.

"<td class='test' title='0xdb->U+FFFD'>&#xFFFD;</td>"+
"<td class='test' title='0xdc->U+FFFD'>&#xFFFD;</td>"+
"<td class='test' title='0xdd->U+FFFD'>&#xFFFD;</td>"+
"<td class='test' title='0xde->U+FFFD'>&#xFFFD;</td>"+

and that appears to match the behaviour of the browsers.
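
As an illustration only, cells of that shape could be generated along
these lines.  The helper name unmapped_cell is hypothetical, and this
is not necessarily how the actual test files were produced:

```python
def unmapped_cell(byte):
    """Build a test cell for a byte that has no entry in the index."""
    # The title records the byte and the expected U+FFFD; the content
    # is the character reference rather than a literal character.
    return f"<td class='test' title='0x{byte:02x}->U+FFFD'>&#xFFFD;</td>"

def unmapped_cells(byte_values):
    """Return one cell per line for the given byte values."""
    return "\n".join(unmapped_cell(b) for b in byte_values)
```

Writing the reference instead of a literal character means an editor
has no character in the cell to alter silently.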


I think the windows-1253 problem you mention has the same cause.  The
Encoding index file has no line for pointer 42 (byte 0xAA):

41 0x00A9 © (COPYRIGHT SIGN)
43 0x00AB « (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)

Your file says:

         <td class='test' title='0xAA->U+FFFD'>ª</td>

Does that solve the mystery?

RI

Received on Friday, 29 August 2014 11:55:36 UTC