
Re: Tests for Encoding spec

From: 신정식 <jshin1987+w3@gmail.com>
Date: Wed, 21 Oct 2015 23:38:47 -0700
Message-ID: <CAE1ONj8=v7P1B14au8GjjoJzfeEsRmF85pQ46x6dQOU9zapSfA@mail.gmail.com>
To: Anne van Kesteren <annevk@annevk.nl>
Cc: Richard Ishida <ishida@w3.org>, www International <www-international@w3.org>
On Tue, Oct 20, 2015 at 1:19 AM, Jungshik SHIN (신정식) <jshin1987+w3@gmail.com> wrote:

>
>
> On Mon, Oct 19, 2015 at 11:45 AM, Jungshik SHIN (신정식) <jshin1987+w3@gmail.com> wrote:
>
>>
>>
>> On Mon, Oct 19, 2015 at 5:27 AM, Anne van Kesteren <annevk@annevk.nl>
>> wrote:
>>
>>> On Mon, Oct 19, 2015 at 2:03 PM, Richard Ishida <ishida@w3.org> wrote:
>>> > 1. i'd be happy to change the mechanism for identifying the output of
>>> > encoding if i knew how.  The problem, it seems to me, with generating
>>> form
>>> > submissions is that if you are not looking at the percent escapes
>>> themselves
>>> > (ie. comparing within the document, by which time the form submission
>>> > parameter has been converted to Unicode) you are reliant on decoding
>>> to work
>>> > for encoding results to be reliable.  It's ok to check the odd
>>> character
>>> > visually by checking the web address bar, but how to do that for tens
>>> of
>>> > thousands of characters?  I'd be very happy to know if you have a
>>> > suggestion.
>>>
>>> If you use application/x-www-form-urlencoded (the default) there will
>>> be no Unicode involved. Just percent-encoded bytes. So if you have
>>> something on the server that doesn't decode for you, you should be
>>> able to get at the raw bytes the browser used to encode.
>>>
>>>
>>>
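The point about raw bytes can be checked without a server at all: the percent escapes survive untouched in a URL's query, so they can be compared byte for byte without ever decoding to Unicode (a minimal JavaScript sketch; the URL and parameter below are hypothetical):

```javascript
// Hypothetical submitted URL: the query keeps the browser's raw
// percent-encoded bytes; nothing here decodes them to Unicode.
const submitted = "https://example.com/form?q=%E6%A8%82";
const raw = new URL(submitted).search.substr(1);
console.log(raw); // "q=%E6%A8%82" -- the raw escapes, byte for byte
```

The same idea works server-side: read the raw query string before any framework-level decoding runs, and the bytes the browser's encoder produced are visible verbatim.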
>> Richard, you can look at how Blink/WebKit's layout tests handle this
>> issue:
>>
>>
>> https://code.google.com/p/chromium/codesearch#chromium/src/third_party/WebKit/LayoutTests/fast/encoding/char-encoding.html
>>
>> The test checks only a handful of code points, but I guess it can be
>> expanded to cover all of them. Anyway, it can be a starting point.
>>
>>
>>
>>> > 2. i suspect that it's actually important for the mechanism of
>>> converting to
>>> > href values to work too, so i think that this may still be something
>>> that
>>> > needs fixing.  If what goes into the href value is not what the user
>>> > expected, then that is presumably problematic.
>>>
>>> Yeah, both should definitely work in the end. Everything needs to
>>> become predictable for developers.
>>>
>>
>> I agree. After sending my last email, I took a look at Richard's test and
>> found the same problem. I'll find out where href goes wrong in Chrome and
>> try to fix it.
>>
>
> In Chrome's DOM Inspector JS console, everything is fine (no NFC
> applied).
>
> > var a=document.createElement("a")
> undefined
> a
> <a>​</a>​
> > a.href="https://example.com/?x" + "樂樂" + "x"
> "https://example.com/?x樂樂x"
> > a.search.substr(1)
> "x%E6%A8%82%EF%A4%94x"
>
> It's also fine when the document encoding is UTF-8 (the two characters
> above do not lose their 'identity' by being folded into one).
>
> However, in EUC-KR, the distinction between them is lost, apparently
> because they're subjected to NFC normalization.
>
> I've just filed a Chrome bug:
> https://code.google.com/p/chromium/issues/detail?id=545383
>
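The two characters in the console session above are U+6A02 (樂) and U+F914, a CJK compatibility ideograph that NFC canonically maps back to U+6A02 — which is exactly the distinction being lost. A minimal sketch of the folding:

```javascript
// U+6A02 (unified ideograph) and U+F914 (CJK Compatibility Ideograph):
// distinct code points, but NFC folds the latter into the former.
const unified = "\u6A02";
const compat = "\uF914";
console.log(compat === unified);                  // false: distinct code points
console.log(compat.normalize("NFC") === unified); // true: NFC folds them together
```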

This bug was fixed. With Chrome's canary ('nightly') build, the EUC-KR
encoding tests now pass 100%. SJIS and EUC-JP encoding fail on only one
code point (which will be fixed when I update Chrome's mapping table per
the latest spec).
Big5 encoding fails on only 4 code points (ditto with SJIS/EUC-JP).

Jungshik



>
> Jungshik
>
>
>
>> Jungshik
>>
>>
>>
>>>
>>>
>>> --
>>> https://annevankesteren.nl/
>>>
>>
>>
>
Received on Thursday, 22 October 2015 06:39:17 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 22:41:09 UTC