Re: [findtext] note about character counts [I18N-ISSUE-496] (#4) from Takeshi Kanai on 2015-11-12 (public-webapps-github@w3.org from November 2015)

From: Takeshi Kanai <notifications@github.com>
Date: Wed, 11 Nov 2015 23:01:18 -0800
To: w3c/findtext <findtext@noreply.github.com>
Message-ID: <w3c/findtext/issues/4/156019050@github.com>

Here are test results for the String functions newly introduced in ES6.

`var yoshinoya = "𠮷野屋";`
The string consists of three characters. The first one lies outside the Unicode BMP, which means it cannot be represented in a single 16-bit code unit.

`var identical = yoshinoya === String.fromCodePoint(0x20BB7, 0x91ce, 0x5c4b) ? "yes" : "no";  /// yes`
`var identical = yoshinoya === String.fromCharCode(0xd842, 0xdfb7, 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes`
I guess fromCodePoint() splits each argument above 0xFFFF into two code units (a surrogate pair) and passes the results to fromCharCode(). Either way, the String object it produces is a sequence of code units, regardless of how it was created.
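
To make that guess concrete, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. `toCodeUnits` is a hypothetical helper written for illustration, not an engine function, and I have not checked how any engine actually implements fromCodePoint().

```javascript
// Hypothetical helper: encode one code point as UTF-16 code units.
function toCodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    return [codePoint];                    // BMP: a single code unit
  }
  const offset = codePoint - 0x10000;      // 20 significant bits remain
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return [high, low];
}

const units = toCodeUnits(0x20BB7).map(u => u.toString(16));
console.log(units); // [ 'd842', 'dfb7' ]
```

Feeding those two code units to String.fromCharCode() yields the same string as String.fromCodePoint(0x20BB7), which matches the two "yes" results above.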

`yoshinoya.codePointAt(0).toString(16); /// 20bb7`
`yoshinoya.charCodeAt(0).toString(16); /// d842`
Looks good.

`yoshinoya.codePointAt(1).toString(16); /// dfb7 !!!`
`yoshinoya.charCodeAt(1).toString(16); /// dfb7`

Not good. I was expecting codePointAt() to index by code point, but it appears to still index by code unit.
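
For comparison, a small sketch of the one ES6 facility I know of that does walk a string by code point: the string iterator, used here through Array.from(). The variable names are only for illustration, and the string is written with escapes so the snippet survives any encoding.

```javascript
// Same string as above ("𠮷野屋"), written with ES6 code point escapes.
const yoshinoya = "\u{20BB7}\u91CE\u5C4B";

// length counts code units; the string iterator yields whole code points.
const chars = Array.from(yoshinoya);

console.log(yoshinoya.length);                        // 4
console.log(chars.length);                            // 3
console.log(chars[1].codePointAt(0).toString(16));    // 91ce
```

So index 1 in the array is the second character, whereas index 1 in the string itself is the low surrogate of the first.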

Regarding edit distance, I think codePointAt() would work for it, but it calls for custom indexing that shifts the index whenever the code obtained falls in certain ranges, such as the low surrogates.
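
A minimal sketch of what such custom indexing could look like. `codePointAtIndex` is a hypothetical name, and instead of testing for a low surrogate after the fact it steps over two code units whenever the code point just read is above 0xFFFF, which should have the same effect for well-formed strings.

```javascript
// Hypothetical helper: return the code point at code-point index n,
// or undefined if n is out of range.
function codePointAtIndex(str, n) {
  let unitIndex = 0;                          // position in code units
  for (let i = 0; unitIndex < str.length; i++) {
    const cp = str.codePointAt(unitIndex);
    if (i === n) {
      return cp;
    }
    unitIndex += cp > 0xFFFF ? 2 : 1;         // skip both halves of a pair
  }
  return undefined;
}

const s = "\u{20BB7}\u91CE\u5C4B";            // "𠮷野屋"
console.log(codePointAtIndex(s, 0).toString(16)); // 20bb7
console.log(codePointAtIndex(s, 1).toString(16)); // 91ce
console.log(codePointAtIndex(s, 2).toString(16)); // 5c4b
```

Each lookup is linear in n, so for an edit-distance loop it is probably better to convert the string to an array of code points once and index into that.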


---
Reply to this email directly or view it on GitHub:
https://github.com/w3c/findtext/issues/4#issuecomment-156019050

Received on Thursday, 12 November 2015 07:02:27 UTC