Re: [findtext] note about character counts [I18N-ISSUE-496] from Takeshi Kanai via GitHub on 2015-11-12 (public-annotation@w3.org from November 2015)

From: Takeshi Kanai via GitHub <sysbot+gh@w3.org>
Date: Thu, 12 Nov 2015 07:01:17 +0000
To: public-annotation@w3.org
Message-ID: <issue_comment.created-156019050-1447311675-sysbot+gh@w3.org>

Here are the test results of newly introduced String functions in ES6.

`var yoshinoya = "𠮷野屋";`
The string consists of three letters. The first letter (U+20BB7) is 
outside the Unicode BMP, which means it cannot be represented in 
16 bits.

`var identical = yoshinoya === String.fromCodePoint(0x20BB7, 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes`
`var identical = yoshinoya === String.fromCharCode(0xd842, 0xdfb7, 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes`
I guess fromCodePoint() is a function which splits each arg greater 
than 0xffff into two surrogate code units and passes them to 
fromCharCode(). The resulting String object is then a sequence of 
code units, regardless of how it was created.
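
The guess above can be sketched as follows. This is only an 
illustration of the UTF-16 surrogate arithmetic, not the actual 
engine implementation; `toCodeUnits` is a hypothetical helper name 
that does not exist in ES6.

```javascript
// Hypothetical helper: split one code point into UTF-16 code units.
function toCodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    return [codePoint];                    // BMP: one code unit is enough
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
  return [high, low];
}

// U+20BB7 splits into the same pair passed to fromCharCode() above.
const units = toCodeUnits(0x20BB7);
console.log(units.map(u => u.toString(16)));
```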

`yoshinoya.codePointAt(0).toString(16); /// 20bb7`
`yoshinoya.charCodeAt(0).toString(16); /// d842`
Looks good.

`yoshinoya.codePointAt(1).toString(16); /// dfb7 !!!`
`yoshinoya.charCodeAt(1).toString(16); /// dfb7`

Not good. I was expecting codePointAt() to index by code point. It 
appears to me that it still indexes by code unit.
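
For comparison, here is a small sketch of the difference between 
counting code units and counting code points. It relies on the ES6 
string iterator (used by spread and for...of), which steps through a 
string by code point; the variable names are only for illustration.

```javascript
// Same string as above, written with escapes to avoid encoding issues.
const shopName = "\u{20BB7}\u91CE\u5C4B";

// length and numeric indexes count UTF-16 code units.
const unitCount = shopName.length;

// The string iterator yields one element per code point.
const letters = [...shopName];
const pointCount = letters.length;

console.log(unitCount, pointCount);
console.log(letters.map(ch => ch.codePointAt(0).toString(16)));
```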

Regarding edit distance, I think codePointAt() would work for it, 
but it calls for custom indexing that shifts the index whenever an 
obtained code falls in a specific range, such as the low surrogate 
range (0xdc00 to 0xdfff).
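
One possible shape for such custom indexing is sketched below. It is 
a variant of the idea: instead of testing for a low surrogate, it 
advances by two code units whenever codePointAt() returns a value 
above 0xffff. `codePointAtIndex` is a hypothetical name, and the 
linear scan is an assumption made for brevity, not a tuned 
implementation.

```javascript
// Hypothetical helper: return the n-th code point (0-based) of str,
// or undefined when n is past the end.
function codePointAtIndex(str, n) {
  let i = 0;                                // position in code units
  for (let count = 0; i < str.length; count++) {
    const cp = str.codePointAt(i);
    if (count === n) {
      return cp;
    }
    // A supplementary code point occupies two code units, so skip both.
    i += cp > 0xFFFF ? 2 : 1;
  }
  return undefined;
}

const sample = "\u{20BB7}\u91CE\u5C4B";
console.log(codePointAtIndex(sample, 1).toString(16));
```

An edit-distance routine could then compare strings through this 
helper, or convert each string to an array of code points once up 
front, which avoids rescanning from the start on every lookup.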


-- 
GitHub Notif of comment by tkanai
See https://github.com/w3c/findtext/issues/4#issuecomment-156019050

Received on Thursday, 12 November 2015 07:01:20 UTC