Re: [findtext] note about character counts [I18N-ISSUE-496]

You should be able to do Array.from(str).length to get the number of
symbols, I think.
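
A minimal sketch of that, using the string from the quoted test (the
variable names are only for illustration):

```javascript
// "𠮷野屋", written with escapes so the code points are explicit.
const yoshinoya = "\u{20BB7}\u91CE\u5C4B";

// .length counts UTF-16 code units; U+20BB7 needs two (a surrogate pair).
const codeUnits = yoshinoya.length;

// Array.from() uses the string iterator, which yields one item per code point.
const codePoints = Array.from(yoshinoya).length;

console.log(codeUnits, codePoints); // 4 3
```

Note that this counts code points, which is not always the same as
what a user perceives as one character.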

Not sure whether .normalize() called on the string first would make a
difference in the result.
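
As far as I can tell it can make a difference whenever the string
contains combining sequences that have a precomposed form. A small
sketch, assuming NFC is the normalization form of interest:

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT (renders as "é").
const decomposed = "e\u0301";

// NFC composes the pair into the single code point U+00E9.
const composed = decomposed.normalize("NFC");

const before = Array.from(decomposed).length;
const after = Array.from(composed).length;

console.log(before, after); // 2 1
```

Sequences without a precomposed form keep their extra code points
even after normalization, so this only narrows the gap.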

On Wed, Nov 11, 2015, 23:06 Takeshi Kanai via GitHub <sysbot+gh@w3.org>
wrote:

> Here are the test results of newly introduced String functions in ES6.
>
> `var yoshinoya = "𠮷野屋";`
> The string consists of three letters. The first letter is outside the
> Unicode BMP. It means it cannot be represented in a single 16-bit
> code unit.
>
> `var identical = yoshinoya === String.fromCodePoint(0x20BB7, 0x91ce,
> 0x5c4b) ? "yes" : "no";  /// yes`
> `var identical = yoshinoya === String.fromCharCode(0xd842, 0xdfb7,
> 0x91ce, 0x5c4b) ? "yes" : "no"; /// yes`
> I guess fromCodePoint() is a function which splits each arg greater
> than 0xffff in two, and passes the results to fromCharCode(). Then it
> generates a code-unit-based String object regardless of where it
> came from.
>
> `yoshinoya.codePointAt(0).toString(16); /// 20bb7`
> `yoshinoya.charCodeAt(0).toString(16); /// d842`
> Looks good.
>
> `yoshinoya.codePointAt(1).toString(16); /// dfb7 !!!`
> `yoshinoya.charCodeAt(1).toString(16); /// dfb7`
>
> Not good. I was expecting code-point basis indexing for codePointAt().
>  It appears to me it is still on code-unit basis indexing.
>
> Regarding edit distance, I think codePointAt() would work for it,
> but it calls for custom indexing which shifts the index in case the
> obtained code is in a specific range, such as the low-surrogate range.
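
The index shift described above could look roughly like this. It is
only a sketch: toCodePoints is a hypothetical helper name, and it
steps forward by two code units after a supplementary code point
instead of testing for a low surrogate, which amounts to the same walk.

```javascript
// Walk a string by code units, but advance two units whenever
// codePointAt() returns a code point above the BMP (a surrogate pair).
function toCodePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    i += cp > 0xffff ? 2 : 1; // skip the low surrogate of a pair
  }
  return result;
}

// "𠮷野屋" from the test above.
const hex = toCodePoints("\u{20BB7}\u91CE\u5C4B").map((cp) => cp.toString(16));
console.log(hex.join(" ")); // 20bb7 91ce 5c4b
```
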
>
>
> --
> GitHub Notif of comment by tkanai
> See https://github.com/w3c/findtext/issues/4#issuecomment-156019050
>
>

Received on Thursday, 12 November 2015 07:10:55 UTC