Re: [string-search] Updates including Kashmiri examples from Richard (#14)

> Users may also create visually identical (or very similar) graphemes from sequences of characters that are deprecated or unexpected by the Unicode Standard. For example, in some fonts it is possible to create something that looks like the independent vowel /au/ using the (normal) ஔ [[U+0B94 TAMIL LETTER AU](https://deploy-preview-14--string-search-w3c.netlify.app/scripts/tamil/block#char0B94)], or by typing two inappropriate individual letters, ஒள [[U+0B92 TAMIL LETTER O](https://deploy-preview-14--string-search-w3c.netlify.app/scripts/tamil/block#char0B92) + [U+0BB3 TAMIL LETTER LLA](https://deploy-preview-14--string-search-w3c.netlify.app/scripts/tamil/block#char0BB3)]. The latter should be avoided by users, but applications will need to decide whether or not to match such aberrations if they appear in the text.

The alternatives you use as examples are neither deprecated nor unexpected by Unicode. They are canonically equivalent precomposed and decomposed forms. So even if Unicode expresses a preference for the precomposed form, it's not really essential to choose one rather than the other, and normalisation will resolve the matching problem.
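Whether two sequences really are canonically equivalent, and so whether normalisation alone is enough for matching, can be checked mechanically. The following is a minimal Python sketch, not something taken from the spec. The helper names `canonically_equivalent` and `normalized_find` are hypothetical. The decomposed sequence it uses, U+0B92 + U+0BD7 TAMIL AU LENGTH MARK, is the canonical decomposition listed for U+0B94 in the Unicode Character Database. Results depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True when a and b normalise to the same NFC string."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def normalized_find(text: str, needle: str) -> int:
    """Search after normalising both sides to NFC.

    The returned index refers to the normalised text, not the original.
    """
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_text.find(nfc_needle)


PRECOMPOSED = "\u0b94"            # TAMIL LETTER AU
DECOMPOSED = "\u0b92\u0bd7"       # TAMIL LETTER O + TAMIL AU LENGTH MARK
QUOTED_SEQUENCE = "\u0b92\u0bb3"  # TAMIL LETTER O + TAMIL LETTER LLA (from the quoted text)

# Compare the precomposed letter against each candidate sequence.
print(canonically_equivalent(PRECOMPOSED, DECOMPOSED))
print(canonically_equivalent(PRECOMPOSED, QUOTED_SEQUENCE))

# A search for the precomposed letter in text that contains the decomposed form.
print(normalized_find("x" + DECOMPOSED, PRECOMPOSED))
```

Running the same check on any other candidate sequence shows whether normalisation would make it match, or whether an application would need extra handling for it.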

A much better example would be something like the vowel-sign constructions shown at https://r12a.github.io/scripts/devanagari/hi.html#vowelsign_encoding, https://r12a.github.io/scripts/bengali/bn.html#vowelsign_encoding2 or https://r12a.github.io/scripts/malayalam/ml.html#vowelsign_encoding2 (the first one at the last link is particularly good).

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/string-search/pull/14#issuecomment-1330910195 using your GitHub account



Received on Tuesday, 29 November 2022 16:25:34 UTC