Suggested indic/arabic edits for string search

hi Addison,

Here are some notes which i hope indicate ways i was thinking you could 
restructure the text at https://aphillips.github.io/string-search/ and 
incorporate some of the examples provided by the folks in India.  They 
aren't complete, because i wanted to stop in case you didn't like the 
direction i was pointing, so i may need to provide additional support later.

ri

---

2.1.9 Optional characters

In some orthographies it is necessary to match strings with different 
numbers of characters.

A prime example of this involves vowel diacritics in abjads.  For 
example, some languages that use the Arabic and Hebrew scripts do not 
require (but optionally allow) the user to input short vowels. (For some 
other languages in these scripts, the inclusion of the short vowels is 
not optional.) The presence or absence of vowels in the text being input 
or searched might impede a match if the user doesn't enter or know to 
enter them.

Example 7
Arabic, Persian, and Urdu users generally do not enter short vowels—but 
some texts do include them. Searching is affected by this, but meaning 
generally is not. A generalized description of this might be "optional 
to encode" sequences.
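
One way an implementation might treat such "optional to encode" sequences is to remove the optional marks from both the search term and the text before comparing. The following is only a minimal sketch: the helper names are hypothetical, and the choice of marks (U+064B to U+0650, the tanwin and short vowel signs) is an assumption. Whether to also ignore marks such as shadda or sukun, and whether to do this at all for a language where the short vowels are not optional, is a decision for the application.

```python
# Assumption: these Arabic marks are treated as optional for matching.
# U+064B..U+0650 = FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA.
OPTIONAL_MARKS = {chr(cp) for cp in range(0x064B, 0x0651)}


def strip_optional_marks(text: str) -> str:
    """Return text with the optional vowel marks removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in OPTIONAL_MARKS)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring the optional vowel marks."""
    return strip_optional_marks(a) == strip_optional_marks(b)


# KAF + FATHA, TEH + FATHA, BEH + FATHA (fully vowelled)
vowelled = "\u0643\u064E\u062A\u064E\u0628\u064E"
# KAF, TEH, BEH (as most users would type it)
unvowelled = "\u0643\u062A\u0628"

print(vowelled == unvowelled)               # False
print(loosely_equal(vowelled, unvowelled))  # True
```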



2.1.10 Visually identical text that is not canonically equivalent

Some languages have graphemes which can be encoded in more than one way. 
In some cases, these variations are handled by Unicode Normalization, 
but in other cases they are not considered equivalent by Unicode, even 
if they appear visually to be identical. Sometimes these variations are 
considered to be valid spelling variations. In other cases they are the 
result of a user's mistaken perception.

For example, a number of languages, such as Kashmiri (language tag ks), are 
written in the Arabic script but are unrelated to the Arabic language. 
These languages thus sometimes require character sequences to represent 
sounds not present in Arabic. A significant problem for some of these 
languages is that the specially-encoded character sequences can be 
visually similar (or identical) to other character sequences and users 
may experience difficulty entering or knowing how to enter the correct 
sequence, such as when inputting a term to search for:

<examples of xxx vs xxx here>

Users may also create visually identical (or very similar) graphemes 
from sequences of characters that are deprecated or unexpected by the 
Unicode Standard. For example, in some fonts it is possible to create 
something that looks like the independent vowel /au/ using the (normal) 
<span class="codepoint" translate="no"><bdi lang="ta">&#x0B94;</bdi> [<a 
href="/scripts/tamil/block#char0B94"><span class="uname">U+0B94 TAMIL 
LETTER AU</span></a>]</span>, or by typing two inappropriate individual 
letters, <span class="codepoint" translate="no"><bdi 
lang="ta">&#x0B92;&#x0BB3;</bdi> [<a 
href="/scripts/tamil/block#char0B92"><span class="uname">U+0B92 TAMIL 
LETTER O</span></a> + <a href="/scripts/tamil/block#char0BB3"><span 
class="uname">U+0BB3 TAMIL LETTER LLA</span></a>]</span>.  The latter 
should be avoided by users, but applications will need to decide whether 
or not to match such aberrations if they appear in the text.
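
To illustrate why Unicode Normalization alone does not help here, the sketch below compares the two Tamil sequences after normalization and then applies a custom fold. The function name and the fold rule are assumptions for illustration only. In particular, U+0B92 followed by U+0BB3 can also occur legitimately inside a word, so this sketch only folds the pair when the LLA carries no following vowel sign or virama, and even that rule may be too aggressive for some applications.

```python
import re
import unicodedata

TAMIL_AU = "\u0B94"           # TAMIL LETTER AU
O_THEN_LLA = "\u0B92\u0BB3"   # TAMIL LETTER O + TAMIL LETTER LLA

# Match O + LLA only when LLA is not followed by a Tamil vowel sign,
# virama, or AU length mark (i.e. when it is unlikely to be a real LLA).
_LOOKALIKE_AU = re.compile("\u0B92\u0BB3(?![\u0BBE-\u0BCD\u0BD7])")


def fold_lookalike_au(text: str) -> str:
    """Map the look-alike sequence to U+0B94 (hypothetical folding rule)."""
    text = unicodedata.normalize("NFC", text)
    return _LOOKALIKE_AU.sub(TAMIL_AU, text)


# The two sequences stay distinct under every normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = (unicodedata.normalize(form, TAMIL_AU)
            == unicodedata.normalize(form, O_THEN_LLA))
    print(form, same)

print(fold_lookalike_au(O_THEN_LLA) == TAMIL_AU)
```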



ADD TO 2.1.6

The spelling variants for US vs UK English are mostly standardised, 
but sometimes the spelling is down to personal preference (or 
sometimes lack of knowledge).  For example, the US English word 
'through' can be spelled 'thru'.

Indian languages also have many instances of this kind of problem.  
Again, sometimes this is down to misspellings, but in other cases either 
spelling is acceptable.

For example, the Bengali language (language tag bn) is notorious for 
having a wide range of spelling variations permitted by the language: 
nearly 80% of Bengali words have at least two spellings. Many words have 
3, 4, or more variations—with at least one word having 16 different 
valid spellings.

One example is the word which transliterates to the Latin script as 
rani, but which users may spell with different letters and vowel marks. 
In modern Bengali <span class="codepoint" translate="no"><bdi 
lang="bn">&#x09A3;</bdi> [<a 
href="/scripts/bengali/block#char09A3"><span class="uname">U+09A3 
BENGALI LETTER NNA</span></a>]</span> and <span class="codepoint" 
translate="no"><bdi lang="bn">&#x09A8;</bdi> [<a 
href="/scripts/bengali/block#char09A8"><span class="uname">U+09A8 
BENGALI LETTER NA</span></a>]</span> are pronounced /n/, and <span 
class="codepoint" translate="no"><bdi lang="bn">&#x09BF;</bdi> [<a 
href="/scripts/bengali/block#char09BF"><span class="uname">U+09BF 
BENGALI VOWEL SIGN I</span></a>]</span> and <span class="codepoint" 
translate="no"><bdi lang="bn">&#x09C0;</bdi> [<a 
href="/scripts/bengali/block#char09C0"><span class="uname">U+09C0 
BENGALI VOWEL SIGN II</span></a>]</span> are both pronounced /i/. 
Therefore different users may choose the following alternative code 
point sequences for the same word.
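
As a rough sketch of how an application could match such variants, the code below builds the four combinations from the two letter choices and two vowel-sign choices described above, and reduces them to a single search key. The base letters used for 'rani' (U+09B0 RA and U+09BE VOWEL SIGN AA), the helper name, and the folding table are all assumptions made for illustration. Folding like this loses information and could conflate words that really are different, so it would only suit a deliberately loose matching mode.

```python
# Assumption: fold NNA to NA and VOWEL SIGN II to VOWEL SIGN I for matching.
FOLD = str.maketrans({
    "\u09A3": "\u09A8",  # BENGALI LETTER NNA -> BENGALI LETTER NA
    "\u09C0": "\u09BF",  # BENGALI VOWEL SIGN II -> BENGALI VOWEL SIGN I
})


def bengali_search_key(text: str) -> str:
    """Return a loose matching key (hypothetical helper)."""
    return text.translate(FOLD)


RA, AA = "\u09B0", "\u09BE"  # assumed first syllable of 'rani'

# Every combination of NA/NNA with VOWEL SIGN I/II.
variants = [
    RA + AA + n + i
    for n in ("\u09A8", "\u09A3")
    for i in ("\u09BF", "\u09C0")
]

keys = {bengali_search_key(v) for v in variants}
print(len(set(variants)), len(keys))  # 4 1
```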

Other Indic scripts provide alternative mechanisms for representing 
particular sounds, and in most cases either approach is considered 
equally valid. The most common instance of this involves representation 
of syllable-final nasals.

For example, the 'n' in the word 'Hindi' in Hindi can be written using 
either a nasal consonant (a half glyph form when part of a conjunct), or 
using a diacritic. Both of the following are possible:

<show the 'hindi' examples here>
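
The two spellings can be brought together for matching by rewriting a nasal consonant plus virama as the anusvara when another consonant follows. This is a sketch only: the function name and the rule are assumptions, and the rule over-generates, because it also rewrites conjuncts where the nasal is not homorganic with the following consonant and where the anusvara spelling would not be used. It also does not deal with candrabindu.

```python
import re

ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA

# A nasal consonant (NGA, NYA, NNA, NA, MA) + VIRAMA, followed by a consonant.
_NASAL_CONJUNCT = re.compile(
    "[\u0919\u091E\u0923\u0928\u092E]\u094D(?=[\u0915-\u0939])"
)


def fold_final_nasal(text: str) -> str:
    """Rewrite nasal + virama before a consonant as anusvara (hypothetical)."""
    return _NASAL_CONJUNCT.sub(ANUSVARA, text)


# 'Hindi' with HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
HINDI_WITH_NA = "\u0939\u093F\u0928\u094D\u0926\u0940"
# 'Hindi' with HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II
HINDI_WITH_ANUSVARA = "\u0939\u093F\u0902\u0926\u0940"

print(HINDI_WITH_NA == HINDI_WITH_ANUSVARA)                    # False
print(fold_final_nasal(HINDI_WITH_NA) == HINDI_WITH_ANUSVARA)  # True
```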

In an additional twist to this story, two diacritics with different code 
points could be used here.  In our previous example we used <span 
class="codepoint" translate="no"><bdi lang="hi">&#x0902;</bdi> [<a 
href="/scripts/devanagari/block#char0902"><span class="uname">U+0902 
DEVANAGARI SIGN ANUSVARA</span></a>]</span> to represent the nasal 
sound because the accompanying vowel-sign rises above the hanging 
baseline. If the vowel-sign was one that didn't rise above the hanging 
baseline, we would normally use <span class="codepoint" 
translate="no"><bdi lang="hi">&#x0901;</bdi> [<a 
href="/scripts/devanagari/block#char0901"><span class="uname">U+0901 
DEVANAGARI SIGN CANDRABINDU</span></a>]</span> instead.  The function 
of both of these diacritics is the same, but their code points are 
different. <not sure whether this para is relevant, since it probably 
doesn't affect search….?>

The alternative use of letter or diacritic for syllable-final nasals is 
common to many other Indian languages.  For additional examples, see XXXX.






ADD TO 2.1.2
In many complex scripts it is possible to encode letters or vowel-signs 
in more than one way, but the alternatives are canonically equivalent.
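
A small illustration of this, using a Bengali vowel sign as the example (my choice, not taken from the spec text): U+09CB BENGALI VOWEL SIGN O can also be entered as U+09C7 followed by U+09BE. The two are canonically equivalent, so normalizing both sides before comparison is enough to make them match. The helper name is hypothetical.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


O_SINGLE = "\u09CB"       # BENGALI VOWEL SIGN O
O_SPLIT = "\u09C7\u09BE"  # BENGALI VOWEL SIGN E + BENGALI VOWEL SIGN AA

print(O_SINGLE == O_SPLIT)                  # False: different code points
print(canonically_equal(O_SINGLE, O_SPLIT)) # True: same after normalization
```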

Received on Wednesday, 2 November 2022 17:40:55 UTC