- From: Addison Phillips <addisoni18n@gmail.com>
- Date: Wed, 2 Nov 2022 16:35:38 -0700
- To: "'r12a'" <ishida@w3.org>
- Cc: "'Internationalization Working Group'" <public-i18n-core@w3.org>
- Message-ID: <013601d8ef13$d1078e20$7316aa60$@gmail.com>
Hi Richard,

My internet was out all day (until just now), so I was slow to see/respond to this. As usual, great stuff and I will start to incorporate it.

Addison

From: r12a <ishida@w3.org>
Sent: Wednesday, November 2, 2022 10:41 AM
To: Addison Phillips <addisonI18N@gmail.com>
Cc: Internationalization Working Group <public-i18n-core@w3.org>
Subject: Suggested indic/arabic edits for string search

hi Addison,

Here are some notes which i hope indicate ways i was thinking you could restructure the text at https://aphillips.github.io/string-search/ and incorporate some of the examples provided by the folks in India. They aren't complete, because i wanted to stop in case you didn't like the direction i was pointing, so i may need to provide additional support later.

ri

---

2.1.9 Optional characters

In some orthographies it is necessary to match strings with different numbers of characters. A prime example of this involves vowel diacritics in abjads. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.

Example 7

Arabic, Persian, and Urdu users generally do not enter short vowels, but some texts do include them. Searching is affected by this, but meaning generally is not.

A generalized description of this might be "optional to encode" sequences.

2.1.10 Visually identical text that is not canonically equivalent

Some languages have graphemes which can be encoded in more than one way. In some cases, these variations are handled by Unicode Normalization, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations.
In other cases they are the result of a user's mistaken perception.

For example, a number of languages, such as Kashmiri (language tag ks), are written in the Arabic script but are unrelated to the Arabic language. These languages thus sometimes require character sequences to represent sounds not present in Arabic. A significant problem for some of these languages is that the specially-encoded character sequences can be visually similar (or identical) to other character sequences and users may experience difficulty entering or knowing how to enter the correct sequence, such as when inputting a term to search for:

<examples of xxx vs xxx here>

Users may also create visually identical (or very similar) graphemes from sequences of characters that are deprecated or unexpected by the Unicode Standard. For example, in some fonts it is possible to create something that looks like the independent vowel /au/ using the (normal) ஔ [U+0B94 TAMIL LETTER AU], or by typing two inappropriate individual letters, ஒள [U+0B92 TAMIL LETTER O + U+0BB3 TAMIL LETTER LLA]. The latter should be avoided by users, but applications will need to decide whether or not to match such aberrations if they appear in the text.

ADD TO 2.1.6

The spelling variants for US vs UK English are mostly standardised; however, sometimes the spelling is down to personal preferences (or sometimes lack of knowledge). For example, the US English word 'through' can be spelled 'thru'.

Indian languages also have many instances of this kind of problem. Again, sometimes this is down to misspellings, but in other cases either spelling is acceptable.
For example, the Bengali language (language tag bn) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations, with at least one word having 16 different valid spellings.

One example is the word which transliterates to the Latin script as rani, but which users may spell with different letters and vowel marks. In modern Bengali ণ [U+09A3 BENGALI LETTER NNA] and ন [U+09A8 BENGALI LETTER NA] are both pronounced /n/, and ি [U+09BF BENGALI VOWEL SIGN I] and ী [U+09C0 BENGALI VOWEL SIGN II] are both pronounced /i/. Therefore different users may choose the following alternative code point sequences for the same word.

Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either approach is considered equally valid. The most common instance of this involves representation of syllable-final nasals. For example, the 'n' in the word 'Hindi' in Hindi can be written using either a nasal consonant (a half glyph form when part of a conjunct), or using a diacritic. Both of the following are possible:

<show the 'hindi' examples here>

In an additional twist to this story, two diacritics with different code points could be used here.
In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] instead. The function of both of these diacritics is the same, but their code points are different.

<not sure whether this para is relevant, since it probably doesn't affect search….?>

The alternative use of letter or diacritic for syllable-final nasals is common to many other Indian languages. For additional examples, see XXXX.

ADD TO 2.1.2

In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.
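
To make the distinction between canonically equivalent encodings (2.1.2) and merely look-alike encodings (2.1.10) concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `canonically_equal` and the variable names are hypothetical, chosen only for illustration. The sketch compares strings after NFC normalization, which is one common way to test canonical equivalence; it is not meant as the approach the string-search document prescribes.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b have the same NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Bengali vowel sign O: a single code point, or its two-part
# canonical decomposition.
bengali_o_single = "\u09CB"       # BENGALI VOWEL SIGN O
bengali_o_parts = "\u09C7\u09BE"  # VOWEL SIGN E + VOWEL SIGN AA

# Tamil AU: a single code point, its canonical decomposition, and the
# look-alike O + LLA sequence, which Unicode does not treat as equivalent.
tamil_au = "\u0B94"               # TAMIL LETTER AU
tamil_au_parts = "\u0B92\u0BD7"   # TAMIL LETTER O + TAMIL AU LENGTH MARK
tamil_o_lla = "\u0B92\u0BB3"      # TAMIL LETTER O + TAMIL LETTER LLA

print(canonically_equal(bengali_o_single, bengali_o_parts))  # True
print(canonically_equal(tamil_au, tamil_au_parts))           # True
print(canonically_equal(tamil_au, tamil_o_lla))              # False
```

Normalization takes care of the first two pairs. The third pair would need an additional, application-specific folding step if a search implementation decided to match it.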
Received on Wednesday, 2 November 2022 23:35:55 UTC