Re: [string-search] Requirements for Indian languages (#10)

@vermaprashant1 sorry it's taken me so long to get to this.  Here are my comments.

[1] I think the document would be much clearer if at the beginning you separated out more cleanly the various ways in which words can be encoded differently, and then _after that_ point out the consequences and proposed advice. I would start with a list of problem cases that would include the following:
1. spelling variants such as the alternation between syllable-final /n/ and nasalisation (eg. the word Hindi) – note that spelling variants occur in most languages, so it's something any search engine typically has to consider. What other common alternative spellings occur in Hindi besides LA vs LLA (which you mention almost in passing, without any examples)?  It would be good to have a list of at least the more common ones.
2. the choice of characters to represent nuktas (with a little more detail) – this is a little complicated in Devanagari because normalisation produces different results for different visual combinations, see https://r12a.github.io/scripts/devanagari/#nukta_encoding
3. inappropriate combinations that look the same visually – you don't mention these at all, but it's a significant issue for Indic scripts.  See examples of this for vowel-sign and independent vowel representation at https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
4. any combinations of combining characters with a single base that can be typed and stored in an order that causes problems. Often this is resolved during normalisation, but there are problematic cases that normalising the text does not resolve. Similar issues are motivating some folks involved with Unicode to produce rendering guidelines for Thai, Khmer and Arabic scripts; these advise reordering specific sequences of characters so as to produce consistent ordering and ensure that the text renders correctly when displayed.  Again, you don't mention any such combinations, and i haven't yet researched this for Devanagari either.
5. Matching needs to decide what to do when format characters appear in text, eg. ZWJ, ZWNJ.  In languages like Persian, these can affect the semantics of the text, but i suspect that in Devanagari that is not the case, and they can just be ignored.  It's worth checking the full list of invisible characters that may appear in Devanagari text.
6. Graphically similar but semantically different (confusable) code points - i would probably put the OM in this category.
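
To make items 2 and 3 concrete, here is a rough Python sketch using only the standard `unicodedata` module. The constant and helper names are mine, chosen for illustration. It shows that NFC leaves QA (U+0958) decomposed, because that code point is a composition exclusion, but composes NA + nukta into NNNA (U+0929), and that the look-alike sequence A + vowel sign AA is not unified with the letter AA at all.

```python
import unicodedata


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as a list of U+XXXX values for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Item 2: two nukta letters that NFC treats in opposite ways.
QA_PRECOMPOSED = "\u0958"        # DEVANAGARI LETTER QA
QA_SEQUENCE = "\u0915\u093c"     # KA + NUKTA
NNNA_PRECOMPOSED = "\u0929"      # DEVANAGARI LETTER NNNA
NNNA_SEQUENCE = "\u0928\u093c"   # NA + NUKTA

# Item 3: a sequence that looks like the letter AA but is not
# canonically equivalent to it, so normalisation leaves it alone.
AA_LETTER = "\u0906"             # DEVANAGARI LETTER AA
AA_LOOKALIKE = "\u0905\u093e"    # LETTER A + VOWEL SIGN AA

for label, sample in [
    ("QA precomposed", QA_PRECOMPOSED),
    ("NNNA sequence", NNNA_SEQUENCE),
    ("AA look-alike", AA_LOOKALIKE),
]:
    print(f"{label}: {codepoints(sample)} -> NFC {codepoints(nfc(sample))}")
```

Cases like the third one are the ones a matching specification would need to call out separately, since normalising beforehand does not help.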


Such an analysis would need to indicate which alternations in sequence are handled by normalisation. Normalisation should be expected as a given, always, before matching, so it's the ones that normalisation doesn't fix that we are particularly interested in.
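
As a minimal sketch of "normalise first, then match", assuming NFC as the normalisation form: the function names and the `ignore_joiners` flag are hypothetical, and dropping ZWJ/ZWNJ is only my suspicion from item 5 above, not an established rule for Devanagari.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def prepare(text: str, ignore_joiners: bool = False) -> str:
    """Normalise to NFC and, optionally, drop ZWJ/ZWNJ (an assumption)."""
    text = unicodedata.normalize("NFC", text)
    if ignore_joiners:
        text = text.replace(ZWJ, "").replace(ZWNJ, "")
    return text


def contains(haystack: str, needle: str, ignore_joiners: bool = False) -> bool:
    """Substring test performed on the prepared forms of both strings."""
    return prepare(needle, ignore_joiners) in prepare(haystack, ignore_joiners)


# A word stored with precomposed QA, searched with KA + NUKTA.
word = "\u0958\u093f\u0932\u093e"
query = "\u0915\u093c"
print(query in word)          # raw comparison
print(contains(word, query))  # comparison after normalisation

# A conjunct stored with an explicit ZWJ after the virama.
with_zwj = "\u0915\u094d\u200d\u0937"
without_zwj = "\u0915\u094d\u0937"
print(contains(with_zwj, without_zwj))
print(contains(with_zwj, without_zwj, ignore_joiners=True))
```

Anything that still fails to match after `prepare` is a candidate for the list of cases that normalisation doesn't fix.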

It would be interesting to explore what equivalences need to be made for string matching of identifiers (eg. the HTML/CSS case) vs. full text search.  For example, in English, spelling differences such as 'internationalization' vs. 'internationalisation' are not treated as equivalent when matching identifiers, and maybe the anusvara-conjunct alternation is the same.  In full text search, however, searching for one should probably find the other.
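
The distinction could be sketched roughly as follows. This is only an illustration under my own assumptions: the folding rule (a nasal consonant + virama before a stop of the same varga is folded to anusvara) is a simplification, and the function names are hypothetical. Identifier matching compares NFC forms only, while full text matching also applies the fold.

```python
import re
import unicodedata

VIRAMA = "\u094d"
ANUSVARA = "\u0902"

# (nasal, first stop, last stop) for each of the five vargas.
_VARGAS = [
    ("\u0919", "\u0915", "\u0918"),  # NGA before KA..GHA
    ("\u091e", "\u091a", "\u091d"),  # NYA before CA..JHA
    ("\u0923", "\u091f", "\u0922"),  # NNA before TTA..DDHA
    ("\u0928", "\u0924", "\u0927"),  # NA  before TA..DHA
    ("\u092e", "\u092a", "\u092d"),  # MA  before PA..BHA
]

# Nasal + virama, only when followed by a stop from the same varga.
_NASAL_CONJUNCT = re.compile(
    "|".join(f"{nasal}{VIRAMA}(?=[{lo}-{hi}])" for nasal, lo, hi in _VARGAS)
)


def fold_nasal(text: str) -> str:
    """Normalise, then fold homorganic nasal conjuncts to anusvara."""
    return _NASAL_CONJUNCT.sub(ANUSVARA, unicodedata.normalize("NFC", text))


def identifier_match(a: str, b: str) -> bool:
    """Strict comparison: canonical equivalence only."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def fulltext_match(a: str, b: str) -> bool:
    """Looser comparison: spelling variants are folded together."""
    return fold_nasal(a) == fold_nasal(b)


HINDI_WITH_ANUSVARA = "\u0939\u093f\u0902\u0926\u0940"
HINDI_WITH_CONJUNCT = "\u0939\u093f\u0928\u094d\u0926\u0940"

print(identifier_match(HINDI_WITH_ANUSVARA, HINDI_WITH_CONJUNCT))
print(fulltext_match(HINDI_WITH_ANUSVARA, HINDI_WITH_CONJUNCT))
```

A real search engine would need a linguistically reviewed rule set here; the point is only that the two use cases call for different levels of folding.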

[2] Section 2.2.
> It is requires by the Unicode to store and interchanged the characters in the same logical order or we can say that order that user typed through the keyboards

The initial sentence gives the impression that the Unicode Standard requires that users type keys on the keyboard in a particular order. What the standard actually says is that the stored order of characters _typically_ corresponds to the order in which they are typed, but there is no expectation at all about how the keyboard should actually function, as long as it produces an appropriate sequencing of characters in the end: combining marks after base characters, virama between conjunct parts, etc.
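
For what it's worth, the stored order is easy to inspect. The sketch below (helper name is mine, and the check is deliberately crude) flags combining marks that do not follow a letter or another mark. Note that the vowel sign I is stored after HA even though it is displayed to the left of it.

```python
import unicodedata


def orphan_marks(text: str) -> list[int]:
    """Return indexes of combining marks that do not follow a letter or mark.

    Crude on purpose: it looks only at general categories, so it knows
    nothing about Devanagari syllable structure.
    """
    bad = []
    prev_cat = None
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat.startswith("M") and (prev_cat is None or prev_cat[0] not in "LM"):
            bad.append(i)
        prev_cat = cat
    return bad


good = "\u0939\u093f"     # HA + VOWEL SIGN I, logical order
swapped = "\u093f\u0939"  # VOWEL SIGN I + HA, visual order

print([unicodedata.name(ch) for ch in good])
print(orphan_marks(good))
print(orphan_marks(swapped))
```

However the keyboard works, it is only the resulting sequence that a check like this (or a matching algorithm) ever sees.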

Given that, i'm not sure what point you want to make in section 2.2. Any decent keyboard should allow the user to produce good Unicode character sequences, and any keyboard that doesn't should be avoided.

[3] Are there different concerns for other languages that use Devanagari? For example, i'm thinking about the eyelash RA in Marathi.

[4] It would be much easier for me to review your document if it were available in HTML rather than PDF form.  I'd be able to make annotations on the document for my reference, and i'd be able to copy-paste examples for exploration without the junk that PDF produces.

Hope that helps.

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/string-search/issues/10#issuecomment-1033893471 using your GitHub account



Received on Wednesday, 9 February 2022 15:37:41 UTC