Re: [string-search] Requirements for Indian languages (#10) from Prashant Verma via GitHub on 2022-02-10 (public-i18n-archive@w3.org from January to March 2022)

From: Prashant Verma via GitHub <sysbot+gh@w3.org>
Date: Thu, 10 Feb 2022 05:22:47 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-1034505382-1644470565-sysbot+gh@w3.org>

Dear Richard,

Greetings.

Thanks for sharing your valuable input. As the document was submitted a
long time ago, we have already revised the character model requirements
document to add requirements for five more languages. These were collected
from various language experts. We will also go through your comments and
revise the documents accordingly wherever required. I will share the
revised version with you soon.

Thanks,

Prashant

On Wed, Feb 9, 2022 at 7:37 AM r12a ***@***.***> wrote:

> @vermaprashant1 <https://github.com/vermaprashant1> sorry it's taken me
> so long to get to this. Here are my comments.
>
> [1] I think the document would be much clearer if at the beginning you
> separated out more cleanly the various ways in which words can be encoded
> differently, and then *after that* point out the consequences and
> proposed advice. I would start with a list of problem cases that would
> include the following:
>
>    1. spelling variants such as the alternation between syllable-final
>    /n/ or nasalisation (eg. the word Hindi) – note that spelling variants
>    occur in most languages, and so it's something any search engine typically
>    has to consider - what other common alternative spellings occur in Hindi
>    besides LA vs LLA (which you mention almost in passing without any
>    examples)? It would be good to have a list of at least the more common ones.
>    2. the choice of characters to represent nuktas (with a little more
>    detail) – this is a little complicated in Devanagari because normalisation
>    produces different results for different visual combinations, see
>    https://r12a.github.io/scripts/devanagari/#nukta_encoding
>    3. inappropriate combinations that look the same visually – you don't
>    mention these at all, but it's a significant issue for Indic scripts. See
>    examples of this for vowel-sign and independent vowel representation at
>    https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and
>    https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
>    4. any combinations of combining characters with a single base that
>    can be typed and stored in an order that causes problems - often this is
>    resolved during normalisation, but there are problematic cases that are not
>    resolved by normalising the text - similar issues are motivating some folks
>    involved with Unicode to produce rendering guidelines for Thai, Khmer and
>    Arabic scripts - these advise reordering of specific sequences of
>    characters so as to produce consistent ordering and ensure that the text
>    renders correctly when displayed. Again, you don't mention any such
>    combinations, and i haven't researched this either yet for Devanagari.
>    5. Matching needs to decide what to do when format characters appear
>    in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the
>    semantics of the text, but i suspect that in Devanagari that is not the
>    case, and they can just be ignored. It's worth checking the full list of
>    invisible characters that may appear in Devanagari text.
>    6. Graphically similar but semantically different (confusable) code
>    points - i would probably put the OM in this category.
>
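Item 5 in the list above could be sketched as follows. This is only a
minimal illustration, assuming that ZWJ and ZWNJ are the only format
characters to ignore and that ignoring them is safe for the text in
question; the name strip_format_chars and the IGNORABLE set are
hypothetical, and the full list of invisible characters would still need
to be checked.

```python
# Assumption: only ZWNJ (U+200C) and ZWJ (U+200D) are treated as ignorable.
IGNORABLE = {"\u200c", "\u200d"}


def strip_format_chars(text: str) -> str:
    """Return text with the ignorable format characters removed."""
    return "".join(ch for ch in text if ch not in IGNORABLE)


ksha = "\u0915\u094d\u0937"             # KA + VIRAMA + SSA
ksha_zwnj = "\u0915\u094d\u200c\u0937"  # same, with ZWNJ after the virama
ksha_zwj = "\u0915\u094d\u200d\u0937"   # same, with ZWJ after the virama

print(strip_format_chars(ksha_zwnj) == ksha)
print(strip_format_chars(ksha_zwj) == ksha)
```
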
> Such an analysis would need to indicate which alternations in sequence are
> handled by normalisation. Normalisation should be expected as a given,
> always, before matching, so it's the ones that normalisation doesn't fix
> that we are particularly interested in.
>
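One way to sort alternations into "fixed by normalisation" and "not fixed"
is to compare the NFC forms of each pair. The sketch below uses Python's
standard unicodedata module; the helper name same_after_nfc and the choice
of sample pairs are mine, picked from the nukta and vowel-sign cases in
items 2 and 3.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """True if the two strings are identical once both are in NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


PAIRS = {
    "NNNA vs NA + nukta": ("\u0929", "\u0928\u093c"),
    "QA vs KA + nukta": ("\u0958", "\u0915\u093c"),
    "AA vs A + vowel sign AA": ("\u0906", "\u0905\u093e"),
}

for label, (a, b) in PAIRS.items():
    print(label, same_after_nfc(a, b))
```

The two nukta pairs compare equal after NFC, although NFC keeps NNNA
precomposed and decomposes QA. The vowel pair looks the same on screen but
stays different, so it is a case that matching would have to handle
separately.
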
> It would be interesting to explore what equivalences need to be made for
> string matching of identifiers (eg. the HTML/CSS case) vs. full text
> search. For example, in English, spelling differences such as
> 'internationalization' vs. 'internationalisation' are not seen as
> equivalent, and maybe the anusvara-conjunct alternation is the same. In
> full text search, however, searching for one should probably find the other.
>
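The distinction between identifier matching and full text search could be
sketched with two comparison keys. This is an illustration under
assumptions, not a proposed rule: identifier_key and search_key are
hypothetical names, and the folding shown (a nasal consonant plus virama
before a stop of the same class is treated like anusvara) is a simplified
stand-in for whatever equivalences language experts would actually
specify.

```python
import re
import unicodedata

ANUSVARA = "\u0902"
VIRAMA = "\u094d"

# Assumption: each nasal consonant maps to the stops of its own class.
HOMORGANIC = {
    "\u0919": "\u0915-\u0918",  # NGA before KA..GHA
    "\u091e": "\u091a-\u091d",  # NYA before CA..JHA
    "\u0923": "\u091f-\u0922",  # NNA before TTA..DDHA
    "\u0928": "\u0924-\u0927",  # NA before TA..DHA
    "\u092e": "\u092a-\u092d",  # MA before PA..BHA
}
_PATTERNS = [
    re.compile(f"{nasal}{VIRAMA}(?=[{stops}])")
    for nasal, stops in HOMORGANIC.items()
]


def identifier_key(text: str) -> str:
    """Key for identifier matching: normalisation only."""
    return unicodedata.normalize("NFC", text)


def search_key(text: str) -> str:
    """Key for full text search: NFC plus the hypothetical nasal folding."""
    key = unicodedata.normalize("NFC", text)
    for pattern in _PATTERNS:
        key = pattern.sub(ANUSVARA, key)
    return key


hindi_anusvara = "\u0939\u093f\u0902\u0926\u0940"        # spelled with anusvara
hindi_conjunct = "\u0939\u093f\u0928\u094d\u0926\u0940"  # spelled with NA + virama

print(identifier_key(hindi_anusvara) == identifier_key(hindi_conjunct))
print(search_key(hindi_anusvara) == search_key(hindi_conjunct))
```

With these keys the two spellings of the word Hindi stay distinct as
identifiers but match in search.
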
> [2] Section 2.2.
>
> > It is requires by the Unicode to store and interchanged the characters in
> > the same logical order or we can say that order that user typed through the
> > keyboards
>
> The initial sentence gives the impression that the Unicode Standard
> requires that users type keys on the keyboard in a particular order. What
> the standard actually says is that the stored order of characters
> *typically* corresponds to the order in which they are typed, but there
> is no expectation at all about how the keyboard should actually function,
> as long as it produces an appropriate sequencing of characters in the end:
> combining marks after base characters, virama between conjunct parts, etc.
>
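The stored order described above can be inspected by listing the code
points of a string. A small sketch using Python's unicodedata module
(the helper name code_points is hypothetical):

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List each character of text as 'U+XXXX NAME', in stored order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# KA + vowel sign I: the vowel sign is displayed to the left of the
# consonant, but it is stored after its base character.
for entry in code_points("\u0915\u093f"):
    print(entry)

# A conjunct: the virama is stored between the two consonants.
for entry in code_points("\u0915\u094d\u0937"):
    print(entry)
```
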
> Given that, i'm not sure what point you want to make in section 2.2. Any
> decent keyboard should allow the user to produce good Unicode character
> sequences, and any kbd that doesn't should be avoided.
>
> [3] Are there different concerns for other languages using Devanagari? -
> eg. i'm thinking about the eye-lash RA in Marathi.
>
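On the eyelash RA, a sketch of why it may matter for matching. It rests
on an assumption that should be verified against the Devanagari section
of the Unicode Standard: that the eyelash form can be written either as
RRA + virama or as RA + virama + ZWJ. If that holds, normalisation does
not unify the two, and dropping ZWJ from the second sequence leaves plain
RA + virama, which is a different thing.

```python
import unicodedata

# Assumption: both sequences are used to represent eyelash RA.
EYELASH_RRA = "\u0931\u094d"           # RRA + VIRAMA
EYELASH_RA_ZWJ = "\u0930\u094d\u200d"  # RA + VIRAMA + ZWJ
RA_VIRAMA = "\u0930\u094d"             # RA + VIRAMA, no joiner


def nfc(text: str) -> str:
    """Normalise text to NFC."""
    return unicodedata.normalize("NFC", text)


# The two candidate encodings remain different after normalisation.
print(nfc(EYELASH_RRA) == nfc(EYELASH_RA_ZWJ))

# Removing ZWJ collapses the second encoding into plain RA + VIRAMA.
print(EYELASH_RA_ZWJ.replace("\u200d", "") == RA_VIRAMA)
```
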
> [4] It would be very much easier for me to review your document if it was
> available in HTML, rather than PDF form. I'd be able to make annotations on
> the document for my reference, and i'd be able to copy-paste examples for
> exploration without the junk that PDF produces.
>
> hope that helps.


-- 
Thanks & Regards,

Prashant Verma | Program Manager
Web Standardization Initiative (WSI), MeitY
New Delhi
Cell : +91-8800521042
Website : *http://tdil.meity.gov.in/WSI/AboutWSI.aspx
<http://tdil.mit.gov.in/WSI/AboutWSI.aspx>*


-- 
GitHub Notification of comment by vermaprashant1
Please view or discuss this issue at https://github.com/w3c/string-search/issues/10#issuecomment-1034505382 using your GitHub account


-- 
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config
Received on Thursday, 10 February 2022 05:22:51 UTC