- From: Prashant Verma via GitHub <sysbot+gh@w3.org>
- Date: Thu, 10 Feb 2022 05:22:47 +0000
- To: public-i18n-archive@w3.org
Dear Richard,

Greetings. Thanks for sharing your valuable input.

As it has been a long time, we have already revised the character model requirements document to add requirements for five more languages. These were collected from various language experts.

We will also go through your comments and revise the documents accordingly wherever required. I will share the revised document with you soon.

Thanks,
Prashant

On Wed, Feb 9, 2022 at 7:37 AM r12a ***@***.***> wrote:

> @vermaprashant1 <https://github.com/vermaprashant1> sorry it's taken me
> so long to get to this. Here are my comments.
>
> [1] I think the document would be much clearer if at the beginning you
> separated out more cleanly the various ways in which words can be encoded
> differently, and then *after that* point out the consequences and
> proposed advice. I would start with a list of problem cases that would
> include the following:
>
>    1. spelling variants such as the alternation between syllable-final
>    /n/ or nasalisation (eg. the word Hindi) – note that spelling variants
>    occur in most languages, and so it's something any search engine typically
>    has to consider - what other common alternative spellings occur in Hindi
>    besides LA vs LLA (which you mention almost in passing without any
>    examples)? It would be good to have a list of at least the more common ones.
>    2. the choice of characters to represent nuktas (with a little more
>    detail) – this is a little complicated in Devanagari because normalisation
>    produces different results for different visual combinations, see
>    https://r12a.github.io/scripts/devanagari/#nukta_encoding
>    3. inappropriate combinations that look the same visually – you don't
>    mention these at all, but it's a significant issue for indic scripts. See
>    examples of this for vowel-sign and independent vowel representation at
>    https://r12a.github.io/scripts/devanagari/#vowelsign_encoding and
>    https://r12a.github.io/scripts/devanagari/#vowelsign_encoding2
>    4. any combinations of combining characters with a single base that
>    can be typed and stored in an order that causes problems - often this is
>    resolved during normalisation, but there are problematic cases that are not
>    resolved by normalising the text - similar issues are motivating some folks
>    involved with Unicode to produce rendering guidelines for Thai, Khmer and
>    Arabic scripts - these advise reordering of specific sequences of
>    characters so as to produce consistent ordering and ensure that the text
>    renders correctly when displayed. Again, you don't mention any such
>    combinations, and i haven't researched this either yet for Devanagari.
>    5. Matching needs to decide what to do when format characters appear
>    in text, eg. ZWJ, ZWNJ. In languages like Persian, these can affect the
>    semantics of the text, but i suspect that in Devanagari that is not the
>    case, and they can just be ignored. It's worth checking the full list of
>    invisible characters that may appear in Devanagari text.
>    6. Graphically similar but semantically different (confusable) code
>    points - i would probably put the OM in this category.
>
> Such an analysis would need to indicate which alternations in sequence are
> handled by normalisation. Normalisation should be expected as a given,
> always, before matching, so it's the ones that normalisation doesn't fix
> that we are particularly interested in.
>
> It would be interesting to explore whether what equivalences need to be
> made for string matching of identifiers (eg. the HTML/CSS case) vs. full
> text search. For example, in english spelling differences such as
> 'internationalization' vs. 'internationalisation' are not seen as
> equivalent, and maybe the anusvara-conjunct alternate is the same. In full
> text search, however, searching for one should probably find the other.
>
> [2] Section 2.2.
>
> It is requires by the Unicode to store and interchanged the characters in
> the same logical order or we can say that order that user typed through the
> keyboards
>
> The initial sentence gives the impression that the Unicode Standard
> requires that users type keys on the keyboard in a particular order. What
> the standard actually says is that the stored order of characters
> *typically* corresponds to the order in which they are typed, but there
> is no expectation at all about how the keyboard should actually function,
> as long as it produces an appropriate sequencing of characters in the end:
> combining marks after base characters, virama between conjunct parts, etc.
>
> Given that, i'm not sure what point you want to make in section 2.2. Any
> decent keyboard should allow the user to produce good Unicode character
> sequences, and any kbd that doesn't should be avoided.
>
> [3] Are there different concerns for other languages using devanagari? -
> eg. i'm thinking about the eye-lash RA in Marathi.
>
> [4] It would be very much easier for me to review your document if it was
> available in HTML, rather than PDF form. I'd be able to make annotations on
> the document for my reference, and i'd be able to copy-paste examples for
> exploration without the junk that PDF produces.
>
> hope that helps.

--
Thanks & Regards,
Prashant Verma | Program Manager
Web Standardization Initiative (WSI), MeitY
New Delhi
Cell: +91-8800521042
Website: http://tdil.meity.gov.in/WSI/AboutWSI.aspx <http://tdil.mit.gov.in/WSI/AboutWSI.aspx>

--
GitHub Notification of comment by vermaprashant1
Please view or discuss this issue at https://github.com/w3c/string-search/issues/10#issuecomment-1034505382 using your GitHub account

--
Sent via github-notify-ml as configured in https://github.com/w3c/github-notify-ml-config
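
Editorial illustration (not part of the original message): points 2 and 5 of the quoted review, on nukta encoding and format characters, can be sketched with Python's standard `unicodedata` module. This is only a rough sketch under stated assumptions. The helper names `match_key` and `same_for_matching` are hypothetical, and dropping ZWJ/ZWNJ reflects the reviewer's suspicion that they can be ignored in Devanagari, which he says is worth checking against the full list of invisible characters.

```python
import unicodedata

# Assumption: only ZWNJ (U+200C) and ZWJ (U+200D) are treated as ignorable here.
# The review suggests checking the full list of invisible characters.
IGNORABLE = {"\u200C", "\u200D"}


def match_key(text: str) -> str:
    """Hypothetical helper: NFC-normalise, then drop the ignorable format characters."""
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if ch not in IGNORABLE)


def same_for_matching(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings by their matching keys."""
    return match_key(a) == match_key(b)


# NA + NUKTA composes to the precomposed letter NNNA (U+0929) under NFC ...
nna_composed = unicodedata.normalize("NFC", "\u0928\u093C")

# ... whereas precomposed QA (U+0958) is a composition exclusion, so NFC
# leaves it decomposed as KA + NUKTA.
qa_decomposed = unicodedata.normalize("NFC", "\u0958")

# KA + VIRAMA + ZWJ + SSA and KA + VIRAMA + SSA compare equal once ZWJ is dropped.
zwj_ignored = same_for_matching("\u0915\u094D\u200D\u0937", "\u0915\u094D\u0937")
```

The two nukta cases end up in different shapes after the same normalisation step (one precomposed, one decomposed), which seems to be what the review means by normalisation producing different results for different visual combinations. Both spellings of each letter still compare equal after normalising, so matching on the normalised form should be safe for these cases.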
Received on Thursday, 10 February 2022 05:22:51 UTC