RE: NNBSP Impact from jrmt@almas.co.jp on 2015-07-30 (public-i18n-mongolian@w3.org from July to September 2015)

From: <jrmt@almas.co.jp>
Date: Thu, 30 Jul 2015 09:09:27 +0900
To: "'Andrew West'" <andrewcwest@gmail.com>, "'Greg Eck'" <greck@postone.net>
Cc: <public-i18n-mongolian@w3.org>, "'Asmus Freytag'" <asmusf@ix.netcom.com>
Message-ID: <007401d0ca5c$009dadc0$01d90940$@almas.co.jp>
Hi All,

I am out of the office and will not be able to write a detailed response to all of the members' discussion.

But I would like to say that, when we want to solve the NNBSP problem in Mongolian, there are two possibilities:
1. Change the NNBSP Unicode properties to match Mongolian suffix requirements.
2. Replace NNBSP with a new character, U180F.

According to Andrew West's following discussion,
> A further problem with encoding a new character is that when it is eventually supported by fonts and rendering systems, 
> Mongolian text with NNBSP and Mongolian text with the new character will look the same to end users, 
> with the result that users will start to complain that internet searches and local find/replace functions do not work correctly for 
> Mongolian because searching for a Mongolian word with the new character will not match the same word with NNBSP and vice versa.  
> And this problem will never go away, because no-one is going to magically change existing Mongolian data, 
> and input methods and users will continue to use NNBSP in place of the new character for years to come 
> -- why not? they both look the same and produce the same visual result.

Actually, there are already many cases in Mongolian Unicode where different code sequences produce the same visual result. Almost 80% of words can currently be entered with more than one code sequence.
This was my objection to the proposal at the very beginning, in the 1990s. But it was eventually accepted by ISO/IEC 10646 and Unicode, so we have to bear with it in Mongolian now.
For example, the word Mongolian ( ᠮᠣᠩᠭᠤᠯ ) has at least 4 different possible spellings (U18xx code points) with the same visual result:
ᠮᠣᠩᠭᠤᠯ   -- (U182E+U1823+U1829+U182D+U1823+U182F)
ᠮᠣᠩᠭᠤᠯ   -- (U182E+U1823+U1829+U182D+U1824+U182F)
ᠮᠣᠩᠭᠤᠯ   -- (U182E+U1824+U1829+U182D+U1823+U182F)
ᠮᠣᠩᠭᠤᠯ   -- (U182E+U1824+U1829+U182D+U1824+U182F)
Maybe only one or two of these are correct spellings and the others are wrong, but users cannot tell which ones are wrong!
I am not listing these in order to solve this problem now.
What I want to say is that introducing a new character to Mongolian as a Suffix Separator will not actually cause such a big problem for Mongolian itself.
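
To make the point about the four spellings concrete, here is a minimal sketch in Python. It only builds the four code point sequences listed above and checks that they are distinct strings which Unicode normalization does not unify. The function name is just for illustration, and whether the four strings look identical on screen depends on the font and rendering engine.

```python
import unicodedata
from itertools import product

def mongol_spellings():
    """Build the four code point sequences listed above.

    The 2nd and 5th letters are each either U+1823 (O) or U+1824 (U).
    """
    spellings = []
    for second, fifth in product("\u1823\u1824", repeat=2):
        spellings.append("\u182E" + second + "\u1829\u182D" + fifth + "\u182F")
    return spellings

def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join("U+%04X" % ord(ch) for ch in text)

if __name__ == "__main__":
    variants = mongol_spellings()
    for s in variants:
        print(code_points(s))
    # All four are different strings, so plain comparison treats them as different words.
    print(len(set(variants)))
    # Normalization leaves each sequence unchanged, so it does not merge them either.
    print(all(unicodedata.normalize("NFKC", s) == s for s in variants))
```

So any search or comparison that works on code points sees four different words here, even when the user sees one.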

I have to leave for a meeting now; let me write more this evening.

Thanks and Best Regards,

Jirimutu
==========================================================
Almas Inc.
101-0021 601 Nitto-Bldg, 6-15-11, Soto-Kanda, Chiyoda-ku, Tokyo
E-Mail: jrmt@almas.co.jp   Mobile : 090-6174-6115
Phone : 03-5688-2081,   Fax : 03-5688-2082
http://www.almas.co.jp/   http://www.compiere-japan.com/
==========================================================



-----Original Message-----
From: Andrew West [mailto:andrewcwest@gmail.com] 
Sent: Wednesday, July 29, 2015 5:42 PM
To: Greg Eck
Cc: public-i18n-mongolian@w3.org; Asmus Freytag
Subject: Re: NNBSP Impact

Hi Greg,

On 29 July 2015 at 04:26, Greg Eck <greck@postone.net> wrote:
>
> 1.) NNBSP gives the following problems in the current Mongolian script
> Utilities functionality
>
> - Considered to be a space in the case of most programming
> languages and embedded routines and therefore gives undesired results
> in parsing processes

Could you explain (with concrete examples if possible) exactly what undesirable results result from NNBSP being a space character?
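
As one possible illustration of the kind of concrete example meant here, the following is a minimal Python sketch of default whitespace tokenization splitting a stem from its suffix at NNBSP. The stem and suffix letters are assumed sample data, and the function names are hypothetical; other languages and libraries may behave differently.

```python
import re
import unicodedata

NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

# Assumed sample data: a stem, NNBSP, then two suffix letters.
STEM = "\u182E\u1823\u1829\u182D\u1823\u182F"
SUFFIX = "\u1824\u1828"
WORD = STEM + NNBSP + SUFFIX

def whitespace_tokens(text):
    """Split on any Unicode whitespace, as str.split() does by default."""
    return text.split()

def regex_tokens(text):
    """Split using the regex whitespace class, which also matches NNBSP."""
    return re.findall(r"\S+", text)

if __name__ == "__main__":
    print(unicodedata.category(NNBSP))    # Zs (space separator)
    print(NNBSP.isspace())                # True
    print(len(whitespace_tokens(WORD)))   # 2: stem and suffix become separate tokens
    print(len(regex_tokens(WORD)))        # 2
```

A parser built on either routine would treat the suffix as a separate token rather than as part of the word.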

> - Breaks the word as seen in word counting, word jumping, sorting,
> parsing

I can understand the issues with word selection, word counting and word navigation, which I have verified exist in some software, notably Word (but not all software -- Notepad and BabelPad both behave as desired), but I am not sure what specific issue "parsing" refers to, and I would like to see an example of incorrect sorting behaviour that I can test using the Unicode Collation Algorithm (UCA).

If you are going to make a proposal for a new character you will need to give specific examples of incorrect behaviour, and explain why this incorrect behaviour cannot be remedied by tweaking Unicode properties or the UCA.  On the Unicode internal (Unicore) mailing list Asmus Freytag suggested that the word break property of NNBSP could be changed so that by default there would be no word break when the character before and after it belonged to the same category (e.g. both letters, as is the case for Mongolian).  Making this change should solve the word boundary issue, as early as Unicode 9.0 next June if someone makes a proposal to the UTC soon, but encoding a new character will take at least two years, possibly much longer if there is opposition from ISO national bodies.
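
The effect of the suggested property change can be sketched roughly as follows in Python. This is a simplified illustration of the idea (no word break at NNBSP when both neighbours are letters), not the actual Unicode word segmentation algorithm. The function names and sample strings are assumptions.

```python
import unicodedata

NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

def is_letter(ch):
    """True if the character has a letter general category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")

def split_words(text):
    """Split on whitespace, but keep NNBSP inside a word when it sits between two letters."""
    words, current = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            current.append(ch)
            continue
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if (ch == NNBSP and prev_ch and next_ch
                and is_letter(prev_ch) and is_letter(next_ch)):
            current.append(ch)  # letter + NNBSP + letter: no word break
            continue
        if current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

if __name__ == "__main__":
    stem = "\u182E\u1823\u1829\u182D\u1823\u182F"
    suffix = "\u1824\u1828"  # assumed sample suffix letters
    text = stem + NNBSP + suffix + " " + stem
    print(len(split_words(text)))        # 2: stem+suffix stays one word
    print(split_words("1" + NNBSP + "000"))  # ['1', '000']: digits still break
```

Because the rule looks only at the neighbouring characters, it applies to existing text that already contains NNBSP without any change to the data.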

It may take a while before Word catches up with changes to the word break property, but it would take even longer for Word to support a new character.  In my opinion, the main advantage of property change over encoding a new character is that the property change will fix existing Mongolian text, whereas the new character will have no effect on existing Mongolian text, and users will still complain that word selection etc. does not work for pre-new-character Mongolian text (and users will not even start to use the new character until it is not displayed as an empty box on their system, and it produces the expected shaping behaviour, which will probably be several years after the several years to get it encoded).

A further problem with encoding a new character is that when it is eventually supported by fonts and rendering systems, Mongolian text with NNBSP and Mongolian text with the new character will look the same to end users, with the result that users will start to complain that internet searches and local find/replace functions do not work correctly for Mongolian because searching for a Mongolian word with the new character will not match the same word with NNBSP and vice versa.  And this problem will never go away, because no-one is going to magically change existing Mongolian data, and input methods and users will continue to use NNBSP in place of the new character for years to come -- why not? they both look the same and produce the same visual result.
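
The search mismatch can be sketched in a few lines of Python. The new separator's code point is an assumption here (U+180F is the value proposed in this thread, and nothing below depends on its eventual properties). The sample letters and function names are also hypothetical.

```python
NNBSP = "\u202F"          # NARROW NO-BREAK SPACE, used in existing data
NEW_SEPARATOR = "\u180F"  # assumption: code point proposed in this thread

# Assumed sample data: a stem followed by two suffix letters.
STEM = "\u182E\u1823\u1829\u182D\u1823\u182F"
SUFFIX = "\u1824\u1828"

def with_separator(separator):
    """Build the same word using the given suffix separator."""
    return STEM + separator + SUFFIX

def naive_find(haystack, needle):
    """Plain code point search, as find/replace typically does."""
    return needle in haystack

def fold_separators(text):
    """Map the new separator to NNBSP so both encodings compare equal."""
    return text.replace(NEW_SEPARATOR, NNBSP)

def folded_find(haystack, needle):
    """Search after folding both sides; tools would have to opt in to this."""
    return fold_separators(needle) in fold_separators(haystack)

if __name__ == "__main__":
    old_text = with_separator(NNBSP)
    new_query = with_separator(NEW_SEPARATOR)
    print(naive_find(old_text, new_query))   # False: same word, no match
    print(folded_find(old_text, new_query))  # True, but only with explicit folding
```

The match only succeeds when every search tool folds the two characters explicitly, which existing tools would not do by default.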

All in all, I firmly believe that encoding a new character will create more and worse problems than it solves.

Andrew
Received on Thursday, 30 July 2015 00:09:29 UTC