Re: NNBSP Impact from Badral S. on 2015-07-22 (public-i18n-mongolian@w3.org from July to September 2015)

From: Badral S. <badral@bolorsoft.com>
Date: Wed, 22 Jul 2015 03:04:05 +0200
To: public-i18n-mongolian@w3.org
Message-ID: <55AEEC05.30901@bolorsoft.com>
Hi,
First of all, if we introduce a new character, we need to name it "Mongolian 
Suffix Joiner (with space)", not "separator". This problem is not only for 
font developers, Unicode, or rendering engine vendors. A name containing 
"separator" or "space" confuses developers too much and increases the risk of 
the character being handled as a word boundary. If someone who develops an 
editor or browser does pre- or post-processing on that basis, Mongolian text 
would be corrupted.
In my experience, and according to the response from Unicode, we cannot use 
NNBSP as a Suffix Joiner because it is already classified into a word boundary 
class. I do not understand why it is currently "XX", as Andrew wrote.
Regarding Andrew's suggestion: ZWNJ or ZWJ is definitely not an option to use 
in place of NNBSP. It is simply not imaginable.
In my opinion, if 180F is not classified into any class, then it could 
be easily handled in OTF. But I don't know how hard it would be to sell a new 
character to the UTC or Uniscribe.
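
The word-boundary risk can be reproduced with generic text tooling. Below is a
minimal Python sketch, assuming a naive whitespace tokenizer of the kind many
word counters and indexers use. The stem and suffix strings are illustrative
letter sequences only, and naive_word_count is a hypothetical helper, not the
logic of any particular product.

```python
import unicodedata

NNBSP = "\u202f"  # NARROW NO-BREAK SPACE
ZWNJ = "\u200c"   # ZERO WIDTH NON-JOINER

# Illustrative Mongolian letter sequences (assumed for the example only).
stem = "\u182e\u1823\u1837\u1822"
suffix = "\u1836\u1822\u1828"


def naive_word_count(text: str) -> int:
    """Count words by splitting on Unicode whitespace, as simple tools do."""
    return len(text.split())


# NNBSP has general category Zs (space separator), so Python treats it as
# whitespace and the stem and suffix are counted as two separate words.
print(unicodedata.category(NNBSP), NNBSP.isspace())
print(naive_word_count(stem + NNBSP + suffix))

# ZWNJ is a format character (Cf), not whitespace, so the same split keeps
# the stem and suffix together as one token.
print(unicodedata.category(ZWNJ), ZWNJ.isspace())
print(naive_word_count(stem + ZWNJ + suffix))
```

This only shows the tokenization side of the problem; it says nothing about
width or shaping, which are the reasons given for not using a zero-width
character here.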

Badral

On 19.07.2015 01:59, jrmt@almas.co.jp wrote:
> Hi Martin,
>
> Thank you very much for your detailed advice.
> Yes. We should find clearer reasons why ZWNJ does not work as a Mongolian Suffix Separator,
> and the same for NNBSP.
> Sorry for my brief mail; I am travelling outside the office.
> My colleague Siqin Bilige has already sent out some papers explaining why these two characters do not work as a Mongolian Suffix Separator.
>
> For the ZWNJ,
> 1. ZWNJ is already used in Mongolian to display a word in its isolate forms without a space, and to join two words of a name so that they are handled as one word.
> 2. The Mongolian Suffix Separator needs to be one space wide, although it is not actually a space; ZWNJ itself is zero width.
> 3. ZWNJ does not change the display form of the following character, but the Mongolian Suffix Separator changes the display form of the following Mongolian character.
> 4. We definitely require a character distinct from ZWNJ for the Mongolian Suffix Separator, because one character cannot be used for two different behaviors in rendering logic.
>
> For the NNBSP, Greg has already listed several reasons why another character, U+180F, is needed for the Mongolian Suffix Separator.
> Here we are having a lot of trouble with NNBSP used as the Mongolian Suffix Separator. For example, in addition to Greg's list:
> 1. When we copy Mongolian text from a PDF document, the NNBSP is lost, which is a big problem for Mongolian.
> 2. When we use a full-text search engine, most engines handle the NNBSP as a word separator,
>     but we actually need the suffix kept together with the preceding word.
> 3. Word-count statistics in OpenOffice give numbers different from what we expect.
> 4. We run into many kinds of trouble when processing Mongolian text in Java development.
> 5. Mongolian suffixes using NNBSP are displayed incorrectly in the translation memory system SDL Trados Studio.
>
> Please do not use the ZWNJ or NNBSP for the Mongolian Suffix Separator. We really need a new character for this.
>
> In the other mail from Andrew West,
>> Personally I am not in favour of replacing NNBSP with a new character at this late stage in the game,
>> and I think that it will be a very hard sell to the UTC,
>> who I suspect will be very concerned about destabilizing existing Mongolian data if a new character is introduced.
>> As the issues raised in this discussion about NNBSP do not involve shaping at the rendering level,
>> but are problems related to correctly determining word boundaries by software that processes Mongolian data,
>> in my opinion the best solution would be to modify the word break property of NNBSP (which is currently "XX").
> The problem with NNBSP for Mongolian suffixes is not just a word-wrap problem, as listed above.
> Secondly, if we are concerned about existing Mongolian data, this problem is not so difficult to solve.
> Actually, there is not a large volume of data in this encoding at present, even though the Mongolian block has been included in Unicode since Unicode 3.0 in 2000.
> In my view, introducing one new character will be the more efficient way to stabilize Mongolian in a short period.
>
> Thanks and Best Regards,
>
> Jirimutu
> ==========================================================
> Almas Inc.
> 101-0021 601 Nitto-Bldg, 6-15-11, Soto-Kanda, Chiyoda-ku, Tokyo
> E-Mail: jrmt@almas.co.jp   Mobile : 090-6174-6115
> Phone : 03-5688-2081,   Fax : 03-5688-2082
> http://www.almas.co.jp/   http://www.compiere-japan.com/
> ==========================================================
>
>
> -----Original Message-----
> From: Martin J. Dürst [mailto:duerst@it.aoyama.ac.jp]
> Sent: Friday, July 17, 2015 6:08 PM
> To: jrmt@almas.co.jp; 'Greg Eck'; public-i18n-mongolian@w3.org
> Subject: Re: NNBSP Impact
>
> Hello Jirimutu,
>
> On 2015/07/17 16:05, jrmt@almas.co.jp wrote:
>> Hi Greg and Martin,
>>
>>
>>
>> I would like to remind you that ZWNJ and ZWJ are already used in Mongolian.
>>
>> ZWJ – used to display the Mongolian Init, Medi, and Fina forms separately.
>>
>> ZWNJ – used to display a word in its isolate forms without a space.
> As the text in the Unicode standard explains, ZWNJ and ZWJ are used for this also in Arabic/Persian. They are also used in Persian for separating suffixes (Arabic doesn't have separable suffixes).
>
>
>> For this reason, we should not use ZWNJ for Mongolian Suffix Separator.
> There may be good reasons for not using ZWNJ as a Mongolian Suffix Separator, but the above alone are not convincing (not to me, and most probably also not to the Unicode Technical Committee).
>
> So it would be good to find more clearcut reasons for why ZWNJ doesn't work (or cannot be made to work) as a Mongolian Suffix Separator, or to verify more thoroughly that there are no such reasons.
>
> Regards,  Martin.
>
>
>> Thanks and Best Regards,
>>
>>
>>
>> Jirimutu
>>
>> ==========================================================
>>
>> Almas Inc.
>>
>> 101-0021 601 Nitto-Bldg, 6-15-11, Soto-Kanda, Chiyoda-ku, Tokyo
>>
>> E-Mail: jrmt@almas.co.jp   Mobile : 090-6174-6115
>>
>> Phone : 03-5688-2081,   Fax : 03-5688-2082
>>
>> http://www.almas.co.jp/   http://www.compiere-japan.com/
>>
>> ==========================================================
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> From: Greg Eck [mailto:greck@postone.net]
>> Sent: Thursday, July 16, 2015 1:55 AM
>> To: Martin J. Dürst; public-i18n-mongolian@w3.org
>> Subject: RE: NNBSP Impact
>>
>>
>>
>> Hi Martin,
>>
>> Thank you for your good comments. I have taken some time to review
>> Chapter 23 of the Unicode Standard 7.0 as referenced below. I can see
>> your point somewhat in the possibility of the ZWNJ taking the place of
>> the NNBSP - even though it is a bit non-intuitive. I guess I am
>> against the idea for two reasons. The first is that as the name
>> implies, there is actually to be no space emitted by the rendering
>> system - it is designed to have zero width. However the
>> NNBSP_replacement needs to have space (while at the same time not
>> being space). I say this recognizing the statement that some fonts
>> render the ZWNJ with space. The second reason that I would not go for
>> the idea is that time will probably tell us that we need a character
>> specific to the Mongolian block that we can specifically tailor to the
>> needs of this separation between a STEM+Suffix OR a Suffix+Suffix. If
>> we go for another character that is multi-functional as the ZWNJ is
>> and it fails to serve this new function as a replacement for the NNBSP, then we are in trouble again as we are now. I think we should still call for a completely new character that we can count on for time to come. The MVS was originally created for the sole purpose of separating the stem from the special final A/E. Let's create another sole-purpose character that will do the job specifically of separating the STEM/Suffix and the Suffix/Suffix.
>> Greg
>>
>>
>>
>> I have created a spreadsheet, attached, showing the features of the MVS compared to the NNBSP. The differences between the two characters are highlighted in yellow. As the MVS appears to be doing pretty well in the areas where the NNBSP is deficient, I suggest that we study the MVS features and use them to model the new NNBSP_replacement character. I do not understand all of the features attached to the MVS as listed. Do we have someone who could analyze the differences and start a features list for the new NNBSP_replacement character?
>>
>>
>>
>> Thanks,
>> Greg
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Martin J. Dürst [mailto:duerst@it.aoyama.ac.jp]
>> Sent: Wednesday, July 15, 2015 7:15 PM
>> To: Greg Eck <greck@postone.net>; public-i18n-mongolian@w3.org
>> Subject: Re: NNBSP Impact
>>
>>
>>
>> Hello Greg,
>>
>>
>>
>> On 2015/07/15 11:08, Greg Eck wrote:
>>
>>> Hi Martin,
>>> Thanks for the comment. No one has mentioned the ZWNJ yet. I have found that the ZWNJ is helpful in simulating context in Mongolian examples.
>>
>>
>> Yes, that's one of its two main usages. The other is for suffixes.
>>
>>
>>
>>
>>
>>> But probably not what we need here in the case of glue-ing the suffixes together.
>>
>>
>> I suggest you look at Chapter 9 and Chapter 23.2 of the Unicode Standard.
>>
>>
>>
>> In particular, I found the following text on page 800 of
>>
>>    http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf:
>>
>>
>>
>> Zero-Width Spaces and Joiner Characters. The zero-width spaces are not to be confused with the zero-width joiner characters. U+200C zero width non-joiner and U+200D zero width joiner have no effect on word or line break boundaries, and zero width nobreak space and zero width space have no effect on joining or linking behavior. The zero-width joiner characters should be ignored when determining word or line break boundaries. See “Cursive Connection” later in this section.
>>
>>
>>
>> The "ignore word break" is exactly what you are looking for, as far as I understand. As for line breaks, I have no idea how they work in Mongolian, but if there is something like intra-word linebreaks (with hyphenation or similar or without), then that will be handled by the language-dependent line breaking logic even if the zero-width non-joiner doesn't by default provide a line-break opportunity.
>>
>>
>>
>> I'm not at all an expert for Mongolian, and so I may be missing something. But I think there is a high chance that you will be asked similar questions if you send a formal proposal to the UTC, and so it may be worth a more careful check.
>>
>>
>>
>> One thing I was concerned about in my previous mail is that a "zero width" non-breaking space would not be wide enough (because at least the name suggests that it's smaller than a "narrow" space). However, looking at the examples at the SampleOfDagDeg.pdf document, the 'spaces' between the stem and the suffix seem to be about the same as the 'spaces' where the letters cannot be connected, and would be a font matter anyway, so there shouldn't be any serious problems there.
>>
>>
>>
>> Regards,   Martin.
>>
>>
>>
>>> Greg
>>> -----Original Message-----
>>> From: Martin J. Dürst [mailto:duerst@it.aoyama.ac.jp]
>>> Sent: Wednesday, July 15, 2015 9:38 AM
>>> To: Greg Eck; public-i18n-mongolian@w3.org
>>> Subject: Re: NNBSP Impact
>>> Hello Greg, others,
>>> To me it looks like the situation for Mongolian suffixes is vaguely similar to the situation with Persian suffixes that are written with a slight separation. What is used in Persian is the ZERO WIDTH NON-JOINER (ZWNJ). Although its name includes "zero width", in all the examples I have seen there is actually some white space between the characters, i.e. they are not glued together.
>>> I'm sorry if this has already been considered.
>>> Regards,   Martin.
>>> On 2015/07/15 10:15, Greg Eck wrote:
>>>> I am calling for a new control character to replace the NNBSP (U+202F) for usage specifically in the Mongolian block (U+1800-18AF).
>>>> Given our discussion over the past few weeks, it appears that the NNBSP is too generic to handle the specific needs of the Mongolian script in at least the following areas:
>>>> -          NNBSP (“Narrow Non-Breaking SPace”) actually is a space
>>>> -          The control character needed in the Mongolian Script needs to be a non-space
>>>> -          Word-count utility breaks as a result of the NNBSP presence
>>>> -          Spell-checkers have difficulty parsing as the word breaks upon encountering the NNBSP
>>>> -          Sort routines have the same difficulty
>>>> -          Word-jumping (as with MS Word CTL-RIGHT/LEFT) breaks due to the space feature inherent to the NNBSP
>>>> -          Cannot redefine the NNBSP as it is used as a bona fide space in other languages
>>>> -          Future utilities as yet undefined
>>>> -          Others?
>>>> Means of implementation would be specific to the individual font developers.
>>>> The features of the new character would be very similar to the MVS (U+180E).
>>>> Suggested code-point: U+180F
>>>> Suggested name: Mongolian Suffix Separator (to match the similar name Mongolian Vowel Separator)
>>>> Can I call for individuals to speak up on backing the notion and also for individuals who might not agree with the notion?
>>>> There is a UTC meeting at the end of July – if there is consensus, maybe we could get it on the docket?
>>>> Greg
>>
>
>


-- 
Badral Sanlig, Software architect
www.bolorsoft.com | www.badral.net
Bolorsoft LLC, Selbe Khotkhon 40/4 D2, District 11, Ulaanbaatar
Received on Wednesday, 22 July 2015 01:04:41 UTC