Re: NNBSP Impact from Richard Wordingham on 2015-07-29 (public-i18n-mongolian@w3.org from July to September 2015)

From: Richard Wordingham <richard.wordingham@ntlworld.com>
Date: Wed, 29 Jul 2015 22:32:10 +0100
To: public-i18n-mongolian@w3.org, unicore@unicode.org
Message-ID: <20150729223210.6fc3971e@JRWUBU2>

(I've copied this to the UniCore list in case the discussion moves
from there to the general Unicode list rather than to the
public-i18n-mongolian list.)

Badral S. wrote on Wed, 29 Jul 2015 at 20:50:10 +0900

> I do not know france.
> When france word and mongolian word connected with NNBSP, the NNBSP 
> belong to which one ?  This case exists in Mongolian document like
> mongolian people studing france language. (asume the france languauge
> need NNBSP)

The present word-break property value of NNBSP is "Other".  With this
property, there is a word break on either side of it, so there would
be three items:

1) The French word.
2) The NNBSP - not a word.
3) The Mongolian word.

French can use NNBSP to provide extra spacing between a word and
following punctuation, such as a full stop (.) or a comma (,).  I do not
believe it uses it to separate words.  If French used U+2009 THIN
SPACE instead, a line break could occur before the punctuation,
which would be wrong for French.  Therefore the French, or rather
those of them who care about such small details, have apparently been
using NNBSP.

Now, if the word-break property of NNBSP were given the value it should
have been given in the first place, "MidLetter", we would see the
following word-breaking patterns:

For Mongolian word, NNBSP, Mongolian suffix:

1 word = Mongolian word + NNBSP + Mongolian suffix.

For French word, NNBSP, Mongolian suffix:

1 word = French word + NNBSP + Mongolian suffix.

For French word, NNBSP, comma, there will be three items:

1) The French word.
2) The NNBSP - not a word.
3) Comma - not a word.
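
The patterns above can be reproduced with a rough sketch of the relevant
UAX#29 rules.  This is not the full algorithm: it covers only WB5, WB6,
WB7 and the default "break everywhere" rule, it ignores Extend and Format
characters (so Mongolian free variation selectors are not handled), and it
approximates ALetter with Python's str.isalpha().  The names
word_break_class and segment are made up for illustration, and the
nnbsp_class parameter stands in for the value in the data file.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE


def word_break_class(ch, nnbsp_class):
    """Very rough stand-in for the Word_Break property lookup."""
    if ch == NNBSP:
        return nnbsp_class   # the one value under discussion
    if ch.isalpha():
        return "ALetter"     # approximation, not the real UCD data
    return "Other"           # a comma is really MidNum; same result here


def segment(text, nnbsp_class="Other"):
    """Split text at word boundaries using only WB5, WB6, WB7 and WB999."""
    classes = [word_break_class(ch, nnbsp_class) for ch in text]
    pieces = []
    start = 0
    for i in range(1, len(text)):
        prev, cur = classes[i - 1], classes[i]
        nxt = classes[i + 1] if i + 1 < len(text) else None
        before_prev = classes[i - 2] if i >= 2 else None
        no_break = (
            # WB5: ALetter x ALetter
            (prev == "ALetter" and cur == "ALetter")
            # WB6: ALetter x MidLetter ALetter
            or (prev == "ALetter" and cur == "MidLetter" and nxt == "ALetter")
            # WB7: ALetter MidLetter x ALetter
            or (before_prev == "ALetter" and prev == "MidLetter"
                and cur == "ALetter")
        )
        if not no_break:     # WB999: otherwise break everywhere
            pieces.append(text[start:i])
            start = i
    if text:
        pieces.append(text[start:])
    return pieces


mongolian = "\u182e\u1823\u1829\u182d\u1823\u182f"  # a Mongolian word
suffix = "\u1836\u1822\u1828"                       # a Mongolian suffix
french = "mot"

for value in ("Other", "MidLetter"):
    print(value, len(segment(mongolian + NNBSP + suffix, value)),
          len(segment(french + NNBSP + suffix, value)),
          len(segment(french + NNBSP + ",", value)))
```

Under these assumptions the loop reports three items everywhere for
"Other", and for "MidLetter" one item for each of the two suffix cases
and three items for the French word followed by NNBSP and a comma.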

I believe these are the desired outcomes.  None of the Unicode *rules*
have to change; all that has to change is one of the data files.
(A list in UAX#29 would also be changed for consistency.)  For programs
that use ICU for word-breaking, the change would occur when they update
to the version of ICU released after the change to the Unicode Character
Database (UCD).  What happens for other programs is unpredictable.
They would change only after the UCD changes, and there can
be a long delay.  At least Windows 10 users will not have to upgrade to
another version of Windows.

Richard.

Received on Wednesday, 29 July 2015 22:31:03 UTC