Fwd: Re: ZWNJ and ZWJ being ignorable

From: Felix Sasaki <fsasaki@w3.org>
Date: Tue, 16 Aug 2005 17:32:06 +0900
To: "www-international@w3.org" <www-international@w3.org>, public-iri@w3.org
Message-ID: <op.svldzsmix1753t@ibm-60d333fc0ec>

Hi,

This came up on the Unicode list. I'm not sure if it is an issue for IRI /
the future development of the IRI spec, but I'm bringing it to the
attention of www-international and public-iri. Any comments?

-- Felix

------- Forwarded message -------
From: "Eric Muller" <emuller@adobe.com>
To:
Cc: "UnicoRe Mailing List" <unicore@unicode.org>
Subject: Re: ZWNJ and ZWJ being ignorable
Date: Sun, 14 Aug 2005 10:14:28 +0900

Roozbeh Pournader wrote:

> TUS 4.0, Page 391, paragraph 4, mentions:
>
>        ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are format control
>        characters. Like other such characters, they should be ignored
>        by processes that analyze text content. For example, a
>        spelling-checker or find/replace operation should filter them
>        out. [...]
>
> Fact:
>
> 1. That is *incorrect* for *every* language I know that is written in
> the Arabic script and uses ZWNJ or ZWJ. In all of these languages,
> omitting a ZWNJ or ZWJ, or misplacing them, is often a spelling error

The issue of ignoring the joiners has resurfaced recently in IDN and IRIs.
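
A minimal Python sketch of the kind of filtering the quoted TUS passage
describes, assuming "format control characters" means code points with
general category Cf. The function names are made up for illustration and
do not come from any specification.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def strip_format_controls(text: str) -> str:
    """Drop every character whose general category is Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def find_ignoring_format_controls(haystack: str, needle: str) -> int:
    """Search after filtering both strings; the index refers to the filtered text."""
    return strip_format_controls(haystack).find(strip_format_controls(needle))


# Both joiners are classified as format characters, so this filter removes them.
print(unicodedata.category(ZWNJ), unicodedata.category(ZWJ))
print(find_ignoring_format_controls("ab" + ZWNJ + "cd", "bc"))
```

A process built this way cannot tell a spelling with a joiner from one
without it, which is exactly what the objection above is about.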

Martin Hosken added about Burmese:

> It certainly causes sorting differences in the dreaded Burmese. 10xx 1039
> 200C 101B sorts with the syllable break before 101B, while 10xx 1039 101B
> sorts as a single syllable with any break occurring before 10xx.

Furthermore, Burmese also has:

     Lincoln: <101c, 1004, virama, zwnj, 1000, 1014, virama, zwnj>, where
     the segment <1004, 1039, zwnj> renders as a visible virama above the
     representative glyph for 1004

     bimbo: <1018, 1004, virama, zwnj, 1018, 102d, 102f>, idem

versus

     Bengali: <1018, 1004, virama, 1002, 102c, 101c, 102e>, where the
     segment <1004, virama> renders as the kinzi (epsilon-like) and is
     placed above the rendering of the segment <1002>.

     Tuesday: <1021, 1004, virama, 1002, 102c>, idem

In all these examples, the segment <1004, virama> or <1004, virama,
zwnj> is used to write the same sound "in". Okell ("Burmese, an
introduction to the script", p78):

     When you are taking dictation and come across a word with the rhyme
     [sound "in"], you don't know - unless you already have learned the
     spelling of the word - whether it should be written the full
     [visible virama over 1004] or the reduced [kinzi].

and clearly indicates that it would be considered a spelling error to use
one form for the other. There is no mention of two words which differ
only by [visible virama over 1004] vs. [kinzi], i.e. where the use or
non-use of zwnj is contrastive. Still, registering a domain name with
"lincoln" in it, and having it show up with kinzi (because the zwnj is
ignored), is problematic.
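
The effect on the "Lincoln" spelling can be checked mechanically. The
Python sketch below assumes the encoding used in this message (virama =
U+1039, zwnj = U+200C, kinzi written as <1004, virama> directly before
another character). The names LINCOLN, BENGALI, strip_zwnj and
kinzi_positions are hypothetical, chosen only for this illustration.

```python
VIRAMA = 0x1039
ZWNJ = 0x200C

# Code point sequences as listed in this message.
LINCOLN = [0x101C, 0x1004, VIRAMA, ZWNJ, 0x1000, 0x1014, VIRAMA, ZWNJ]
BENGALI = [0x1018, 0x1004, VIRAMA, 0x1002, 0x102C, 0x101C, 0x102E]


def strip_zwnj(cps):
    """Return the sequence with every ZWNJ removed."""
    return [cp for cp in cps if cp != ZWNJ]


def kinzi_positions(cps):
    """Indices where <1004, virama> is directly followed by a non-ZWNJ character."""
    return [
        i
        for i in range(len(cps) - 2)
        if cps[i] == 0x1004 and cps[i + 1] == VIRAMA and cps[i + 2] != ZWNJ
    ]


print(kinzi_positions(LINCOLN))              # no kinzi while the zwnj is kept
print(kinzi_positions(strip_zwnj(LINCOLN)))  # a kinzi sequence appears once it is dropped
print(kinzi_positions(BENGALI))              # kinzi intended here
```

Under these assumptions, dropping the zwnj turns the visible-virama
spelling into a sequence that a renderer would display with kinzi.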

Eric.


Received on Tuesday, 16 August 2005 08:32:20 GMT