- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 16 Aug 2005 17:32:06 +0900
- To: "www-international@w3.org" <www-international@w3.org>, public-iri@w3.org
- Message-ID: <op.svldzsmix1753t@ibm-60d333fc0ec>
Hi,
This came up on the Unicode list. I'm not sure if it is an issue for IRI /
the future development of the IRI spec, but I'm bringing it to the
attention of www-international and public-iri. Any comments?
-- Felix
------- Forwarded message -------
From: "Eric Muller" <emuller@adobe.com>
To:
Cc: "UnicoRe Mailing List" <unicore@unicode.org>
Subject: Re: ZWNJ and ZWJ being ignorable
Date: Sun, 14 Aug 2005 10:14:28 +0900
Roozbeh Pournader wrote:
> TUS 4.0, Page 391, paragraph 4, mentions:
>
> ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are format control
> characters. Like other such characters, they should be ignored
> by processes that analyze text content. For example, a
> spelling-checker or find/replace operation should filter them
> out. [...]
>
> Fact:
>
> 1. That is *incorrect* for *every* language I know that is written in
> the Arabic script and uses ZWNJ or ZWJ. In all of these languages,
> omitting a ZWNJ or ZWJ, or misplacing them, is often a spelling error
>
>
>
The issue of ignoring the joiners has resurfaced recently in IDN and IRIs.
Martin Hosken added about Burmese:
> It certainly causes sorting differences in the dreaded Burmese. 10xx 1039
> 200C 101B sorts with the syllable break before 101B, while 10xx 1039 101B
> sorts as a single syllable with any break occurring before 10xx.
>
>
Furthermore, Burmese also has:
Lincoln: <101c, 1004, virama, zwnj, 1000, 1014, virama, zwnj>, where
the segment <1004, 1039, zwnj> renders as a visible virama above the
representative glyph for 1004
bimbo: <1018, 1004, virama, zwnj, 1018, 102d, 102f>, idem
versus
Bengali: <1018, 1004, virama, 1002, 102c, 101c, 102e> where the
segment <1004, virama> renders as the kinzi (epsilon like) and is
placed above the rendering of the segment <1002>.
Tuesday: <1021, 1004, virama, 1002, 102c>, idem
In all these examples, the segment <1004, virama> or <1004, virama,
zwnj> is used to write the same sound "in". Okell ("Burmese, an
introduction to the script", p78):
When you are taking dictation and come across a word with the rhyme
[sound "in"], you don't know - unless you already have learned the
spelling of the word - whether it should be written the full
[visible virama over 1004] or the reduced [kinzi].
and clearly indicates that it would be considered a spelling error to use
one form for the other. There is no mention of two words which differ
only by [visible virama over 1004] vs. [kinzi], i.e. where the use or
non-use of zwnj is contrastive. Still, registering a domain name with
"lincoln" in it, and having it show up with kinzi (because the zwnj is
ignored) is problematic.
Eric.
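
The domain-name concern in the forwarded message can be illustrated with a
small Python sketch. It assumes that "virama" in the sequences above is
U+1039 and "zwnj" is U+200C, as the message itself indicates, and it uses a
hypothetical helper, strip_joiners, to stand in for any process that ignores
the joiners. It follows the encoding as described in this 2005 message and is
not a statement about how any particular IDN or IRI implementation behaves.

```python
import unicodedata

VIRAMA = "\u1039"  # "virama" in the sequences quoted in the message
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"     # ZERO WIDTH JOINER


def strip_joiners(text: str) -> str:
    """Hypothetical filter: drop ZWNJ and ZWJ, as an 'ignoring' process would."""
    return "".join(ch for ch in text if ch not in (ZWNJ, ZWJ))


def code_points(text: str) -> list[str]:
    """Render a string as a list of hexadecimal code points for inspection."""
    return [f"{ord(ch):04X}" for ch in text]


# "Lincoln" as given in the message:
# <101c, 1004, virama, zwnj, 1000, 1014, virama, zwnj>
lincoln = "\u101c\u1004" + VIRAMA + ZWNJ + "\u1000\u1014" + VIRAMA + ZWNJ
stripped = strip_joiners(lincoln)

# Both joiners are format characters (general category Cf).
print(unicodedata.category(ZWNJ), unicodedata.category(ZWJ))
print(code_points(lincoln))
print(code_points(stripped))

# The visible-virama segment <1004, virama, zwnj> no longer exists after
# filtering; what remains is <1004, virama>, the segment the message says
# renders as kinzi.
print("\u1004" + VIRAMA + ZWNJ in stripped)
```

After filtering, the string contains <1004, virama> where the original had
<1004, virama, zwnj>, which is the change the message describes as
problematic for a registered name.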
Received on Tuesday, 16 August 2005 08:32:22 UTC