- From: Felix Sasaki <fsasaki@w3.org>
- Date: Tue, 16 Aug 2005 17:32:06 +0900
- To: "www-international@w3.org" <www-international@w3.org>, public-iri@w3.org
- Message-ID: <op.svldzsmix1753t@ibm-60d333fc0ec>
Hi,

This came up on the Unicode list. I'm not sure if it is an issue for IRI / the future development of the IRI spec, but I'm bringing it to the attention of www-international and public-iri. Any comments?

-- Felix

------- Forwarded message -------
From: "Eric Muller" <emuller@adobe.com>
To:
Cc: "UnicoRe Mailing List" <unicore@unicode.org>
Subject: Re: ZWNJ and ZWJ being ignorable
Date: Sun, 14 Aug 2005 10:14:28 +0900

Roozbeh Pournader wrote:

> TUS 4.0, Page 391, paragraph 4, mentions:
>
>    ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER are format control
>    characters. Like other such characters, they should be ignored
>    by processes that analyze text content. For example, a
>    spelling-checker or find/replace operation should filter them
>    out. [...]
>
> Fact:
>
> 1. That is *incorrect* for *every* language I know that is written in
> the Arabic script and uses ZWNJ or ZWJ. In all of these languages,
> omitting a ZWNJ or ZWJ, or misplacing them, is often a spelling error

The issue of ignoring the joiners has resurfaced recently in IDN and IRIs.

Martin Hosken added about Burmese:

> It certainly causes sorting differences in the dreaded Burmese. 10xx 1039
> 200C 101B sorts with the syllable break before 101B, while 10xx 1039 101B
> sorts as a single syllable with any break occurring before 10xx.

Furthermore, Burmese also has:

Lincoln: <101c, 1004, virama, zwnj, 1000, 1014, virama, zwnj>, where the segment <1004, 1039, zwnj> renders as a visible virama above the representative glyph for 1004

bimbo: <1018, 1004, virama, zwnj, 1018, 102d, 102f>, idem

versus

Bengali: <1018, 1004, virama, 1002, 102c, 101c, 102e>, where the segment <1004, virama> renders as the kinzi (epsilon-like) and is placed above the rendering of the segment <1002>

Tuesday: <1021, 1004, virama, 1002, 102c>, idem

In all these examples, the segment <1004, virama> or <1004, virama, zwnj> is used to write the same sound "in".

Okell ("Burmese, an introduction to the script", p78):

   When you are taking dictation and come across a word with the rhyme [sound "in"], you don't know - unless you already have learned the spelling of the word - whether it should be written the full [visible virama over 1004] or the reduced [kinzi].

and clearly indicates that it would be considered a spelling error to use one form for the other. There is no mention of two words which differ only by [visible virama over 1004] vs. [kinzi], i.e. where the use or non-use of zwnj is contrastive.

Still, registering a domain name with "lincoln" in it, and having it show up with kinzi (because the zwnj is ignored), is problematic.

Eric.
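
The "lincoln" concern can be illustrated with a minimal Python sketch. It assumes that "ignoring" the joiners means deleting U+200C and U+200D from the string, and it takes "virama" in the sequences above to be U+1039, as in the <1004, 1039, zwnj> segment. The helper names `strip_joiners` and `has_bare_nga_virama` are hypothetical and exist only for this illustration; this is not how any particular IDN or IRI implementation works.

```python
ZWNJ = "\u200c"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"     # ZERO WIDTH JOINER
VIRAMA = "\u1039"  # assumed code point for "virama" in the sequences above
NGA_VIRAMA = "\u1004" + VIRAMA

# "Lincoln" as spelled in the message:
# <101c, 1004, virama, zwnj, 1000, 1014, virama, zwnj>
lincoln = "\u101c\u1004" + VIRAMA + ZWNJ + "\u1000\u1014" + VIRAMA + ZWNJ


def strip_joiners(text: str) -> str:
    """Delete ZWNJ and ZWJ, as a process that 'ignores' them might."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")


def has_bare_nga_virama(text: str) -> bool:
    """True if some <1004, virama> is not immediately followed by ZWNJ.

    Per the message, that bare segment is the one rendered as kinzi.
    """
    i = text.find(NGA_VIRAMA)
    while i != -1:
        if text[i + 2:i + 3] != ZWNJ:
            return True
        i = text.find(NGA_VIRAMA, i + 1)
    return False


stripped = strip_joiners(lincoln)
print(stripped == lincoln)            # the two strings differ
print(has_bare_nga_virama(lincoln))   # original keeps the ZWNJ after <1004, virama>
print(has_bare_nga_virama(stripped))  # stripped form exposes a bare <1004, virama>
```

After stripping, the string contains the bare <1004, virama> segment that the message describes as rendering with kinzi, so the stored name no longer corresponds to the accepted spelling.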
Attachments
- text/html attachment: attachment275.htm
Received on Tuesday, 16 August 2005 08:32:22 UTC