RE: The fate of Hebrew texts with Hyphen-Minus instead of Maqaf

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Wed, 17 Sep 2003 23:26:02 -0700
To: <bidi@prognathous.mail-central.com>, "Jony Rosenne" <rosennej@qsm.co.il>, "'Mark Davis'" <mark.davis@jtcsv.com>, <www-international@w3.org>
Cc: "Shmuel Yair" <yshmuel@microsoft.com>
Message-ID: <PNEHIBAMBMLHDMJDDFLHAEDNHAAA.aphillips@webmethods.com>

I think you are conflating two different things, input and output, and
treating them as a single algorithm.

Unicode Bidi is concerned only with output--how to correctly display any
given string of Unicode characters. It says absolutely nothing about how
input is to be handled and has nothing whatever to do with keyboard mappings
or input handling. What it does give us is a common, interchangeable basis
on which to interpret any given character sequence for display
directionality. As such, it does imply what the contents of a given string
could or should be in order to achieve a given display effect. There are
various characters, such as the bidi controls, that can be used to
achieve these results. The fact that there exists a sequence of characters
that, when using the Unicode Bidi algorithm, is both logical and displays
correctly implies that the Unicode bidi algorithm is not what needs
changing!
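
To make the "character sequence" point concrete, here is a minimal sketch
that looks up the bidi class of each character in the two spellings under
discussion. It assumes Python's standard unicodedata module; the sample
strings and the helper name bidi_classes are mine, for illustration only.
The class reported for Hyphen-Minus depends on the Unicode version bundled
with your Python, so I make no claim about what it prints there.

```python
import unicodedata

# Characters relevant to this thread (names are from the Unicode charts).
SAMPLES = {
    "HEBREW LETTER BET": "\u05D1",
    "HEBREW PUNCTUATION MAQAF": "\u05BE",
    "HYPHEN-MINUS": "\u002D",
    "DIGIT TWO": "2",
    "RIGHT-TO-LEFT MARK": "\u200F",
}

def bidi_classes(text):
    """Return the bidi class of every character in text, in memory order."""
    return [unicodedata.bidirectional(ch) for ch in text]

for name, ch in SAMPLES.items():
    print(f"U+{ord(ch):04X} {name}: {unicodedata.bidirectional(ch)}")

# The same visible intent, stored two different ways (illustrative strings).
with_maqaf = "\u05D1\u05BE20"   # bet, Maqaf, "20"
with_hyphen = "\u05D1\u002D20"  # bet, Hyphen-Minus, "20"
print(bidi_classes(with_maqaf))
print(bidi_classes(with_hyphen))
```

The display algorithm only ever sees these classes, which is why the
choice of character in the stored string matters more than which key was
pressed.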

It may be the case that the character sequence isn't optimal, but I think
that is a quibble at best. Users generally do not care what the character
sequence in memory is, only about the results on the display (the graphemes).

On the input side, I infer that you expect a one-to-one key-to-character
mapping. This isn't an accurate model, even for English keyboards (consider
typesetter's quotes, alt sequences, and so forth), nor for most Western
European keyboards, let alone for Hebrew. For example, I commonly switch to
the French keyboard to type common Western European diacriticals.

On that layout, Shift+{ (on my US QWERTY keyboard) produces the "dead key"
for umlaut (dieresis), which I then follow with the letter to be modified
(let's say 'u' for now). This doesn't produce a string containing
U+0308 U+0075 (the key sequence, which would, of course, be very wrong).
Neither does it produce U+0075 U+0308 (which would be correct Unicode). It
produces U+00FC.
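
The dead-key case can be sketched in a few lines. This is only an
illustration, assuming Python's standard unicodedata module; the function
name compose_dead_key is hypothetical, and a real keyboard driver need not
work this way internally. It shows that the stored result is neither the
key order nor the decomposed pair, but the single precomposed character.

```python
import unicodedata

def compose_dead_key(dead_key_mark: str, base: str) -> str:
    """Hypothetical input step: the dead key is pressed first, then the
    base letter, but the mark is stored after the base and then composed."""
    decomposed = base + dead_key_mark          # U+0075 U+0308, valid Unicode
    return unicodedata.normalize("NFC", decomposed)

typed = compose_dead_key("\u0308", "u")
print([f"U+{ord(ch):04X}" for ch in typed])

# Going the other way recovers the base letter followed by the mark.
decomposed = unicodedata.normalize("NFD", typed)
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

Two key presses go in, one character comes out, and nobody typing on that
layout needs to know which form ended up in memory.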

In other words: Microsoft could produce a Unicode string containing a
sequence that is correct under Unicode Bidi, given that they already perform
this kind of interpretation internally. I think that was Jony's point. It
isn't good that users must "work around" existing implementations. But the
people to complain to are those who produce the implementations, in my
opinion. What good are
standards if people ignore them?
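
As a rough sketch of what such input-side interpretation could look like:
the rule below (emit Maqaf when the hyphen key directly follows a Hebrew
letter) is purely my assumption for illustration, not a description of any
shipping keyboard layout or of Microsoft's code, and the names
interpret_keystrokes and is_hebrew_letter are hypothetical.

```python
HYPHEN_MINUS = "\u002D"
MAQAF = "\u05BE"

def is_hebrew_letter(ch: str) -> bool:
    """True for the Hebrew letters alef..tav (U+05D0..U+05EA)."""
    return "\u05D0" <= ch <= "\u05EA"

def interpret_keystrokes(keys: str) -> str:
    """Hypothetical input layer: map typed keys to stored characters.

    Assumed rule: a hyphen key pressed directly after a Hebrew letter
    is stored as Maqaf; every other key is stored unchanged.
    """
    out = []
    for ch in keys:
        if ch == HYPHEN_MINUS and out and is_hebrew_letter(out[-1]):
            out.append(MAQAF)
        else:
            out.append(ch)
    return "".join(out)

stored = interpret_keystrokes("\u05D1-20")   # bet, hyphen key, "20"
print([f"U+{ord(ch):04X}" for ch in stored])
```

The user still types letter, hyphen, number in the natural order; only the
stored sequence changes, and the display algorithm is left alone.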

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:aphillips@webmethods.com

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of
> bidi@prognathous.mail-central.com
> Sent: Wednesday, September 17, 2003 4:02 PM
> To: Jony Rosenne; 'Mark Davis'; www-international@w3.org
> Cc: Shmuel Yair
> Subject: RE: The fate of Hebrew texts with Hyphen-Minus instead of Maqaf
>
>
>
> On Wed, 17 Sep 2003 20:44:21 +0200, "Jony Rosenne"
> <rosennej@qsm.co.il> said:
> > These existing texts are the result of a bug in Microsoft software.
> > Microsoft had asked the UTC to change the classification of
> > Hyphen-Minus according to their implementation, and the request was not
> > accepted.
>
> Microsoft's implementation is preferred over the Unicode algorithm for
> the following reasons:
> 1. It keeps the sequence as a whole. "20-", rather than "20 -". The
>    latter form is used by some users to circumvent the UBA mishandling of
>    such sequences. The extra space is against the rules of the Hebrew
>    language. These sequences are supposed to include a Maqaf, not a dash.
> 2. It can be used satisfactorily with all Hebrew keyboard layouts, even
>    ones that don't map the Maqaf, e.g. both the Israeli standard keyboard
>    layout (SI-1452) and the one used in Windows.
> 3. It works with the same logic that people use when writing, i.e. Hebrew
>    letter first, then Minus-Hyphen/Maqaf, and finally the number. Note
>    that with applications that implement the UBA, some people incorrectly
>    type the Minus-Hyphen after the number to keep the correct order.
>    No wonder they all consider *this* behavior to be a bug.
> 4. It can be used with character sets that do not include the Maqaf (such
>    as ISO-8859-8).
> 5. It is easy to use and straightforward. No need to type arcane and
>    hidden control characters.
>
> Bottom line: The way Microsoft handles these sequences is not a bug, it's
> a feature. It works extremely well, and has no drawbacks.
>
> > > To the best of my knowledge, there are no cases in the Hebrew
> > > language where a negative number is preceded by Hebrew letter without
> > > another HyphenMinus/Maqaf in between ("-20"). Since there's no
> > > ambiguity here, it should be very much possible to revise the
> > > algorithm so that it deals with such sequences.
> >
> > So there should be no problem for a text processor to get it right and
> > produce the correct Unicode data stream.
>
> Are you suggesting that the rendering application will do some
> pre-processing and insert control characters? or perhaps that it will
> replace Hyphen-Minus with Maqaf marks? if so, wouldn't you consider such
> pre-processing as part of actual BiDi algorithm?
>
> Moreover, I'm not sure that modifying the original texts without the
> authors' consent is an acceptable solution.
>
> > > > or there are inconsistencies between different usage patterns,
> > >
> > > Which usage patterns exactly? I can't think of one that this revision
> > > will break.
> >
> > I did not see any proposed revision, I only saw a description of the
> > problem.
>
> You mentioned the proposed revision in the first paragraph.
>
> Prog.
Received on Thursday, 18 September 2003 02:27:31 GMT