Re: comments on Character Model for the World Wide Web: String Matching and Searching from Najib Tounsi on 2014-06-20 (www-international@w3.org from April to June 2014)

From: Najib Tounsi <ntounsi@gmail.com>
Date: Fri, 20 Jun 2014 18:34:56 +0100
To: "Phillips, Addison" <addison@lab126.com>, Asmus Freytag <asmusf@ix.netcom.com>, Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <53A470C0.6040702@emi.ac.ma>

On 6/19/14 8:44 PM, Phillips, Addison wrote:
>> On 6/19/2014 11:27 AM, Najib Tounsi wrote:
>>> On 6/19/14 2:51 PM, Matitiahu Allouche wrote:
>>>> 11) In 2.2 table of Compatibility Equivalence, the third example is
>>>> labelled "Cursive forms". I think that this would be better labelled
>>>> "character shapes". Rationale: the example shows various shapes of an
>>>> Arabic letter. But similar examples could be taken from final versus
>>>> non-final shapes of some Hebrew letters, or from the final versus
>>>> non-final shapes of the Greek sigma letter. Hebrew and Greek are not
>>>> cursive scripts, so the issue here is having position-dependent
>>>> shapes, not cursiveness.
>> The Greek final sigma uses a different character code which is not a
>> compatibility equivalent.
>>
>> The reason is that, unlike Arabic positional shaping, the selection of the final
>> form cannot be determined algorithmically at rendering time and would
>> otherwise introduce the need to use ZWNJ with Greek; not a good tradeoff.
>>
>> Whatever example is used needs to be limited to cases of automatic shape
>> selection at rendering.
>>
> Context matters here. The table is not merely one containing characters that use contextual shaping. These are *specifically* characters with compatibility decompositions in Unicode and the table is illustrating the various kinds of compatibility decomposition. I tend to agree with Mati's comment that "cursive forms" is not that accurate a label. In practice only Arabic uses <initial>, <medial>, <final>, and <isolated> decompositions, though, so the other offered examples are not what the table is meant to illustrate. The items in the table are the four compatibility variations of ARABIC LETTER NOON (U+0646).

Actually, you use code points in the "Arabic Presentation Forms-B" block
to get the desired shape (isolated, final, initial and medial) for the
Arabic characters (e.g. U+FEE5, U+FEE6, U+FEE7 and U+FEE8 for ARABIC
LETTER NOON).

But for some characters (beh U+0628, teh U+062A, noon U+0646, …), some
fonts in some browsers don't render any difference between the isolated
and final forms, or between the initial and medial forms.

I suggest using ARABIC LETTER HEH (U+FEE9, U+FEEA, U+FEEB and
U+FEEC), for example, or ARABIC LETTER AIN (U+FEC9, U+FECA, U+FECB and
U+FECC), instead of NOON.
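
For anyone who wants to check the code points quoted above against the
Unicode Character Database, here is a minimal sketch in Python using the
standard unicodedata module. The names FORMS and describe are only
illustrative helpers, not anything taken from the draft or from UAX#15,
and the exact output depends on the Unicode version shipped with your
Python.

```python
import unicodedata

# Code point ranges quoted in this message (Arabic Presentation Forms-B).
# FORMS and describe() are illustrative names, not part of any spec.
FORMS = {
    "NOON": range(0xFEE5, 0xFEE9),
    "HEH": range(0xFEE9, 0xFEED),
    "AIN": range(0xFEC9, 0xFECD),
}

def describe(cp):
    """Return (decomposition tag, base code point) for a presentation form."""
    # decomposition() returns a string such as "<isolated> 0646"
    tag, base = unicodedata.decomposition(chr(cp)).split()
    return tag, int(base, 16)

for letter, cps in FORMS.items():
    for cp in cps:
        tag, base = describe(cp)
        base_name = unicodedata.name(chr(base))
        print(f"U+{cp:04X} {tag:<10} -> U+{base:04X} {base_name}")
```

Each presentation form should report one of the <isolated>, <final>,
<initial> or <medial> tags and map back to the base letter, which is
what the table in 2.2 is illustrating. The choice of letter only changes
how visible the difference is in a given font, not the decomposition.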

> Note that this table is identical to Figure 2 in UAX#15.

It seems they use the Tahoma font, which renders the shapes of
ARABIC LETTER NOON correctly.

Regards, Najib

>
> Addison

Received on Friday, 20 June 2014 17:27:36 UTC