RE: comments on Character Model for the World Wide Web: String Matching and Searching

> Actually, you use codepoints in the zone "Arabic presentation Form B" to have
> the desired shape (isolated, final, initial and medial) for the Arabic characters
> (e.g. U+FEE5, U+FEE6, U+FEE7 and U+FEE8 for the ARABIC LETTER NOON).

The table uses U+FEE5, etc., not to produce specific shapes, but because those code points are compatibility equivalents of U+0646 ARABIC LETTER NOON. If you wrote the sequence U+FEE7.FEE8.FEE6 (initial, medial, and final noon) and next to it the sequence U+0646.0646.0646 (plain old noon), the two sequences should look identical and, once normalized to NFKC or NFKD, they become the identical sequence of code points. The appearance of the two sequences is not the issue.
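
As a rough illustration of that point, here is a minimal Python sketch using the standard `unicodedata` module. The variable and function names are mine, chosen for illustration only. It compares the two sequences raw, under the canonical forms, and under the compatibility forms:

```python
import unicodedata

# Initial, medial, and final presentation forms of noon (Arabic Presentation Forms-B).
presentation = "\uFEE7\uFEE8\uFEE6"
# Three plain ARABIC LETTER NOON code points.
plain = "\u0646\u0646\u0646"


def normalized_forms(s):
    """Return the string in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, s)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


p = normalized_forms(presentation)
q = normalized_forms(plain)

# Code point for code point, the two strings differ.
raw_equal = presentation == plain

# Canonical normalization leaves the presentation forms alone.
canonical_equal = p["NFC"] == q["NFC"] or p["NFD"] == q["NFD"]

# Compatibility normalization maps the presentation forms to U+0646.
compat_equal = p["NFKC"] == q["NFKC"] and p["NFKD"] == q["NFKD"]

print(raw_equal, canonical_equal, compat_equal)  # False False True
```

Only the K forms fold the two sequences together, which is the sense in which they are compatibility (not canonical) equivalents.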

What Charmod is dealing with is that these two elements have different class names, even though you can't see the difference when viewing them as text (pretend the class names contain the actual code points, as the element content does, rather than HTML character escapes):

<p class=&#xFEE7;&#xFEE8;&#xFEE6;> ﻧﻨﻦ</p>
<p class=&#x646;&#x646;&#x646;> ننن</p>
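
To make the mismatch visible, a small sketch with Python's standard `html.parser` can pull out the two class names and list their code points. The `ClassCollector` class and `code_points` helper are hypothetical names of mine. The attribute values are quoted and the element content is replaced with ASCII labels, purely to keep the example unambiguous:

```python
from html.parser import HTMLParser

# Two paragraphs whose class names render alike but use different code points.
MARKUP = (
    '<p class="&#xFEE7;&#xFEE8;&#xFEE6;">presentation forms</p>'
    '<p class="&#x646;&#x646;&#x646;">plain noon</p>'
)


class ClassCollector(HTMLParser):
    """Collect the value of every class attribute, in document order."""

    def __init__(self):
        super().__init__()
        self.class_names = []

    def handle_starttag(self, tag, attrs):
        # The parser has already resolved character references in attribute values.
        for name, value in attrs:
            if name == "class":
                self.class_names.append(value)


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


collector = ClassCollector()
collector.feed(MARKUP)
collector.close()
first, second = collector.class_names

print(code_points(first))   # U+FEE7 U+FEE8 U+FEE6
print(code_points(second))  # U+0646 U+0646 U+0646
print(first == second)      # False
```

Any matcher that compares class names code point by code point, without compatibility normalization, would treat these as two unrelated classes.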

> > Note that this table is identical to Figure 2 in UAX#15.
> 
> Seems they use the Tahoma font. It renders correctly the shapes of the ARABIC
> LETTER NOON
> 
It doesn't actually matter what they look like. It might even help the argument in the surrounding text if some of the code points look identical ;-).

Addison

Received on Friday, 20 June 2014 18:21:08 UTC