Re: [charmod-norm] Arabic & Hebrew unique issues associated with sections 2.4 and 2.5

The code page is not the problem. The problem is the different bidi 
layouts in which data can be stored. Different bidi layouts place 
characters and text segments in different positions, so comparing 
the same data across different bidi layouts will almost certainly 
produce incorrect results. I mentioned both because, for historical 
reasons, particular bidi layouts are associated with EBCDIC code 
pages and legacy systems. In the general case, however, even data 
stored in Unicode can be present in different bidi layouts. Bidi 
layouts and code pages are completely orthogonal concepts (from both 
functional and technical perspectives). 

The suggested textual amendment is as follows:
    Ensure that text being sorted / searched is in the same bidi 
    layout.

Normalization to the same bidi layout is conceptually similar to code 
page conversion. Before you can compare two pieces of text you must 
ensure they are encoded with the same code page (e.g. Unicode). 
Similarly, if you wish to compare two pieces of bidi text, you must 
ensure they are transformed to a common bidi layout. For more 
information on bidi layouts please see: 
http://www.ibm.com/developerworks/websphere/library/techarticles/bidi/bidigen.html
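
To illustrate why the layouts must match before comparison, here is a minimal Python sketch. It assumes only two layouts (logical order, and visual left-to-right order) and a single pure right-to-left run, in which case the visual form is simply the reversed logical form. The names `to_logical` and `same_text` are hypothetical, and real data (mixed directions, numerals, mirrored punctuation) would need a proper implementation of the Unicode Bidirectional Algorithm rather than a plain reversal.

```python
# Hypothetical layout labels for this sketch; not taken from any library.
LOGICAL = "logical"        # characters stored in reading (typing) order
VISUAL_LTR = "visual-ltr"  # characters stored in left-to-right display order


def to_logical(text: str, layout: str) -> str:
    """Convert text to logical order.

    Assumption: `text` is one pure right-to-left run (e.g. Hebrew letters
    only), so visual order is just the reverse of logical order.
    """
    if layout == LOGICAL:
        return text
    if layout == VISUAL_LTR:
        return text[::-1]
    raise ValueError(f"unknown bidi layout: {layout!r}")


def same_text(a: str, a_layout: str, b: str, b_layout: str) -> bool:
    """Compare two strings only after bringing both to a common layout."""
    return to_logical(a, a_layout) == to_logical(b, b_layout)


# The Hebrew word "shalom" in logical order: shin, lamed, vav, final mem.
logical = "\u05e9\u05dc\u05d5\u05dd"
# The same word as a legacy system might store it in visual order.
visual = logical[::-1]

raw_equal = logical == visual  # compares code points position by position
normalized_equal = same_text(logical, LOGICAL, visual, VISUAL_LTR)
```

Both strings are valid Unicode and contain the same code points, yet the raw comparison fails because the characters sit in different positions. Only the comparison made after conversion to a common layout reports them as the same text.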

PS. The encodings you mentioned are relevant for display (browsers 
interpret data in those encodings differently). When we talk about 
search / sort, we are referring to text in storage.


-- 
GitHub Notification of comment by tomerm
Please view or discuss this issue at 
https://github.com/w3c/charmod-norm/issues/80#issuecomment-207942376 
using your GitHub account

Received on Sunday, 10 April 2016 08:19:32 UTC