Re: Unicode Normalization

Hi Benjamin,

On Feb 4, 2009, at 9:17 PM, Benjamin wrote:

> On Wed, Feb 4, 2009 at 3:07 PM, Robert J Burns <rob@robburns.com>  
> wrote:
>
> 〈this string〉
> 〈this string〉
> 〈this string〉
>
>
> When I copied the three strings to a text editor it did not change  
> the characters; they remained the same.

Same for me. These byte-wise distinctions have been preserved on the
round trip from sending the message to the list to getting it back in
replies. But it wouldn't surprise me if John Kew is finding different
results (since normalizing those characters is permitted by Unicode).
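
To make concrete what "permitted by Unicode" means here, below is a
minimal sketch using Python's standard unicodedata module. It takes
U+2329 and U+3008 as an example pair of canonically equivalent angle
brackets; I am not claiming these are exactly the code points in the
strings above, and the helper name survives_nfc is made up for
illustration.

```python
import unicodedata

# An example pair of canonically equivalent angle brackets (assumed for
# illustration; not necessarily the code points in the strings above).
LEFT_POINTING = "\u2329"  # LEFT-POINTING ANGLE BRACKET
CJK_LEFT = "\u3008"       # LEFT ANGLE BRACKET


def survives_nfc(text: str) -> bool:
    """Return True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


# U+2329 is replaced by U+3008 under NFC, so any tool that normalizes
# on the way through will erase the byte-wise distinction.
print(survives_nfc(LEFT_POINTING))
print(survives_nfc(CJK_LEFT))
```

A mail client or archive that normalizes to NFC would therefore hand
back U+3008 in both cases, which is one way two readers could see
different results.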

> Also, I can see a difference between the characters; The two  
> brackets at the top and the one on the bottom left are duller, while  
> the other three are sharper. This difference is apparent in both the  
> browser and the text editor(Not sure if it matters, though).

I would say that is a bug in your font. Fonts, by using separate
glyphs for canonically equivalent characters, contribute to the
confusion authors face when creating content. The glyph distinctions
lead authors to treat the characters as semantically distinct (which
shouldn't happen). Fonts play an important role in this (on par with
input systems) since the fonts control the glyphs used. For example, if
a font draws "½" with the same glyphs it uses for the
compatibility-equivalent sequence "1⁄2", this helps with Unicode
authoring. It is remarkable how few font makers take the minimal amount
of time necessary to do this. This is a similar problem to the
font/glyph issues outlined earlier by Andrew Cunningham with various
African and Eastern languages.
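
The "½" case can be checked the same way. This is only a sketch with
Python's unicodedata module; the variable names are mine. It shows that
the equivalence is a compatibility one: canonical normalization (NFC)
leaves "½" alone, while compatibility normalization (NFKC) expands it
to digit one, fraction slash, digit two.

```python
import unicodedata

half = "\u00bd"        # VULGAR FRACTION ONE HALF
sequence = "1\u20442"  # DIGIT ONE, FRACTION SLASH (U+2044), DIGIT TWO

# NFC applies only canonical mappings, so the single character stays.
nfc = unicodedata.normalize("NFC", half)

# NFKC also applies compatibility mappings, giving the three-character
# sequence.
nfkc = unicodedata.normalize("NFKC", half)

print(nfc == half)
print(nfkc == sequence)
```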

My feeling is that these are the types of things we should NOT be
expecting authors to deal with. The font makers should spend the extra
time to understand these issues and design their fonts accordingly.
The input system software developers should do the same. The parsers
for XML, HTML, CSS, JavaScript and so on should normalize strings so
that authors never have to think about these Unicode issues.
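
As a rough sketch of what I mean by parsers doing the work, here is a
normalization-aware comparison in Python. The function name same_name
is hypothetical, and the choice of NFC (rather than NFKC) is an
assumption on my part: it folds canonical equivalents together but
deliberately leaves compatibility equivalents distinct.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper: a parser could apply something like this when
    matching element names, selectors, or identifiers.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Precomposed "é" versus "e" followed by COMBINING ACUTE ACCENT.
print(same_name("caf\u00e9", "cafe\u0301"))

# Canonically equivalent angle brackets compare equal as well.
print(same_name("\u2329", "\u3008"))
```

With a comparison like this inside the parser, an author who typed
either form would get the same match without knowing the two forms
exist.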

Take care,
Rob

Received on Thursday, 5 February 2009 07:06:45 UTC