Re: Unicode Normalization from Robert J Burns on 2009-02-05 (public-i18n-core@w3.org from January to March 2009)

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 17:51:59 -0600
To: Jonathan Kew <jonathan@jfkew.plus.com>
Cc: Benjamin Blanco <benjo316@gmail.com>, Anne van Kesteren <annevk@opera.com>, Aryeh Gregor <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <74110D88-25FE-4D30-8471-6D73A52FC3E5@robburns.com>

Hi JK,

On Feb 5, 2009, at 4:18 PM, Jonathan Kew wrote:

> On 5 Feb 2009, at 14:02, Benjamin Blanco wrote:
>
>> On Thu, Feb 5, 2009 at 1:06 AM, Robert J Burns <rob@robburns.com>  
>> wrote:
>> Hi Benjamin,
>>
>> On Feb 4, 2009, at 9:17 PM, Benjamin wrote:
>>> Also, I can see a difference between the characters; The two  
>>> brackets at the top and the one on the bottom left are duller,  
>>> while the other three are sharper. This difference is apparent in  
>>> both the browser and the text editor(Not sure if it matters,  
>>> though).
>>
>> I would say that is a bug in your font. Fonts, by using separate  
>> glyphs for canonically equivalent characters, contribute to the  
>> confusion authors face when creating content. The glyph  
>> distinctions lead authors to treat the characters semantically  
>> distinct (which shouldn't happen). Fonts play an important role in  
>> this (on par with input systems) since the fonts control the glyphs  
>> used. For example if a font uses the same glyphs for "½" as the  
>> font maker uses for the compatibility equivalent sequence "1⁄2",  
>> this helps with Unicode authoring. It is remarkable how few font  
>> makers take the minimal amount of time necessary to do this.
>
> Fully comprehending and addressing issues of Unicode-to-glyph  
> mapping, canonical-equivalent sequences and alternatives, etc,  
> requires far from a "minimal amount of time" for font makers.

I'm sorry. I didn't mean to imply that it takes a small amount of work
to understand all of this. Clearly it does not. What I meant was that
once someone has become a font maker (and has therefore necessarily
achieved a certain level of understanding of Unicode and Unicode
imaging), it is a minimal amount of work to check that canonically
equivalent characters (and in some cases compatibility equivalent
characters) share the same rendering (or at least the same rendering
up to a relevant transformation for the compatibility equivalent
characters).
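
To make the distinction between the two kinds of equivalence concrete,
here is a minimal sketch using Python's standard unicodedata module. The
helper name equivalence() is my own invention for illustration, and
comparing NFD and NFKD forms is just one way to run such a check.

```python
import unicodedata

def equivalence(a: str, b: str) -> str:
    """Classify two strings as canonical, compatibility, or not equivalent."""
    # Canonically equivalent strings share the same NFD form.
    if unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b):
        return "canonical"
    # Compatibility equivalent strings share the same NFKD form.
    if unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b):
        return "compatibility"
    return "none"

# U+2329 and U+3008: the two left angle brackets from this thread.
print(equivalence("\u2329", "\u3008"))
# U+00BD (one half) versus "1", U+2044 FRACTION SLASH, "2".
print(equivalence("\u00bd", "1\u20442"))
```

A font maker could run a check like this over each pair of characters the
font covers and then confirm that "canonical" pairs point at the same glyph.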

> Also, most fonts are targeted at a particular market (such as  
> Western Europe), and make no claim to support languages or writing  
> systems outside this area. Even in the non-Latin world, fonts are  
> developed for limited markets; for example, an Arabic-script font  
> might support Arabic, Persian, and Urdu, but not necessarily the  
> Arabic-script orthographies of West African languages. However, as  
> browser developers we are (or should be) aiming to serve a worldwide  
> market, and this does come with additional costs.

Agreed. However, even though fonts necessarily target subsets of the
Unicode repertoire, they should always map their glyphs to every
canonical equivalent of the characters they do support, simply because
it is such a trivial thing to do once everything else about the font
has been completed. It doesn't require any additional glyphs, only a
few bytes added to a glyph mapping table.
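
As a rough sketch of what those few extra mapping entries amount to, the
following Python models a glyph mapping table as a plain dict from code
point to glyph name. The dict, the glyph names, and the function
add_canonical_aliases() are all hypothetical; a real font tool would edit
the font's actual mapping table instead. It only handles singleton
canonical decompositions (one character that decomposes to exactly one
other character), which is the case discussed in this thread.

```python
import unicodedata

def add_canonical_aliases(cmap: dict) -> dict:
    """Return a copy of cmap with singleton canonical equivalents aliased."""
    extended = dict(cmap)
    for cp in range(0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no decomposition or a compatibility one ("<...>").
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) != 1:
            continue  # not a singleton; would need composed glyphs
        target = int(parts[0], 16)
        # Reuse the existing glyph in whichever direction it is missing.
        if target in extended and cp not in extended:
            extended[cp] = extended[target]
        elif cp in extended and target not in extended:
            extended[target] = extended[cp]
    return extended

# Hypothetical font that only maps the CJK angle brackets.
font_cmap = {0x3008: "anglebracketleft", 0x3009: "anglebracketright"}
extended = add_canonical_aliases(font_cmap)
print(hex(0x2329), "->", extended[0x2329])
print(hex(0x232A), "->", extended[0x232A])
```

No new glyph is drawn here; U+2329 and U+232A simply gain entries that point
at the glyphs already designed for U+3008 and U+3009.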

>> This is a similar problem to font/glyph issues outlined earlier by  
>> Andrew Cunningham with various African and Eastern languages.
>>
>> I've tried several different fonts, and they all render the glyphs  
>> differently, despite canonical equivalence.
>
> This is somewhat tangential to the real issue, but FWIW.... I  
> suspect that in most (or perhaps all) cases, what's really happening  
> is that the font you're using does not support the characters U+3008  
> and U+3009, and your software is performing a font fallback and  
> rendering these from its default CJK font instead. So it's not that  
> font developers are providing different glyphs for canonically- 
> equivalent characters, but rather, they are not necessarily  
> supporting the equivalent characters at all.

I hadn't thought of that, but you're probably right. However, this is
either 1) a variation on the same bug I described earlier or 2) a font
that is old and has not yet been updated to support U+3008 and U+3009.
Again, an updated font that supports a particular character should
support all of the canonically equivalent characters for that
character, since doing so does not require producing another glyph,
only adding a mapping from another character (or character sequence)
to an already designed glyph.

But I think you're right that a likely explanation is that what
Benjamin witnessed was caused by an older font rendering the NFC
characters and by a font fallback to a newer font that simply had
different glyphs for the two canonically equivalent characters.
Normalization in the text processor would have avoided this issue as
well.
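
For illustration, here is a small Python sketch of the kind of
normalization step I mean, using the standard unicodedata module. The
function name normalize_for_rendering() is hypothetical, and I am assuming
the text processor normalizes to NFC before handing text to the renderer.

```python
import unicodedata

def normalize_for_rendering(text: str) -> str:
    """Normalize to NFC so canonically equivalent input yields one code point sequence."""
    return unicodedata.normalize("NFC", text)

# U+2329/U+232A and U+3008/U+3009 are canonically equivalent bracket pairs.
a = normalize_for_rendering("\u2329x\u232a")
b = normalize_for_rendering("\u3008x\u3009")

# After normalization both strings are identical, so the renderer looks up
# the same characters in the same font for both.
print(a == b)
print([hex(ord(c)) for c in a])
```

With both inputs reduced to the same code points, a font fallback can no
longer pick different glyphs for the two spellings of the same bracket.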

Take care,
Rob

Received on Thursday, 5 February 2009 23:52:44 UTC