Re: Unicode Normalization from Robert J Burns on 2009-02-06 (public-i18n-core@w3.org from January to March 2009)

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 20:37:22 -0600
To: Andrew Cunningham <andrewc@vicnet.net.au>
Cc: Jonathan Kew <jonathan@jfkew.plus.com>, Benjamin Blanco <benjo316@gmail.com>, Anne van Kesteren <annevk@opera.com>, Aryeh Gregor <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <600A4DED-8B0E-4085-9582-D91C8D6E4C3E@robburns.com>
Hi Andrew,

On Feb 5, 2009, at 6:27 PM, Andrew Cunningham wrote:

>
>
> Robert J Burns wrote:
>>
>>
>> I hadn't thought of that, but you're probably right. However this  
>> is either 1) a variation on the same bug I described earlier or 2)  
>> a font that is old and not yet updated to support U+3008 and
>> U+3009. Again, an updated font, if it supports a particular
>> character, should support all of the canonically equivalent characters
>> for that character since it does not require producing another  
>> glyph, but simply adding a mapping for an already designed glyph to  
>> another character (or character sequence).
>>
> why would U+3008 and U+3009 share the same glyph shape as the  
> canonically equivalent characters? Not sure this is necessary, nor  
> even desirable in many contexts.

It's important to understand what Unicode means by canonically
equivalent, but before I go into that let me first say that even if it
were desirable to develop different glyphs for canonically equivalent
characters (which it definitely is not), it still takes very little
effort for a font maker to include mappings to all canonically
equivalent characters whenever that font maker opts not to provide
another distinct glyph. In other words, font makers should not neglect
to include glyph mappings to all of the canonically equivalent
characters for which they have designed and provided a glyph.
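
To give a rough idea of how little effort that is, here is a minimal
sketch in Python. It assumes a font's character map can be modelled as a
plain dictionary from code point to glyph name; the function name
add_singleton_equivalents and the glyph names are made up for
illustration and do not come from any real font tool. It only handles
singleton canonical decompositions, such as U+2329 and U+232A, which
decompose canonically to U+3008 and U+3009.

```python
import sys
import unicodedata


def add_singleton_equivalents(cmap):
    """Return a copy of cmap that also maps singleton canonical equivalents.

    cmap is a hypothetical {code point: glyph name} dictionary standing in
    for a font's character map.
    """
    extended = dict(cmap)
    for cp in range(sys.maxunicode + 1):
        if cp in extended:
            continue
        decomp = unicodedata.decomposition(chr(cp))
        # An empty string means no decomposition; a leading "<tag>" marks a
        # compatibility decomposition, which is not canonical equivalence.
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 1:
            target = int(parts[0], 16)
            if target in cmap:
                # Reuse the glyph already designed for the equivalent character.
                extended[cp] = cmap[target]
    return extended


# Glyph names here are placeholders, not names from an actual font.
cmap = {0x3008: "angleleft", 0x3009: "angleright"}
extended = add_singleton_equivalents(cmap)
```

No new glyph is drawn; the two existing glyphs simply become reachable
from U+2329 and U+232A as well.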

> But harmonising typographic design within multi script fonts can be  
> problematic at the least. One of the reasons it's better to use
> appropriate fonts for the language and contents of a document.
>
> The shape of each glyph is a design consideration by the font  
> developer based on the context of its usage.
>
> I'd assume the designer would develop the glyph and its metric to  
> suit its usage, and harmonise with the script it is most likely to  
> be used with.

I understand, but it's important to clearly understand what Unicode
means by canonically equivalent characters. These are equivalent in
the sense that they have the same meaning in the text. As I've said
before, the state of normalization and canonical equivalence is a mess,
but there's no reason to continue contributing to the mess. Let's try to
fix it instead of making it worse. When a font designer uses separate
glyphs for canonically equivalent characters it undermines the Unicode
Standard. Rather than using a font (or a parser or another processor)
to undermine the standard, wouldn't it be better for the font maker to
take complaints directly to the Unicode Consortium? Explain to the
Unicode Consortium why these characters shouldn't be treated as
equivalents. Simply using canonically equivalent characters to add
more glyphs to a font is not a good practice to follow at all.

> The characters may be canonically equivalent, but this does not mean  
> that they need to be visually identical or share a glyph.
>
> For instance: a font may use the same glyph for <U+0065 U+0302
> U+0301> and <U+1ebf>. Alternatively it may use different glyphs for
> each. It hinges on the intention of the font's designer and their  
> intended audience and use of the font.

For the non-singleton canonical decompositions (like this one) the
only intent of the designer I can imagine is undermining the Unicode
Standard. If there were some other reason to expose separate glyphs,
there's a Private Use Area that would be appropriate for that.
That character is equivalent to that character sequence in that it is
not supposed to imply (and no user should infer) any separate meaning
from those two different character sequences. Using separate glyphs
undermines that equivalence. Presumably a font could also encode
different glyphs for the following character sequences (each being one
or more characters):

1) Ệ (U+1EC6)
2) Ê (U+00CA) (U+0323)
3) Ẹ (U+1EB8) (U+0302)
4) E (U+0045) (U+0323) (U+0302)
5) E (U+0045) (U+0302) (U+0323)

But this would be an abuse of the Unicode Standard. These are all
canonically equivalent character sequences and should not be used by
font designers to express their intent or target different audiences.
The font designer should either use the Private Use Area for exposing
the glyphs if plain text is important, or use a rich text protocol to
assign different glyphs to different instances of these canonically
equivalent character sequences, but should not map different
representations of canonically equivalent character sequences to
different glyphs.
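
The equivalence of the five sequences listed above can be checked with
Python's standard unicodedata module. This is only an illustrative
sketch; the variable names are my own. Under either NFC or NFD all five
spellings collapse to a single form, which is why plain text gives a
font no basis for telling them apart.

```python
import unicodedata

# The five spellings from the list above, in the same order.
sequences = [
    "\u1EC6",          # 1) precomposed
    "\u00CA\u0323",    # 2) E with circumflex + combining dot below
    "\u1EB8\u0302",    # 3) E with dot below + combining circumflex
    "E\u0323\u0302",   # 4) E + dot below + circumflex
    "E\u0302\u0323",   # 5) E + circumflex + dot below
]

# Collect the distinct results of normalizing every spelling.
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences}
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences}

# Each set holds exactly one string if the spellings are equivalent.
print(len(nfc_forms), len(nfd_forms))
```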

Likewise, authors should not be relying on such character reordering
and canonically equivalent substitution to express different meaning
or different visual effects. This too is an abuse of the Unicode
Standard. However, I feel that authors have a better excuse than font
makers for not understanding the Unicode Standard. If the font
doesn't display different glyphs, the author won't be lured into such
shenanigans.

> But then a well designed font (intended for generic use of Latin  
> script languages) will have more than one glyph available for the  
> character <U+1ebf>.

That's fine, as long as we understand that such alternate glyphs are  
only accessible through non-plaintext protocols and font designers  
should not abuse the Unicode Standard to make them accessible in plain  
text.

Take care,
Rob
Received on Friday, 6 February 2009 02:38:06 UTC