Re: abbreviation exposition and pronunciation from Robert Burns on 2007-08-20 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Mon, 20 Aug 2007 04:02:32 -0500
To: Karl Dubost <karl@w3.org>
Cc: Matthew Raymond <mattraymond@earthlink.net>, "Gregory J. Rosmaita" <oedipus@hicom.net>, public-html@w3.org
Message-Id: <220D4562-FCAD-420B-A2FC-833F92D8C54A@robburns.com>
Hi Karl (et al),

On Aug 19, 2007, at 5:48 PM, Karl Dubost wrote:

> Matthew Raymond (19 août 2007 - 17:09) :
>> | <acronym title="[...]">XHTML</abbr>TM
>>
>>    Not to mention that "TM" is itself an abbreviation, and should be
>> marked up as such.
>
> and a character in Unicode.: ™
> I wonder how screen readers handle the Trademark character?
> Someone could test?

VoiceOver just read it as "trademark sign". However, when I placed
the &trade; character entity reference into an HTML file and loaded
it in Safari, it said "trademark symbol" for some strange reason (the
character entity reference &trade; is supposed to map to the same
Unicode character).
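
As a quick sanity check of that mapping, here is a minimal sketch using only the Python standard library (this says nothing about what VoiceOver or Safari do internally; the variable names are just for illustration). It decodes the named reference and a numeric reference and compares both against U+2122:

```python
import html
import unicodedata

# Decode the named reference and the equivalent numeric reference.
named = html.unescape("&trade;")
numeric = html.unescape("&#x2122;")

# Both should be the single code point U+2122.
print(named == numeric == "\u2122")

# The formal Unicode character name of that code point.
print(unicodedata.name(named))
```

If both references decode to the same code point, any difference in the spoken text would have to come from somewhere other than the entity mapping itself.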

One concern I have about this, however, is that in some Unicode
circles the concepts of canonical and compatibility decomposition are
overly promoted. Under compatibility decomposition, the view is
that something like ™ would be better authored as T + M, with
superscript and other formatting applied outside the scope of Unicode. In
general, I think the concepts of canonical and compatibility
decomposition are ill-conceived. Recently Unicode has begun a better
approach through a new generalized "character folding" algorithm[1].
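
To make the ™ case concrete, the following is a small sketch (Python standard library only, variable names chosen for illustration) that looks up the decomposition data for U+2122 and applies two of the standard normalization forms to it. It illustrates the existing decomposition mappings, not the character folding draft in [1]:

```python
import unicodedata

tm = "\u2122"  # TRADE MARK SIGN

# Raw decomposition field from the Unicode Character Database.
decomposition = unicodedata.decomposition(tm)

# Canonical normalization leaves the character alone ...
nfc = unicodedata.normalize("NFC", tm)

# ... while compatibility normalization replaces it with two letters,
# discarding the superscript hint carried by the "<super>" tag.
nfkc = unicodedata.normalize("NFKC", tm)

print(decomposition)
print(nfc == tm)
print(nfkc)
```

The compatibility form yields plain letters; whatever superscript formatting is wanted would then have to be supplied by markup or styling.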

To me this should replace almost every use of canonical and
compatibility decompositions. Authors/users will never understand the
intricacies of the Unicode standard (nor the confusion that leaks
out of it due to the various lingering disagreements)
well enough to know not to use characters that could
become unified by misguided applications using canonical and
compatibility decomposition mappings (especially for singleton
characters). Consider the example of an author using roman numerals
for historical documentation. A renegade application might carelessly
translate the roman numerals to the Latin letters sharing similar
glyphs, thus eliminating actual semantic distinctions. There is
therefore a risk of lossy data handling if application developers
or spec writers treat characters related by these mappings as
indistinguishable. If authors use different characters, then they
often meant to use different characters.
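
The loss described above can be sketched in a few lines of Python (standard library only). The helper name `folded` and the choice of sample characters are assumptions made for this illustration: ROMAN NUMERAL FOUR stands in for the roman numeral case, and OHM SIGN is one example of a singleton canonical mapping.

```python
import unicodedata

# Illustrative samples: a roman numeral (compatibility mapping)
# and a singleton character (canonical mapping).
samples = {
    "ROMAN NUMERAL FOUR": "\u2163",
    "OHM SIGN": "\u2126",
}

def folded(ch, form):
    """Return ch after applying the given Unicode normalization form."""
    return unicodedata.normalize(form, ch)

for name, ch in samples.items():
    for form in ("NFC", "NFKC"):
        out = folded(ch, form)
        status = "kept" if out == ch else "replaced by " + ascii(out)
        print(f"{name} under {form}: {status}")
```

Once an application stores the normalized text, nothing in the data says which character the author originally typed, so the distinction cannot be recovered afterwards.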

Also for many of the compatibility decomposition mappings, the  
standard again is ambivalent: struggling internally between two  
disparate approaches (composition and decomposition). This  
ambivalence leads to ambiguity in how to use characters. Though I  
would say that the compatibility decomposable characters located in  
the compatibility decomposition area (U+F900 – U+FFFE) of the Basic  
Multilingual Plane (BMP) are properly discouraged (not quite  
deprecated in Unicode lingo) and prohibited from XML NCNames too.

So back to the ™ grapheme. The trademark symbol falls in the symbol
area of the BMP. In its cooperation with ISO, Unicode found itself
assigning all sorts of code points to symbol characters, but one gets
the feeling that's not what Unicode wanted to be doing. Instead it
wanted to provide a good mapping of all of the World's writing
systems within the BMP. As it turned out, tens of thousands of
characters were included in the BMP's 65,000 capacity that were there
simply for compatibility or for reasons other than the World's writing
systems. Unicode seems to be somewhat concerned at the virtually
boundless number of symbols that could be encoded compared to the
fairly bounded number of characters needed for the World's writing
systems. In my view, however, symbols like ™ are important
enough to deserve their own code point in Unicode. The ™ symbol could
even receive language-specific presentational glyphs if necessary,
while remaining semantically the same character. While the BMP
includes over 3,000 symbols, Unicode could easily afford to devote an
entire Supplementary Symbolic Plane of 65,000 characters to just
symbols. That would probably meet much of the World's symbolic needs.

Sorry for the tangential email. A few points to consider for HTML5
though. 1) Stay away from the compatibility area for our NCNames :-)
(OK, there probably wasn't a danger of that). 2) Be wary of those
pushing canonical and compatibility decompositions: especially for
singleton characters outside the compatibility area, but also for
compatibility decompositions like ™ -> T + M + formatting. Discussions
like this have surfaced from time to time surrounding HTML.

Take care,
Rob

[1]: <http://www.unicode.org/reports/tr30/>
Received on Monday, 20 August 2007 09:02:46 UTC