- From: Robert Burns <rob@robburns.com>
- Date: Mon, 20 Aug 2007 04:02:32 -0500
- To: Karl Dubost <karl@w3.org>
- Cc: Matthew Raymond <mattraymond@earthlink.net>, "Gregory J. Rosmaita" <oedipus@hicom.net>, public-html@w3.org
Hi Karl (et al),

On Aug 19, 2007, at 5:48 PM, Karl Dubost wrote:

> Matthew Raymond (19 août 2007 - 17:09) :
>> | <acronym title="[...]">XHTML</abbr>TM
>>
>> Not to mention that "TM" is itself an abbreviation, and should be
>> marked up as such.
>
> and a character in Unicode.: ™
> I wonder how screen readers handle the Trademark character?
> Someone could test?

VoiceOver just read it as "trademark sign". However, when I placed the &trade; character entity reference into an HTML file and loaded it in Safari, it said "trademark symbol" for some strange reason (the character entity reference &trade; is supposed to map to the same Unicode character).

One concern I have about this, however, is that in some Unicode circles the concept of canonical and compatibility decomposition is overly promoted. With the compatibility decomposition, the view is that something like ™ would be better authored as T + M, using superscript and other formatting outside the scope of Unicode.

In general, I think the concepts of canonical and compatibility decomposition are ill-conceived. Recently Unicode has begun a better approach through a new generalized "character folding" algorithm[1]. To me this should replace almost every concept of canonical and compatibility decomposition. Authors/users will never understand the intricacies of the Unicode standard (nor the confusion that leaks out of those standards due to the various lingering disagreements) well enough to know which characters to avoid because they could become unified by misguided applications using canonical and compatibility decomposition mappings (especially for singleton characters).

Consider the example of an author using roman numerals for historical documentation. A renegade application might carelessly translate the roman numerals to the Latin letters sharing similar glyphs, thus eliminating actual semantic distinctions.
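
As a rough illustration of the mappings in question, here is a minimal Python sketch. It assumes only the standard-library unicodedata module; the helper name fold_compat and the sample characters (U+2163 ROMAN NUMERAL FOUR, U+212B ANGSTROM SIGN) are chosen for illustration and do not come from the thread.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Apply compatibility normalization (NFKC). Hypothetical helper name."""
    return unicodedata.normalize("NFKC", text)


TRADE_MARK = "\u2122"  # TRADE MARK SIGN
ROMAN_FOUR = "\u2163"  # ROMAN NUMERAL FOUR
ANGSTROM = "\u212B"    # ANGSTROM SIGN, a canonical singleton

# The raw decomposition mapping recorded in the Unicode Character Database.
trademark_mapping = unicodedata.decomposition(TRADE_MARK)

# Compatibility normalization replaces each symbol with plain Latin letters.
folded_trademark = fold_compat(TRADE_MARK)
folded_roman = fold_compat(ROMAN_FOUR)

# Canonical normalization alone already replaces a singleton character.
folded_angstrom = unicodedata.normalize("NFC", ANGSTROM)

print(trademark_mapping)           # <super> 0054 004D
print(folded_trademark)            # TM
print(folded_roman)                # IV
print(folded_angstrom == "\u00C5") # True
```

After any of these normalizations the original code point cannot be recovered from the text, which is the loss of distinction described above.
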
There is therefore a risk of lossy data handling if application developers or spec writers treat canonically mapped characters as indistinguishable. If authors use different characters, then they often meant to use different characters.

Also, for many of the compatibility decomposition mappings, the standard again is ambivalent: struggling internally between two disparate approaches (composition and decomposition). This ambivalence leads to ambiguity in how to use characters. Though I would say that the compatibility decomposable characters located in the compatibility decomposition area (U+F900 – U+FFFE) of the Basic Multilingual Plane (BMP) are properly discouraged (not quite deprecated in Unicode lingo) and prohibited from XML NCNames too.

So back to the ™ grapheme. The trademark symbol falls in the symbol area of the BMP. In its cooperation with ISO, Unicode found itself assigning all sorts of code points to symbol characters, but one gets the feeling that's not what Unicode wanted to be doing. Instead it wanted to provide a good mapping of all of the World's writing systems within the BMP. Instead, tens of thousands of characters were included in the BMP's 65,000 capacity that were simply there for compatibility or for reasons other than the World's writing systems. Unicode seems to be somewhat concerned at the virtually boundless number of symbols that could be encoded compared to the fairly bounded number of characters needed for the World's writing systems.

In my view, however, symbols like ™ are important enough to deserve their own code point in Unicode. The ™ symbol could even receive language-specific presentational glyphs if necessary, while remaining semantically the same character. While the BMP includes over 3,000 symbols, Unicode could easily afford to devote an entire Supplementary Symbolic Plane of 65,000 characters to just symbols. That would probably meet much of the World's symbolic needs.

Sorry for the tangential email.
A few points to consider for HTML5 though.

1) Stay away from the compatibility area for our NCNames :-) (OK, there probably wasn't a danger of that).
2) Be wary of those pushing canonical and compatibility decompositions: especially for singleton characters outside the compatibility area, but also for compatibility decompositions like ™ -> T + M + formatting.

Discussions like this have surfaced from time to time surrounding HTML.

Take care,
Rob

[1]: <http://www.unicode.org/reports/tr30/>
Received on Monday, 20 August 2007 09:02:46 UTC