Re: update to xml entities draft

On 6/23/2015 3:20 PM, Murray Sargent wrote:
> David Carlisle wrote that one could make definitions like
>
>> U+2102 DOUBLE-STRUCK CAPITAL C = Complex numbers
>>
>> Leaving U+1D53A free to be defined as a part of a generic alphabetic run as
>> MATHEMATICAL DOUBLE-STRUCK CAPITAL C
> One can't change the definitions of the math alphanumerics now since they are already encoded and Unicode has a stability guarantee. In addition they are widely used in technical documents as defined. We might have been able to get away with such definitions before the math alphanumerics were added to the Unicode Standard 3.1 back in March, 2001.  For Microsoft Office apps, I wrote routines to work around the separation of the math alphabetics into the Letterlike Symbols and math alphanumerics blocks and it's complicated and even error-prone. So I really wish that we had done something along the lines David suggests. But it's clearly water over the dam at this point.
>
> +Asmus and Michel in case they want to defend Unicode's position of not duplicating characters.

Someone may have to forward this reply to the list.
> I'd argue that simplicity of implementation should play an important role in this regard. This isn't the only place where Unicode is over unified. But these complications do provide ways to keep programmers employed <grin>.

Unicode generally does not encode characters by usage. For example, 
there's no distinction between period, decimal point, abbreviation point, 
etc. This reflects the underlying situation, to wit, that this is a 
case of the *same* symbol being used in different conventions.

The downside is that it is thus not possible to use plain text to 
capture which convention is intended (but nothing prevents anyone from 
providing rich-text markup). The upside is that data can't exhibit 
"random alternation" between identical looking symbols; experience has 
shown that this is a most likely outcome if "the same" item is encoded 
several times, based merely on convention.

In the existing case, when U+2102 and friends were encoded in the 
Letterlike Symbols block, they were clearly intended as a subset of the 
double-struck alphabet. The fact that some of the conventional meanings 
for characters from this subset are annotated in the names list does not 
detract from that.

It took a few versions of Unicode to better understand the best way to 
encode symbols and alphabets used for math. The unfortunate side effect 
of that is that the math alphabets are not sequential but have "holes".
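
To make the "holes" concrete, here is a minimal Python sketch of the kind 
of lookup an implementation ends up needing for the double-struck 
capitals. The function name and table are made up for illustration; only 
the code points come from the standard.

```python
import unicodedata

# Double-struck capitals that were encoded early, in Letterlike Symbols.
# Their slots in the Mathematical Alphanumeric Symbols run are reserved.
LETTERLIKE_DOUBLE_STRUCK = {
    "C": "\u2102", "H": "\u210D", "N": "\u2115", "P": "\u2119",
    "Q": "\u211A", "R": "\u211D", "Z": "\u2124",
}

BASE = 0x1D538  # MATHEMATICAL DOUBLE-STRUCK CAPITAL A


def double_struck_capital(letter):
    """Map 'A'..'Z' to its double-struck form, stepping around the holes."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        raise ValueError("expected a single letter A-Z")
    if letter in LETTERLIKE_DOUBLE_STRUCK:
        return LETTERLIKE_DOUBLE_STRUCK[letter]
    return chr(BASE + ord(letter) - ord("A"))


if __name__ == "__main__":
    for ch in "ABCD":
        ds = double_struck_capital(ch)
        print(ch, f"U+{ord(ds):04X}", unicodedata.name(ds))
```

Without the exception table, 'C' would land on U+1D53A, which is a 
reserved code point, not a character.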

In some cases, Unicode apparently does encode convention, for example 
the micro vs. Greek mu, or Ohm vs. Greek Omega. These have complicated 
histories. The desire to preserve the Latin-1 layout as an aid in 
migration overrode the normal reluctance to code by convention. The 
downside is that users will now alternate at random between the two code 
points for the mu used as a micro sign. Greek users will most certainly 
not use the Latin-1 code point for that purpose.
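
As an illustration of what that alternation means for data, the two 
spellings of "5 µm" below compare unequal until they are folded together. 
This sketch uses compatibility normalization (NFKC) from Python's 
standard library as one possible fold; that choice is mine and not 
something prescribed in this thread, and NFKC also folds many other 
compatibility characters.

```python
import unicodedata

MICRO_SIGN = "\u00B5"  # MICRO SIGN, inherited from Latin-1
GREEK_MU = "\u03BC"    # GREEK SMALL LETTER MU


def fold_compat(text):
    """Fold compatibility variants (including the micro sign) via NFKC."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    a = "5 " + MICRO_SIGN + "m"
    b = "5 " + GREEK_MU + "m"
    print(a == b)                            # raw strings differ
    print(fold_compat(a) == fold_compat(b))  # equal once folded
```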

Some of the letterlike symbols should not have been coded in that block, 
but in the Squared abbreviations block. That is because their origin was 
fundamentally the special em-square set of units used in Asian 
standards.  In the early versions of Unicode, there was this idea of 
filtering out from such sets any symbol that might be used 
"generically", that is, outside an Asian typographic environment.

For most such usages, the standard Latin (or Greek, for Ohm) letters 
would have been the correct characters, leaving Kelvin, Ohm, and 
Angstrom as specifically "squared" characters.
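
For what it's worth, the standard itself now points the same way: the 
Kelvin, Ohm and Angstrom signs carry canonical decompositions to the 
ordinary letters, so even NFC replaces them. A small sketch (the helper 
name is only for illustration):

```python
import unicodedata

# Unit signs and the ordinary letters they are canonically equivalent to.
UNIT_SIGNS = {
    "\u212A": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u2126": "\u03A9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212B": "\u00C5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}


def fold_unit_signs(text):
    """NFC maps the three unit signs onto their letter equivalents."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for sign, letter in UNIT_SIGNS.items():
        print(unicodedata.name(sign), "->", unicodedata.name(fold_unit_signs(sign)))
```

The micro sign is different: its mapping to Greek mu is only a 
compatibility one, so NFC leaves it alone.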

So, while there are exceptions to what has by now become the principle 
for new encodings, I would not call the treatment of math alphabets 
"over unified". It is rather an attempt to not needlessly repeat the 
"underunification" of micro, Kelvin, Ohm and Angstrom.

A./

Received on Friday, 26 June 2015 21:36:07 UTC