numerals for microformats from Robert Burns on 2007-07-11 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Tue, 10 Jul 2007 22:37:21 -0500
To: HTML Working Group <public-html@w3.org>
Message-Id: <46549559-08E5-4E5B-9914-3F0BCC549E36@robburns.com>
On Jul 10, 2007, at 7:59 AM, Robert Burns wrote:
> On Jul 10, 2007, at 7:40 AM, Geoffrey Sneddon wrote:
>> A string is a valid ratio if it consists of either one or more  
>> characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE  
>> (9) followed by a denominator punctuation character (see table  
>> below), or two valid unsigned integers separated by one or more  
>> characters in Unicode character class Zs.
>> ]]
>
> The Unicode character class should probably be Unicode general  
> category. I'm not familiar with the phrase "character class" in  
> that context. Also for internationalization reasons, we should  
> probably include all digits in Unicode: everything in the Nd  
> (Number, decimal digit) category. This includes the characters  
> listed already, but it also includes the various Indic-Arabic  
> character variants in many other scripts. Those variants have the  
> same relevant properties (e.g., numeric value) as the ASCII decimal  
> digits.

This post led to confusion on IRC, so let me clarify. Since the
draft currently supports several internationalized character variants
for the denominator punctuation character, my suggestion was for the
draft to also support the character variants of the ASCII-range
Indic-Arabic numerals (I think there are around 23-29 sets of them,
depending on how you count them). These all share the same
properties as the ASCII-encoded numerals, but are expected to
display a glyph more appropriate for the resident script (why
Unicode did this I have no idea). I would add that I think this would
probably apply equally to the other number data types in the
microformats section (not just ratios).

These characters are all listed as general category Nd. These are the  
same Indic-Arabic numerals that you're all used to using (nothing to  
be frightened by). They share the same numeric property values as the  
ASCII digits (0-9). They combine to form ratios and decimal numerals  
in the same manner as the familiar ASCII digits. And again, they're  
all clearly marked with numeric values and general category  
properties, so they should not be difficult to support in a proper  
Unicode implementation.
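
To illustrate the point about numeric values, here is a minimal Python sketch (my own illustration, not text from the draft) that converts a run of Nd characters to an integer. The function name nd_digits_to_int is hypothetical; unicodedata.category and unicodedata.decimal come from the Python standard library.

```python
import unicodedata

def nd_digits_to_int(text):
    """Convert a run of Unicode Nd characters to an integer.

    Hypothetical helper for illustration only.
    """
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # Accept only general category Nd (Number, decimal digit).
        if unicodedata.category(ch) != "Nd":
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Every Nd character carries a decimal value from 0 to 9.
        value = value * 10 + unicodedata.decimal(ch)
    return value

print(nd_digits_to_int("42"))            # ASCII digits
print(nd_digits_to_int("\u0664\u0662"))  # ARABIC-INDIC DIGIT FOUR, TWO
print(nd_digits_to_int("\u096A\u0968"))  # DEVANAGARI DIGIT FOUR, TWO
```

All three calls go through the same code path; no per-script table is needed beyond the Unicode Character Database itself.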

On the IRC channel, straw-man arguments were made about other types of
numerals (duodecimal Quenya and Roman numerals). Other
(non-Indic-Arabic) numerals in Unicode all have general categories
other than 'Nd'. For example, 'Nl' refers to numerals such as
Hangzhou, Roman and Greek. These do not share the same semantics as
the Arabic-Indic numerals, and it would be difficult (though not
prohibitively so, particularly for the Hangzhou) to add support for
them. 'No' refers to other numerals and mostly consists of vulgar
fractions (compatibility characters) and stylized Arabic-Indic
numerals (such as circled numerals and dingbat numerals).

Other ideograph numeral characters are included in the Unicode Han  
database and have several properties to indicate their numeral  
status. These too would be more difficult to incorporate into an  
implementation.
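
The category distinctions above can be checked directly against the Unicode Character Database. The following is a small Python sketch; the sample characters and the describe helper are my own choices for illustration, not anything from the draft.

```python
import unicodedata

SAMPLES = [
    "\u0034",  # DIGIT FOUR
    "\u0664",  # ARABIC-INDIC DIGIT FOUR
    "\u2163",  # ROMAN NUMERAL FOUR
    "\u3024",  # HANGZHOU NUMERAL FOUR
    "\u00BD",  # VULGAR FRACTION ONE HALF
    "\u2463",  # CIRCLED DIGIT FOUR
    "\u56DB",  # CJK ideograph meaning "four"
]

def describe(ch):
    """Return (general category, decimal digit value or None).

    Hypothetical helper; the second item is None when the character
    has no decimal digit value in the UCD.
    """
    return unicodedata.category(ch), unicodedata.decimal(ch, None)

for ch in SAMPLES:
    category, digit = describe(ch)
    print(f"U+{ord(ch):04X} category={category} decimal={digit}")
```

Only the first two samples report category Nd with a decimal value; the rest fall under Nl, No, or (for the ideograph) a letter category, which is why they would need separate handling.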

The characters I was referring to are simply the Arabic-Indic  
variants of the ASCII encoded numerals (0-9). We could limit that to  
Nd category characters in the BMP, if that would make implementation  
simpler. However, the numeric value to associate with these numerals
can be read right off the character's UCD numeric property.
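
As a rough sketch of what that could look like for the "two unsigned integers separated by Zs characters" form of a ratio, here is some Python. It assumes the BMP restriction suggested above and does not cover the denominator punctuation form; the names is_bmp_nd and parse_ratio are hypothetical, and mixing digits from different scripts within one number is allowed here only for simplicity.

```python
import unicodedata

def is_bmp_nd(ch):
    """True if ch is a category Nd character inside the BMP."""
    return ord(ch) <= 0xFFFF and unicodedata.category(ch) == "Nd"

def parse_ratio(text):
    """Parse '<digits><one or more Zs><digits>' into a pair of ints.

    Hypothetical helper for illustration only.
    """
    numbers = []
    in_number = False
    for ch in text:
        if is_bmp_nd(ch):
            if not in_number:
                numbers.append(0)
                in_number = True
            # Read the digit value straight from the UCD.
            numbers[-1] = numbers[-1] * 10 + unicodedata.decimal(ch)
        elif unicodedata.category(ch) == "Zs" and numbers:
            # A space separator ends the current number.
            in_number = False
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    # Require exactly two numbers and no trailing separator.
    if len(numbers) != 2 or not in_number:
        raise ValueError("expected two integers separated by spaces")
    return numbers[0], numbers[1]

print(parse_ratio("3 4"))
print(parse_ratio("\u0663\u00A0\u0664"))  # Arabic-Indic 3, NBSP, 4
```

Dropping the code point check in is_bmp_nd would extend the same logic to Nd characters outside the BMP.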

Take care,
Rob-
Received on Wednesday, 11 July 2007 03:37:32 UTC