- From: Robert Burns <rob@robburns.com>
- Date: Tue, 10 Jul 2007 22:37:21 -0500
- To: HTML Working Group <public-html@w3.org>
On Jul 10, 2007, at 7:59 AM, Robert Burns wrote:

> On Jul 10, 2007, at 7:40 AM, Geoffrey Sneddon wrote:
>> A string is a valid ratio if it consists of either one of more
>> characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE
>> (9) followed by a denominator punctuation character (see table
>> below), or two valid unsigned integers separated by one or more
>> characters in Unicode character class Zs.
>> ]]
>
> The Unicode character class should probably be Unicode general
> category. I'm not familiar with the phrase "character class" in
> that context. Also for internationalization reasons, we should
> probably include all digits in Unicode: everything in the Nd
> (Number, decimal digit) category. This includes the characters
> listed already, but it also includes the various Indic-Arabic
> character variants in many other scripts. Those variants have the
> same relevant properties (e.g., numeric value) as the ASCII decimal
> digits.

This post led to confusion on IRC, so let me clarify. Since the draft currently supports several internationalized character variants for the denominator punctuation character, my suggestion was for the draft to also support the character variants of the (ASCII region) Indic-Arabic numerals (I think there are around 23-29 sets of them, depending on how you count them). These all share the same properties as the ASCII-encoded numerals, but are expected to display a glyph more appropriate for the resident script (why Unicode did this I have no idea). I would add that I think this would probably apply equally to the other number data types in the microformats section (not just ratios).

These characters are all listed as general category Nd. These are the same Indic-Arabic numerals that you're all used to using (nothing to be frightened by). They share the same numeric property values as the ASCII digits (0-9). They combine to form ratios and decimal numerals in the same manner as the familiar ASCII digits.
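As a rough illustration only (not draft text), here is a minimal Python sketch of the "two valid unsigned integers separated by one or more Zs characters" form, with digits widened to general category Nd. The function names `parse_unsigned_integer` and `parse_ratio` are hypothetical. The sketch leaves out the denominator punctuation form, since the draft's table is not reproduced here, and it assumes that digits from different scripts may be mixed within one integer, which the draft does not address.

```python
import unicodedata


def parse_unsigned_integer(s):
    """Return the value of a non-empty run of Nd digits, or None."""
    if not s:
        return None
    value = 0
    for ch in s:
        if unicodedata.category(ch) != "Nd":
            return None
        # Every Nd character carries a decimal value 0-9 in the UCD.
        value = value * 10 + unicodedata.decimal(ch)
    return value


def parse_ratio(s):
    """Return (numerator, denominator) for '<digits><Zs>+<digits>', else None."""
    # Locate the first space separator (general category Zs).
    first = next(
        (i for i, ch in enumerate(s) if unicodedata.category(ch) == "Zs"),
        None,
    )
    if first is None:
        return None
    # Skip the whole run of Zs characters.
    end = first
    while end < len(s) and unicodedata.category(s[end]) == "Zs":
        end += 1
    numerator = parse_unsigned_integer(s[:first])
    denominator = parse_unsigned_integer(s[end:])
    if numerator is None or denominator is None:
        return None
    return numerator, denominator
```

For example, `parse_ratio("3 4")` and `parse_ratio("\u0663\u00a0\u0664")` (Arabic-Indic digits three and four around a no-break space) both yield `(3, 4)`, while `parse_unsigned_integer("\u2162")` (ROMAN NUMERAL THREE, category Nl) yields `None`.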
And again, they're all clearly marked with numeric values and general category properties, so they should not be difficult to support in a proper Unicode implementation.

On the IRC channel, straw-man arguments were made about other types of numerals (duodecimal Quenya and Roman numerals). Other (non-Indic-Arabic) numerals in Unicode all have general categories other than 'Nd'. For example, 'Nl' refers to numerals such as Hangzhou, Roman and Greek. These do not share the same semantics as the Arabic-Indic numerals, and adding support for them would be difficult, though not prohibitively so (particularly for the Hangzhou numerals).

'No' refers to other numerals and mostly consists of vulgar fractions (compatibility characters) and stylized Arabic-Indic numerals (such as circled numerals and dingbat numerals). Other ideograph numeral characters are included in the Unicode Han database and have several properties to indicate their numeral status. These too would be more difficult to incorporate into an implementation.

The characters I was referring to are simply the Arabic-Indic variants of the ASCII-encoded numerals (0-9). We could limit that to Nd category characters in the BMP, if that would make implementation simpler. However, the numeric value to associate with each of these numerals can be read right off the character's UCD numeric property.

Take care,
Rob
Received on Wednesday, 11 July 2007 03:37:32 UTC