Re: Armenian numbering: findings, recommendations and request to CSS from Robert J Burns on 2009-02-13 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Fri, 13 Feb 2009 17:25:12 -0600
To: W3C Style List <www-style@w3.org>
Cc: Leif Halvard Silli <lhs@malform.no>, fantasai <fantasai.lists@inkedblade.net>
Message-Id: <7259A07D-BA9E-42D4-924E-506AFD4F646B@robburns.com>
Hi Leif and fantasai,

Leif wrote:
> fantasai 2009-02-13 20.32:
>>  Aryeh Gregor wrote:
>>
>>>
>>>  Also, pragmatically, it would be very cumbersome to add enumeration of
>>>  all an alphabet's letters for every language people can think up.
>>>  You'd have to have a different list-style-type for most languages --
>>>  even Latin-based alphabets differ on what they think the exact set of
>>>  letters is, and what their order is.  It seems like this would greatly
>>>  bloat the spec.
>>>
>>  Yeah, I think if we're going down that route we should define keywords
>>  for the most commonly-used alphabetic orders, and introduce a functional
>>  notation for everything else. How often do we need, e.g. upper-norwegian,
>>  given that lists are usually less than 26 letters?
>>
>>  alpha("a-z")
>>  alpha("a-f,q-z")
>>  alpha("do,re,mi,fa,so,la,ti")
>>
>>>
>>
> Do you use 'alpha' for "latin alphabet"? Or could alpha be used
> for Cyrillic as well? If you are taking your pattern from the way
> RegEx/GREP is working, then remember that e.g. \p{Armenian}
> matches any character in the Armenian block.[1]
>
> Hence e.g.
>    alpha(armenian)
> could also be useful.
>>
>>  Unicode can fill in ranges, so unless there are a lot of scripts like
>>  Ethiopic, where every language seems to have picked its own order for
>>  the letters, this doesn't have to be that painful.
>>
>>
> Let's take one example: Slovak alphabet, about which Wikipedia
> says: "The lexicographic ordering of the Slovak alphabet is very
> similar to that of English": [2]
>
>    alpha("a-d,dz,e-h,ch,i-z")
>
> And there are several such alphabets.[3] It can be complicated.
> But on the whole, what you propose here would be very good to
> have. I would much rather see this implemented across UAs than
> e.g. "upper-norwegian". (Although I also hope that we can get more
> good keywords.)
>
> Btw, why did you pick "alpha"? Why not "numb"? Or do you think
> that e.g. pure symbols should be excluded or have another name?
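
To make the quoted alpha("...") notation concrete, here is a minimal
Python sketch of how such an argument might be expanded into a list of
counter symbols. The function name expand_alpha, the comma and hyphen
syntax rules, and the choice to fill ranges by Unicode code point are
all assumptions on my part; nothing like this is specified anywhere.

```python
def expand_alpha(spec):
    """Expand a hypothetical alpha() argument into counter symbols.

    Assumed syntax: comma-separated tokens. A token of the form "x-y"
    with single characters on both sides is a code point range; any
    other token (e.g. "dz", "ch", "do") is taken as one symbol.
    """
    symbols = []
    for token in spec.split(","):
        token = token.strip()
        parts = token.split("-")
        if len(parts) == 2 and all(len(p) == 1 for p in parts):
            first, last = parts
            if ord(first) > ord(last):
                raise ValueError("descending range: " + token)
            # Fill the range by code point order (an assumption).
            symbols.extend(chr(cp) for cp in range(ord(first), ord(last) + 1))
        elif token:
            symbols.append(token)
    return symbols


print(expand_alpha("a-f,q-z"))
print(expand_alpha("do,re,mi,fa,so,la,ti"))
print(expand_alpha("a-d,dz,e-h,ch,i-z"))
```

Filling ranges by code point only works where the code point order
happens to match the alphabet's order, which is why digraphs such as
"dz" and "ch" have to be listed explicitly in the Slovak example.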

I think Unicode provides a lot of useful abstractions for putting
something like this together. However, I'm not clear on how you, Leif,
are using "number" and "alpha" here either. My understanding is that
alpha, or using letters as an enumeration system, is not treating them
as numbers (though perhaps loosely, since they're enumerating), but
still as letters. From what I can tell, however, Armenian is using
letters as numerals (as Roman numerals do). And since Unicode uses
"number" to categorize specifically graphemes used as numerals (7,
0, ↀ), I think that is a useful distinction to follow. Also, since
Unicode provides language-specific (not merely script-specific)
collations, I don't even think the Ethiopic case should be
particularly troublesome here.
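
To illustrate the distinction between letters used as an enumeration
and letters used as numerals, here is a rough Python sketch. It assumes
the traditional Armenian numeral values (the first 36 capital letters,
U+0531 to U+0554, standing for 1-9, 10-90, 100-900 and 1000-9000) and a
plain "spreadsheet column" style of lettered counting. The function
names are mine, and this is not taken from any spec text.

```python
# First 36 Armenian capital letters, U+0531 (AYB) to U+0554 (KEH).
ARMENIAN = [chr(0x0531 + i) for i in range(36)]


def alphabetic(n, symbols):
    """Lettered enumeration: 1 -> first letter, then wrap to two letters."""
    out = []
    while n > 0:
        n -= 1
        out.append(symbols[n % len(symbols)])
        n //= len(symbols)
    return "".join(reversed(out))


def armenian_numeral(n):
    """Additive numeral: one letter per non-zero decimal digit."""
    if not 1 <= n <= 9999:
        raise ValueError("only 1..9999 can be written with 36 letters")
    out = []
    place = 0
    while n > 0:
        n, digit = divmod(n, 10)
        if digit:
            # Letter index: 9 letters per decimal place, digit 1..9.
            out.append(ARMENIAN[place * 9 + digit - 1])
        place += 1
    return "".join(reversed(out))


for n in (1, 10, 11, 2009):
    print(n, alphabetic(n, ARMENIAN), armenian_numeral(n))
```

The two systems agree for 1 through 10 and then diverge: item 11 is the
eleventh letter in the lettered scheme but "ten plus one" (two letters)
in the numeral scheme.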

In terms of Unicode abstractions I think what we're looking for is:

1) the designation of a script (e.g., Latin or Ethiopic or Armenian)
and the letters in that script (general categories "Lu", "Ll" and
"Lo", leaving out "Lm" and "Lt" since those are not really of interest
here).

2) the more focussed designation of a language, which would limit the
script to specific letters (through CSS-provided criteria) and also
provide a collation from the Unicode collation algorithm collations
(the Unicode collation alone includes other characters in the set, not
just letters or letters specific to that language).
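
As a rough sketch of point 1, the following Python uses the standard
unicodedata module to pick out the letters of a script by general
category. Python's standard library does not expose the Unicode Script
property, so matching on the character name prefix within a block range
is a stand-in assumption here, and script_letters is a hypothetical
helper name.

```python
import unicodedata

# General categories of interest; "Lm" and "Lt" are deliberately left out.
LETTER_CATEGORIES = {"Lu", "Ll", "Lo"}


def script_letters(name_prefix, first, last):
    """Collect letters in [first, last] whose character name starts with
    name_prefix (an approximation of the Unicode Script property)."""
    letters = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, "")  # "" for unassigned code points
        if name.startswith(name_prefix) and unicodedata.category(ch) in LETTER_CATEGORIES:
            letters.append(ch)
    return letters


armenian = script_letters("ARMENIAN", 0x0530, 0x058F)
upper = [ch for ch in armenian if unicodedata.category(ch) == "Lu"]
print("".join(upper))
```

The category filter drops the Armenian modifier letter (U+0559, "Lm")
and the punctuation in the block, leaving only the cased letters.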

So that means all CSS needs to do is limit the specific letters used
in an alphabetic (or more precisely, in Unicode terms, a lettered)
enumeration. Unicode provides the rest. However, as Leif suggested
before, some general naming scheme might be needed to express this in
a functional form as fantasai suggested (something to serve as the
argument or arguments for an "alpha" or "lettered" function). So, for
example, lettered(latn-no) or alpha(latn-no) could indicate the Latin
script limited to Norwegian letters and sorted according to the
Unicode Norwegian collation. The same thing could be accomplished for
any Ethiopic-based language (unless there's something else I'm missing
there).
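
A minimal Python sketch of what lettered(latn-no) might resolve to
follows. The LETTERED table, the tag spelling and both function names
are hypothetical. A real implementation would presumably take the order
from a Unicode collation tailoring rather than from a hard-coded
string; the string below is simply the 29-letter Norwegian alphabet.

```python
# Hypothetical registry: tag -> the language's letters in its own order.
LETTERED = {
    "latn-no": "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5",  # ... z, ae, o-slash, a-ring
}


def lettered(tag, n):
    """Return the marker for list item n (1-based) in the given alphabet."""
    symbols = LETTERED[tag]
    out = []
    while n > 0:
        n -= 1
        out.append(symbols[n % len(symbols)])
        n //= len(symbols)
    return "".join(reversed(out))


def collation_key(tag):
    """Build a sort key that ranks letters by the language's order."""
    rank = {ch: i for i, ch in enumerate(LETTERED[tag])}
    return lambda word: [rank[ch] for ch in word]


print([lettered("latn-no", n) for n in (1, 26, 27, 28, 29, 30)])
# Code point order versus the Norwegian order for the last three letters:
print(sorted("\u00e6\u00f8\u00e5"))
print(sorted("\u00e6\u00f8\u00e5", key=collation_key("latn-no")))
```

Sorting by code point puts a-ring first, whereas the Norwegian order
puts it last, which is the reason a language-specific collation (and
not just a script) has to be named.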

The interesting part, I guess, would be to see which languages fall
outside this abstraction and need further tailoring or their own
approach. However, lettered enumerations seem fundamentally different
from the Roman numeral system (and, it sounds like, also the Armenian
numeral system), though I imagine that Armenian (like Latin) would
also benefit from lettered enumerations. Perhaps I'm the only one
confusing the two here, but I'm having trouble following the
discussion.

Take care,
Rob
Received on Friday, 13 February 2009 23:25:51 UTC