Re: [css3-speech] voice-family from Alan Gresley on 2011-08-02 (www-style@w3.org from August 2011)

From: Alan Gresley <alan@css-class.com>
Date: Tue, 02 Aug 2011 19:16:19 +1000
To: fantasai <fantasai.lists@inkedblade.net>
CC: Daniel Weck <daniel.weck@gmail.com>, www style <www-style@w3.org>
Message-ID: <4E37C063.4070204@css-class.com>
On 2/08/2011 3:20 AM, fantasai wrote:
> On 08/01/2011 09:40 AM, Daniel Weck wrote:
>>
>> On 20 Jul 2011, at 23:00, fantasai wrote:
>>
>>> On 07/06/2011 12:54 PM, Daniel Weck wrote:
>>>> Please have a look at the updated prose:
>>>>
>>>> http://dev.w3.org/csswg/css3-speech/#voice-props-voice-family
>>>
>>> I think my concern here is that using numerical ages gives a level of
>>> precision in specifying that is nowhere near the level of precision
>>> in voice matching. For example, at what numerical age does a male voice
>>> break?
>>>
>>> I think for this level it might make sense to revert back to keywords
>>> (which we can define as a specific numeric age for mapping to SSML),
>>> and introduce more fine-grained control later when the voice-matching
>>> algorithm is precise enough to support that.
>>
>> I agree that we should avoid using prose that appears to claim a level
>> of precision that we are effectively unable to provide. I propose the
>> following prose instead:
>>
>> ---
>> Possible values are 'child', 'young' and 'old', indicating the preferred
>> age category to match during voice selection. The mapping with SSML ages
>> is defined as follows: 'child' = up to 15 y/o, 'young' = between 16 and
>> 45 y/o, 'old' = 46 y/o onwards.
>> NOTE: The interpretation of the relationship between a person's age and
>> a recognizable type of voice cannot realistically be defined in a
>> universal
>> manner, as it effectively depends on numerous cultural and linguistic
>> variations. The values provided by this specification therefore represent
>> a simplified model that can be reasonably applied to a great variety of
>> speech locales, albeit at the cost of a certain degree of approximation.
>> Future versions of this specification may refine the level of precision
>> of the voice-matching algorithm, as speech processor implementations
>> become more standardized.
>> ---
>
> How about just mapping the keywords to specific numbers, and letting the
> voice-matching algorithm figure out the slack?
> 'child' = 6 years old
> 'young' = 24 years old
> 'old' = 75 years old
> or somesuch
>
> ~fantasai


It's not that easy. A child's voice generally has a higher frequency, and
someone with a hearing impairment may not be able to hear high
frequencies as well as they hear lower frequencies.

This is something I mentioned just today in another list message [1],
where I wrote:

   | What is needed is something that plays sound at ever
   | increasing levels until a level is reach that is
   | desirable. This would have to be done over different
   | octaves.


Please take a look at this article.

http://en.wikipedia.org/wiki/Piano_key_frequencies


You would begin at A0 (27.5Hz) and end with A7 (3520Hz) and test the 
full range by octaves.
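
As a rough illustration of the test frequencies (a minimal sketch only;
the function name is hypothetical and assumes standard piano tuning with
A4 = 440Hz), each octave doubles the previous frequency:

```python
# Hypothetical helper, for illustration only; not part of any spec.
A0_HZ = 27.5  # frequency of A0 on a standard piano (A4 = 440 Hz)


def octave_test_frequencies(lowest_hz=A0_HZ, octaves=7):
    """Return one test frequency per octave, starting at lowest_hz.

    Each step up an octave doubles the frequency, so the default
    arguments give A0 through A7.
    """
    return [lowest_hz * 2 ** n for n in range(octaves + 1)]


for n, hz in enumerate(octave_test_frequencies()):
    print(f"A{n}: {hz} Hz")
```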

By user feedback, a dynamic range of hearing levels can be established 
for each octave.
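
The "ever increasing levels" check could be sketched roughly as below.
This is only an assumption of how such a check might work: the dB
values, the step size and the can_hear callback (which stands in for
real user feedback) are all made up for illustration.

```python
def find_threshold(can_hear, start_db=0, step_db=5, max_db=90):
    """Raise the level in steps until the listener reports hearing it.

    can_hear is a hypothetical callback standing in for user feedback:
    it takes a level in dB and returns True once the tone is audible.
    Returns the first audible level, or None if max_db is exceeded.
    """
    level = start_db
    while level <= max_db:
        if can_hear(level):
            return level
        level += step_db
    return None


# Simulated listener who hears the tone once it reaches 35 dB.
print(find_threshold(lambda level: level >= 35))
```

Running this once per octave would give the per-octave levels described
above.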


A person with good hearing might have a dynamic range like this:


High                 A3       A4
               A2                     A5
          A1                               A6
Low  A0                                       A7


A person with poor hearing might have a dynamic range like this:


High
 

          A1   A2     A3
Low  A0                      A4
                                     A5
                                           A6
                                               A7


Note that in the latter example, A5, A6 and A7 are below Low. In effect,
these frequencies are not audible, but a simple hearing aid may extend
this range. My father cannot hear D5, but with a hearing aid his range is
extended by more than an octave. Even with a hearing aid, a note
like D7 played on a piano is harder to hear, since the sound of the
key going down is louder than the note produced.

After a user does a sound check (I'm very serious here) over a range of
frequencies from A0 (27.5Hz) to A7 (3520Hz), equalization can be applied
to the various voice pitches so that the resulting dynamic range is
similar to that of someone with good hearing.
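
One possible way to turn the sound-check results into equalization is
sketched below. All of it is assumed for illustration: the threshold
numbers are invented, the function name is hypothetical, and a real
implementation would need proper audiometric data.

```python
def equalization_gains(user_db, reference_db, max_boost_db=30):
    """Return a per-octave boost (in dB) toward a reference listener.

    user_db and reference_db map note names to the quietest audible
    level found in the sound check. A user value of None means the note
    was never heard, so no boost can be computed for it.
    """
    gains = {}
    for note, ref in reference_db.items():
        user = user_db.get(note)
        if user is None:
            gains[note] = None  # inaudible at every tested level
        else:
            # Boost only where the user needs more level, capped for safety.
            gains[note] = min(max(0, user - ref), max_boost_db)
    return gains


# Invented example values, not measurements.
reference = {"A3": 10, "A4": 10, "A5": 15, "A6": 20}
user = {"A3": 15, "A4": 30, "A5": 60, "A6": None}
print(equalization_gains(user, reference))
```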


[1] http://lists.w3.org/Archives/Public/www-style/2011Aug/0034.html



-- 
Alan Gresley
http://css-3d.org/
http://css-class.com/
Received on Tuesday, 2 August 2011 09:16:47 UTC