[css3-speech] voice-family selection and language

The SSML spec gives an algorithm for selecting voice families:
   http://www.w3.org/TR/speech-synthesis/#edef_voice

This algorithm is approximated in the CSS3 Speech spec for 'voice-family':
   http://dev.w3.org/csswg/css3-speech/#voice-family

  # The ‘voice-family’ property is used to guide the selection of the voice to be
  # used for speech synthesis. The overriding priority is to match the language
  # specified by the xml:lang attribute as per the XML 1.0 specification [XML10],
  # and as inherited by nested elements until overridden by a further xml:lang
  # attribute.
  #
  # If there is no voice available for the requested value of xml:lang, the
  # processor should select a voice that is closest to the requested language
  # (e.g. a variant or dialect of the same language). If there are multiple
  # such voices available, the processor should use a voice that best matches
  # the values provided with the ‘voice-volume’ property. It is an error if
  # there are no such matches.

First, the prose here needs tightening. Copying the list structure
from SSML is probably a good idea.

Second, CSS doesn't use xml:lang directly, since CSS (unlike SSML) is not an
XML language. The spec should refer to "the language of the element" instead.
Looking that up is an abstract operation; the closest thing we have to a
definition is in Selectors Level 3:
   http://www.w3.org/TR/css3-selectors/#lang-pseudo
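
As a rough illustration of what "the language of the element" looks like
from the CSS side: the :lang() pseudo-class matches on that abstract
language, however the document language happens to declare it. The
'voice-family' values below are placeholders I picked for the example,
not a recommendation.

```css
/* :lang() matches on the element's language as determined by the
   document language (xml:lang, HTML lang, etc.), not on one specific
   attribute. The voice values are illustrative only. */
:lang(en) { voice-family: female; }
:lang(fr) { voice-family: male; }
```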

Third, the SSML algorithm is somewhat imprecise about what "best matches"
means. We either need a definition here, or we need a note that this is
undefined.


Lastly, we need to figure out, for CSS, when the voice family is recalculated.
In SSML, it's recalculated on every element, which means that if an element
has a different language value than its parent, the voice family changes. The
SSML spec notes that this is not always desirable (e.g. a French phrase
embedded in an English sentence) and in such cases suggests that the xml:lang
attribute not indicate the language of the foreign phrase, thus avoiding the
recalculation.

This isn't particularly practical in CSS. We don't actually want to discourage
people from marking up their documents correctly, even if many don't bother,
and messing with the markup to change the speech rendering interferes with the
separation of content and style.

Probably the simplest solution would be to add a 'match-parent' keyword to
'voice-family'. The computed value would be the inherited value with the
'match-parent' keyword added, and the voice selection would not be
recalculated for that element even if its language differs from its parent's.
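
A sketch of how an author might use this, assuming the keyword can be
written on its own as the specified value (the exact grammar is not
worked out here, and 'match-parent' is not in any draft). The selector
and voice values are made up for the example.

```css
/* Hypothetical syntax: 'match-parent' is only a proposal. */
p { voice-family: female; }

/* A French element directly inside English content keeps its correct
   language markup, but is spoken with the voice already selected for
   its parent instead of triggering a new voice selection. */
:lang(en) > :lang(fr) { voice-family: match-parent; }
```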

We could also consider something similar to the 'font-language-override'
property in CSS3 Fonts, e.g.

   voice-language: auto | <language-code> | inherit;
   inherited: yes
   computed value: as specified

   auto -
     The used value is taken from the language of the element, or some
     UA-chosen value if unknown. (The computed value is the keyword 'auto'.)
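
For comparison, a sketch of the same French-in-English case using this
property. 'voice-language' is hypothetical, and I'm assuming here that
<language-code> is written as a bare identifier; it could just as well
be a string, as in 'font-language-override'.

```css
/* Hypothetical: 'voice-language' is only the alternative sketched
   above and is not in any draft. */

/* Default: the voice language follows the element's language. */
:root { voice-language: auto; }

/* Select the voice as if the phrase were English, without touching
   the markup. The property inherits, so descendants of the phrase
   are treated the same way. */
:lang(en) > :lang(fr) { voice-language: en; }
```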

I'm somewhat less in favor of this option, as
   a) 'match-parent' seems easier to use (imho)
   b) 'match-parent' is just a keyword instead of an additional property
   c) you can do more intelligent things with 'match-parent' if you have the
      ability. E.g., use French pronunciation rules to map the embedded
      phrase to the closest English phonemes, so "à propos" could be
      rendered as "ah pro-POE" instead of "a PROP-uss".
But it's something to consider.

~fantasai

Received on Monday, 2 May 2011 18:53:34 UTC