Re: Maybe Why Speech Synthesisers Are So Difficult To Get Used To from Al Gilman on 2004-12-28 (wai-xtech@w3.org from December 2004)

From: Al Gilman <Alfred.S.Gilman@IEEE.org>
Date: Tue, 28 Dec 2004 13:57:36 -0500
To: "Will Pearson" <will-pearson@tiscali.co.uk>, <uvip@yahoogroups.com>, <wai-xtech@w3.org>
Message-Id: <p06110419bdf7554f2821@[10.0.1.2]>
[on the process]

Please pick one of these lists for follow-up. We try to discourage
cross-posting, and I don't yet see an overriding reason to keep the
thread cross-list.

[on the product]

The conventional wisdom, as I understand it, is that Perfect Paul and
friends actually do *better* than the average native speaker in
reproducing well-formed and discriminable phonemes. The speech is
unnatural, but not under-performing at the phoneme reproduction
level. Rather than poor phonemes, what makes it hard to follow the
robot speech at first is the absence of some other stuff:
higher-level effects such as prosody. This modulation exposes
patterns and properties at a far higher level of aggregation
(sentence, paragraph) than the phonemes. [Gestalt: context affects
the perception of detail]

But the usability issues surrounding the training effects are very
real. In screen reader use the speech rate is controlled by the
user's preference settings, not by the web page author. Being
embedded in a pull protocol makes it easy to adapt the speech rate to
each user. Audio descriptions that have to keep up with real-time
events don't have so easy a time of it.

See also UAAG
http://www.w3.org/TR/UAAG10/guidelines.html#tech-configure-speech-rate

It is very important that speech rate be user-configurable because of
the tension between two factors:

1. Time is scarce. The aphorism is that "time is money."

2. As you pointed out, comprehension breaks down above some threshold
speed, but that speed varies considerably among users.

The flip side of the usability equation is the calculus of markets.
Realism beyond Perfect Paul is of limited interest to the
professional who uses their screen reader day in and day out in a
computer-centric desk job. It becomes attractive to strip gratuitous
fonts, bells, and whistles and just motor through the text at a great
rate. On the other hand, in the multimodal kiosk market, sounding
more like a real person may mean the difference between user
acceptance and user rejection, because these users have to succeed
immediately in a casual encounter, and they have more ready access to
competing options.

Al

At 4:18 PM +0000 12/28/04, Will Pearson wrote:
>Hi;
>
>I've recently been reading a paper that was published in the on-line 
>journal, Nature Neuroscience.  The subject of the paper was early 
>language acquisition, and the cognitive and neurological processes 
>that go on.  The review of the paper is at:
><http://www.nature.com/cgi-taf/Dynapage.taf?file=/nrn/journal/v5/n11/abs/nrn1533_fs.html>
>Unfortunately, you need a subscription to read the full article.
>
>One of the conclusions was that we become conditioned to 
>discriminate amongst the phonemes used in speech, based on the 
>phonemes we are exposed to at an early age.  It is still possible to 
>learn to discriminate amongst phonemic blocks later in life, but 
>this task is harder for adults than it is for infants.
>
>Because we are used to discriminating between the phonemes of our
>native languages, and because learning to discriminate between
>different phonemes becomes increasingly difficult later in life,
>foreign language acquisition is harder for adults than for
>infants.  The task is not just to become conditioned to associating
>meaning with the different word sounds, but also to discriminate 
>between the different phonemic blocks that a foreign language may 
>use.  For this reason, foreign languages are easier to learn if the 
>speech rate of the speaker is slowed down, making discrimination 
>between the phonemes easier.
>
>This process of slowing down speech also occurs for new users of a 
>TTS synthesiser, or users new to a different TTS synthesiser.  One
>likely cause is that TTS synthesisers haven't correctly
>replicated the phonemic blocks that we are used to in our native 
>languages.  Therefore, we have to learn to discriminate between 
>phonemic blocks that are slightly different to those that we are 
>used to.  Once we have learnt to distinguish between the different 
>phonemes, it is then possible to listen to synthetic speech at quite 
>high rates.
>
>This hypothesis, if true, poses several usability issues for certain 
>groups.  Firstly, the transfer of information for all TTS users is 
>going to be significantly slower than human to human speech for a 
>while, whilst the user's brain maps to the new phonemic blocks. 
>Secondly, this remapping is based on exposure times.  Those that are 
>exposed to a TTS synthesiser more frequently and for longer 
>durations will likely distinguish between the different phonemes 
>quicker than those who only use a TTS engine infrequently, a 
>possible problem for TTS use by sighted, but "eyes free", users.
>
>Will
Received on Wednesday, 29 December 2004 05:25:03 UTC