This version:
Latest version:
Previous version:
This is the first publication.
Editors:
Bill Scholz, Unisys Corporation. mailto:Bill.Scholz@Unisys.com
Brad Porter, TellMe Corporation. mailto:Brad@Tellme.com
Contributors:
See Acknowledgements
Copyright © 2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document describes fundamental
requirements for the specification under development in the W3C Voice Interface
Objects Task Force, an activity shared jointly by the Voice Browser Working Group and the Multimodal Interaction Working Group.
This section describes the status of this
document at the time of its publication. Other documents may supersede this
document. The latest status of this document series is maintained at the W3C.
1. General Requirements
2. Automatic Speech Recognition (ASR) Requirements
2.1 ASR Overview
2.2 Context-free Grammar (CFG) Recognizers
2.3 Statistical Language Model (SLM) Recognizers
2.4 Dictation Recognizers
2.5 Language Identification
3. Text-To-Speech (TTS) Requirements
4. Speaker Verification Requirements
5. Audio Recording and Playback Requirements
6. Audio Mixer Requirements
7. Glossary
8. Acknowledgements
Voice Interface Objects include the components of a runtime deployment architecture created to support standalone speech applications or the speech component of multimodal interaction applications; these components perform the activities described below.

Voice Interface Objects are a collection of software middleware components introduced between the Interaction Manager and mode-specific speech and audio interaction controls, as illustrated in Figure 1. Voice Interface Objects thus constitute a container class which encapsulates all low-level audio interaction controls (both input and output).
Any Voice Interface Object must support the following:
1. Object instantiation
2. Object activation
3. Object deactivation
4. Ability to bind data to the invoking object's data structure
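A minimal sketch of this lifecycle contract, expressed in TypeScript purely for illustration (the interface and member names are assumptions, not part of this specification):

    // Illustrative only: names and signatures are assumptions, not normative.
    interface VoiceInterfaceObject<TResult> {
      // 1. Object instantiation is performed by the host environment (e.g. a factory).
      // 2. Object activation: the object begins participating in the audio interaction.
      activate(): Promise<void>;
      // 3. Object deactivation: the object stops participating in the interaction.
      deactivate(): Promise<void>;
      // 4. Bind the object's result data into the invoking object's data structure.
      bindResult(target: Record<string, unknown>, field: string): void;
      // Results are delivered asynchronously to the parent context.
      onResult(handler: (result: TResult) => void): void;
    }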
Voice Interface Objects should support the delegation of simple interaction idioms. For example, the Interaction Manager should be able to delegate barge-in support to the Voice Interface Objects; this implies that a speech recognition engine must be able to route Voice Activity Detection events to the speech synthesis engine to stop the prompt.

In addition, Voice Interface Objects must:

- asynchronously process events
- provide DOM bindings
- not block the operation of other components defined in the page
- be able to generate events to the parent context
- support a declarative object representation in XML
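As one illustration of the delegated barge-in idiom described above, the sketch below (TypeScript, with hypothetical engine interfaces) shows a recognizer routing a Voice Activity Detection event to a synthesizer so the prompt is stopped without a round trip through the Interaction Manager:

    // Hypothetical engine interfaces; event and method names are assumptions.
    interface Recognizer { onVoiceActivity(handler: () => void): void; }
    interface Synthesizer { stopPrompt(): void; }

    // Delegated barge-in: when speech is detected, the prompt is cut off locally,
    // and the parent context is notified with an event after the fact.
    function wireBargeIn(asr: Recognizer, tts: Synthesizer,
                         notifyParent: (event: string) => void): void {
      asr.onVoiceActivity(() => {
        tts.stopPrompt();          // stop the prompt immediately
        notifyParent("barge-in");  // raise an event to the parent context
      });
    }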
Voice Interaction Objects are a class of components that collect input from a user via voice input and audio output. They may range from a simple recognition collector to a complex interaction dialog. That voice input can then be coordinated through an interaction manager with the input from other modalities.

Voice Interaction Objects must be:

- able to support multiple languages
- able to associate a confidence score with data returned to the parent context (as voice is not a deterministic input)
- able to return partial results to the parent context
- able to return a set of potential results to the parent context
- able to adhere to general voice input collection requirements
- able to perform basic tapered prompt dialogs
- able to conditionally traverse from one dialog context to another
- able to sequence multiple interactions in order to generate a single result
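One possible shape for the data returned to the parent context, covering the confidence, partial-result, and set-of-potential-results items above (a TypeScript sketch; all field names are illustrative assumptions):

    // Illustrative result structure; field names are assumptions, not normative.
    interface VoiceResultHypothesis {
      utterance: string;         // recognized text for this hypothesis
      interpretation?: unknown;  // semantic interpretation, if any
      confidence: number;        // 0.0 - 1.0, since voice input is not deterministic
    }

    interface VoiceResult {
      language: string;                     // e.g. "en-US"; multiple languages must be supported
      partial: boolean;                     // true when this is a partial (in-progress) result
      hypotheses: VoiceResultHypothesis[];  // a set of potential results (N-best)
    }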
Messages sent to an object are used to make requests. Messages sent by an object indicate events.
(VIO-A1): The VIO framework must support the ability to enable multiple speech recognition tasks (e.g., distinct recognition functions) simultaneously off the same audio source.

Motivation:

This allows speech recognition objects to handle specific tasks, such as one recognizer for navigation commands and another for the current dialog task.
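For example, a host might enable two independently configured recognition tasks against one shared audio source, roughly as sketched below in TypeScript (the factory and option names are assumptions, not part of this specification):

    // Illustrative only: the RecognizerTask type and createRecognizer factory are assumptions.
    interface RecognizerTask { audioSource: string; grammar: string; }

    function createRecognizer(options: RecognizerTask): RecognizerTask {
      // A real implementation would attach to the engine; here we just record the configuration.
      return { ...options };
    }

    // Two recognition tasks enabled simultaneously off the same audio source:
    const navigationAsr = createRecognizer({ audioSource: "mic-1", grammar: "navigation-commands.grxml" });
    const dialogAsr     = createRecognizer({ audioSource: "mic-1", grammar: "date-picker.grxml" });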
(VIO-A2): ASR Objects must have the ability to support multiple simultaneously active grammars.

Motivation:

At one level this offers syntactic sugar and saves the need for constructing grammar rules to combine several sets of grammars.

At another level, this provides an alternative to (VIO-A1), so the same speech object can handle a wide variety of kinds of utterances that would otherwise be handled by multiple speech objects, each dedicated to a given kind of utterance.
(VIO-A3): ASR Objects must support the ability to associate processing directives with active grammars (e.g., with objects).

Motivation:

In this model, when you enable the grammar, you specify what object/call-back to send the result to. This allows authors to use separate dialog handlers for different purposes, e.g. navigation dialogs versus date pickers.
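One way to picture this association is a registration call that pairs each grammar with the object or callback that should receive its results (TypeScript sketch; the API surface is an assumption):

    // Illustrative grammar/handler registration, with assumed names.
    type ResultHandler = (result: { utterance: string; confidence: number }) => void;

    class GrammarRegistry {
      private handlers = new Map<string, ResultHandler>();

      // Enabling a grammar also specifies which object/call-back receives its results.
      enableGrammar(grammarUri: string, handler: ResultHandler): void {
        this.handlers.set(grammarUri, handler);
      }

      dispatch(grammarUri: string, result: { utterance: string; confidence: number }): void {
        this.handlers.get(grammarUri)?.(result);
      }
    }

    // Separate dialog handlers for different purposes:
    const registry = new GrammarRegistry();
    registry.enableGrammar("navigation.grxml", r => console.log("navigate:", r.utterance));
    registry.enableGrammar("date-picker.grxml", r => console.log("date:", r.utterance));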
(VIO-A4): ASR Objects must support the ability to take uncertainty into account when multiple grammars are simultaneously active and associated with their own handlers (i.e., deal with uncertainty).

<note: Scott: let the interaction manager handle it?>

<note: there is some concern over this being a strict requirement.>

Motivation:

The simplicity of associating different handlers with different grammars is appealing. The uncertainty inherent in speech recognition presents a complication: a given utterance might match two grammars, each of which has a different handler.

A simple approach is to pick the hypothesis with the highest confidence score, possibly weighted by the relative priorities of the respective handlers. Another approach is to involve an adjudicator function provided by the application developer as part of the interaction manager.
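The simple highest-confidence approach mentioned above could look like the following sketch (TypeScript; the hypothesis shape and the priority weighting are assumptions). An application-provided adjudicator in the interaction manager would replace this function under the second approach.

    // Illustrative adjudication: pick the hypothesis with the highest confidence,
    // optionally weighted by the relative priority of the grammar's handler.
    interface Hypothesis { grammarUri: string; utterance: string; confidence: number; }

    function adjudicate(hypotheses: Hypothesis[],
                        priorities: Map<string, number>): Hypothesis | undefined {
      return hypotheses.reduce<Hypothesis | undefined>((best, h) => {
        const weight = priorities.get(h.grammarUri) ?? 1.0;
        const bestWeight = best ? (priorities.get(best.grammarUri) ?? 1.0) : 0;
        return !best || h.confidence * weight > best.confidence * bestWeight ? h : best;
      }, undefined);
    }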
(VIO-A5): ASR Objects must support the ability to push a grammar to a recognizer as a stream.

Motivation:

This might be useful for dynamically constructed grammars, and can be contrasted with the pull model offered by passing a URI to the recognizer for retrieving the grammar from a server.

<note: Stephen: is this the same as remote vs inline grammar? (SpeechSC permits a push of a grammar)>
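One way to picture the push model, in contrast to the URI-based pull model, is a sketch like the following (TypeScript; the loader interface and method names are assumptions):

    // Illustrative contrast between the pull and push models; method names are assumed.
    interface GrammarLoader {
      loadGrammarFromUri(name: string, uri: string): void;         // pull: recognizer fetches from a server
      loadGrammarFromStream(name: string, content: string): void;  // push: caller supplies the grammar body
    }

    function pushDynamicGrammar(recognizer: GrammarLoader, cities: string[]): void {
      // Build an SRGS-like grammar on the fly from application data and push it directly.
      const body = `<grammar root="city"><rule id="city"><one-of>` +
        cities.map(c => `<item>${c}</item>`).join("") +
        `</one-of></rule></grammar>`;
      recognizer.loadGrammarFromStream("cities", body);
    }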
(VIO-A6): ASR Objects must support the ability to control which events deactivate a grammar.

Motivation:

In a system-directed dialog, the grammar is usually deactivated when the user stops talking or a timeout occurs. It may also be deactivated when an error occurs, e.g. out-of-vocabulary or malformed speech is detected.

In an open dialog, the user may be able to say something at any time, so the grammar remains active indefinitely until it is explicitly deactivated.

In a dictation system, the grammar likewise remains active until it is explicitly deactivated.
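This requirement could be satisfied by a per-grammar policy describing which events deactivate it, roughly as in the sketch below (TypeScript; the event names echo the recognition events listed later, but the structure itself is an assumption):

    // Illustrative deactivation policy; names are assumptions, not normative.
    type RecognitionEvent = "end-of-speech" | "timeout" | "bad-recognition" | "babble-timeout";

    interface GrammarActivation {
      grammarUri: string;
      deactivateOn: RecognitionEvent[];  // empty list: stays active until explicitly deactivated
    }

    // System-directed dialog: deactivate when the user stops talking, times out, or errs.
    const formField: GrammarActivation = {
      grammarUri: "date-picker.grxml",
      deactivateOn: ["end-of-speech", "timeout", "bad-recognition"],
    };

    // Open dialog or dictation: remains active indefinitely.
    const hotword: GrammarActivation = { grammarUri: "help-anytime.grxml", deactivateOn: [] };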
(VIO-A7): Voice Interface Objects must be able to perform more than one type of processing with the same audio stream at one time.

Motivation:

It should be possible to perform simultaneous recording and recognition or simultaneous recognition and verification. This capability would be very useful for system tuning.
(VIO-A8): ASR objects should provide the following categories of recognition events:

- No audio source, e.g. the user forgot to plug in the microphone or the telephone connection has been broken
- Too noisy, i.e. don't expect speech to work
- Time out, i.e. the user didn't respond in a timely fashion
- Started talking, an event signaling the time that the user started to talk
- Stopped talking, an event signaling the time that the user stopped talking (or paused for too long)
- Bad recognition, the recognizer failed to match the utterance to the grammar
- Partial recognition, i.e. an incomplete match
- N-best recognition results
- Dictation chunk, such as a sentence or other convenient chunk in a stream of such chunks when using a dictation grammar
- Phonetic transcription
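For illustration, these categories could be surfaced to the parent context as a simple event type plus payload (TypeScript sketch; the names and payload fields are assumptions):

    // Illustrative event taxonomy mirroring the categories above; names are assumptions.
    type AsrEvent =
      | { type: "no-audio-source" }   // e.g. microphone unplugged, telephone line dropped
      | { type: "too-noisy" }
      | { type: "timeout" }
      | { type: "started-talking"; atMs: number }
      | { type: "stopped-talking"; atMs: number }
      | { type: "bad-recognition" }
      | { type: "partial-recognition"; utterance: string }
      | { type: "nbest-results"; hypotheses: { utterance: string; confidence: number }[] }
      | { type: "dictation-chunk"; text: string }
      | { type: "phonetic-transcription"; phones: string[] };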
Three types of Automatic Speech Recognizers are supported:

1. Recognizers constrained by Context-Free Grammars (CFG)
2. Recognizers constrained by Statistical Language Models (SLM)
3. Unconstrained dictation recognizers
Methods shall be provided to perform the activities listed below (VIO-A1 through VIO-A8). Notification messages shall be generated for the events listed below (VIO-A9 through VIO-A13). The properties listed below (VIO-A14 through VIO-A16) shall be supported.
(VIO-A1): Initialize Recognizer - a method shall be provided to prepare the recognizer for operation at program startup.

(VIO-A2): Uninitialize Recognizer - a method shall be provided to return the recognizer to its preinitialized state.

(VIO-A3): Load Grammar - a method shall be provided to load a grammar identified by <name> from location <URI> into the recognizer's data space.

(VIO-A4): Unload Grammar - a method shall be provided to unload a grammar identified by <name> from the recognizer's data space.

(VIO-A5): Enable Grammar - a method shall be provided to enable a grammar identified by <name> to be active during the next subsequent Start Listen.

(VIO-A6): Disable Grammar - a method shall be provided to disable a grammar identified by <name> from being active during the next subsequent Start Listen.

(VIO-A7): Start Listen - a method shall be provided to initiate the recognition process. Any acoustic information received after start of speech is detected, and before end of speech or babble timeout is detected, is analyzed in the context of all currently loaded and enabled grammars.

(VIO-A8): Stop Listen - a method shall be provided to prematurely terminate recognizer listen activity.
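Taken together, these methods could map onto an interface along the following lines (TypeScript; the method names follow the requirement titles, but the signatures are assumptions, not normative):

    // Illustrative recognizer control interface; signatures are assumptions.
    interface Recognizer {
      initialize(): Promise<void>;                       // VIO-A1: prepare for operation at startup
      uninitialize(): Promise<void>;                     // VIO-A2: return to the preinitialized state
      loadGrammar(name: string, uri: string): Promise<void>;  // VIO-A3: load grammar <name> from <URI>
      unloadGrammar(name: string): void;                 // VIO-A4: remove grammar <name> from the data space
      enableGrammar(name: string): void;                 // VIO-A5: active during the next Start Listen
      disableGrammar(name: string): void;                // VIO-A6: not active during the next Start Listen
      startListen(): void;                               // VIO-A7: analyze audio against enabled grammars
      stopListen(): void;                                // VIO-A8: prematurely terminate listening
    }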
(VIO-A9): Start of Speech - a message shall be generated to signal detection of the leading edge of speech.

(VIO-A10): End of Speech - a message shall be generated to signal detection of silence after Start of Speech equal in duration to the value of the end-of-speech-duration property.

(VIO-A11): Recognition Result Ready - a message shall be generated to signal completion of analysis by the recognizer of acoustic information in the context of currently-enabled grammars, signaling that the result of the analysis is available for processing by the entity invoking the recognizer.

(VIO-A12): Silence Detected - a message shall be generated to signal expiration of time equal to the value of the start-of-speech-duration property without the detection of Start of Speech.

(VIO-A13): Babble Timeout - a message shall be generated to signal that continuous acoustic activity has exceeded the value of the babble-duration property.
(VIO-A14): Start-of-Speech-Duration Property - a property shall be supported which records the time in milliseconds allowed for the detection of Start of Speech before the 'Silence Detected' message is generated (i.e., acoustic energy detectable as speech must occur within this interval).

(VIO-A15): End-of-Speech-Duration Property - a property shall be supported which records the duration in milliseconds of silence (after start of speech) required for the recognizer to generate an 'End of Speech' message.

(VIO-A16): Babble-Duration Property - a property shall be supported which records the duration in milliseconds of continuous acoustic activity after which the 'Babble Timeout' message will be generated.
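These three timing properties govern the Silence Detected, End of Speech, and Babble Timeout messages; a configuration sketch (TypeScript, with field names and default values that are purely illustrative assumptions) might look like this:

    // Illustrative timing configuration; field names and values are assumptions.
    interface RecognizerTimeouts {
      startOfSpeechDurationMs: number;  // speech must begin within this interval or Silence Detected fires
      endOfSpeechDurationMs: number;    // silence of this length after speech triggers End of Speech
      babbleDurationMs: number;         // continuous acoustic activity this long triggers Babble Timeout
    }

    const defaults: RecognizerTimeouts = {
      startOfSpeechDurationMs: 5000,
      endOfSpeechDurationMs: 800,
      babbleDurationMs: 15000,
    };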
(VIO-T1): Request Prompt
<Note: I'm not sure what Dave Raggett means by request prompt>
(VIO-T2): Stop Talking
(VIO-T3): Insert in Prompt Queue
(VIO-T4): Play Prompt Queue
(VIO-T5): Flush Prompt Queue
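A minimal prompt-queue sketch corresponding to VIO-T2 through VIO-T5 (TypeScript; the class and method names are assumptions, not normative):

    // Illustrative prompt queue; names are assumptions.
    class PromptQueue {
      private queue: string[] = [];

      insert(prompt: string): void { this.queue.push(prompt); }  // VIO-T3: Insert in Prompt Queue

      play(speak: (prompt: string) => void): void {              // VIO-T4: Play Prompt Queue
        for (const prompt of this.queue) speak(prompt);
        this.queue = [];
      }

      flush(): void { this.queue = []; }                         // VIO-T5: Flush Prompt Queue

      stopTalking(stopEngine: () => void): void {                // VIO-T2: Stop Talking (e.g. on barge-in)
        stopEngine();
        this.flush();
      }
    }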
The purpose of Speaker Verification is to pass back a decision whether or not a speaker is who (s)he purports to be. The purpose of Speaker Identification is to pass back identity information with a confidence value to the parent context.
(VIO-S1): etc
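The two result shapes implied above, a verification decision and an identity with a confidence value, might look like the following sketch (TypeScript; all field names are illustrative assumptions):

    // Illustrative result shapes returned to the parent context; names are assumptions.
    interface VerificationResult {
      claimedIdentity: string;
      accepted: boolean;    // decision: is the speaker who (s)he purports to be?
      confidence: number;   // 0.0 - 1.0
    }

    interface IdentificationResult {
      identity: string;     // best-matching enrolled speaker
      confidence: number;   // confidence value passed back to the parent context
    }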
(VIO-R1): Start recording
(VIO-R2): Stop recording
(VIO-R3): Play recording
(VIO-R4): Pause recording
(VIO-R5): Rewind recording
(VIO-R6): Fast forward recording
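These requirements suggest a simple transport-style control surface (TypeScript sketch; the interface and method names are assumptions, not normative):

    // Illustrative recording/playback control surface; names are assumptions.
    interface AudioRecorder {
      startRecording(): void;                          // VIO-R1
      stopRecording(): void;                           // VIO-R2
      playRecording(): void;                           // VIO-R3
      pauseRecording(): void;                          // VIO-R4
      rewindRecording(offsetMs?: number): void;        // VIO-R5
      fastForwardRecording(offsetMs?: number): void;   // VIO-R6
    }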
Task -

Object -

Voice Interface Object -

Input - An interactive speech or multimodal implementation will use multiple input modes, such as audio, speech, handwriting, and keyboarding. The various modes of input will be described in Section 3.

Output - An interactive speech or multimodal implementation will use one or more modes of output, such as speech, text, graphics, audio files, and animation. The various modes of output will be described in Section 4.

Interaction manager - The interaction manager controls the sequence of exchange of information between the human user and the application functions. It may support a variety of interaction styles, including:

· System-directed dialog - The system prompts the user by asking a question, and the user responds by answering the question.

· User-directed dialog - The user directs the computer to perform an action, and the computer responds by presenting the results of the action to the user.

· Mixed initiative dialog - This is a mixture of system-directed and user-directed dialogs in which the human user and the system take turns "driving" the dialog.
This section is informative.
This document was jointly prepared by the members of the W3C Voice Browser Working Group and the Multimodal Interaction Working Group.