Voice Interface Objects Requirements

 

This version:

http://www.w3c.org/

Latest version:

http://www.w3c.org/

Previous version:

This is the first publication.

Editors:

Bill Scholz, Unisys Corporation <Bill.Scholz@Unisys.com>

Brad Porter, TellMe Corporation <Brad@Tellme.com>

Contributors:

See Acknowledgements

Abstract

This document describes fundamental requirements for the specification under development in the W3C Voice Interface Objects Task Force, an activity shared jointly by the Voice Browser Working Group and the Multimodal Interaction Working Group.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

Table of contents

1. General Requirements

2. Automatic Speech Recognition (ASR) Requirements

2.1 ASR Overview

2.2 Context-Free Grammar (CFG) Recognizers

2.3 Statistical Language Model (SLM) Recognizers

2.4 Dictation Recognizers

2.5 Language Identification

3. Text-To-Speech (TTS) Requirements

4. Speaker Verification Requirements

5. Audio Recording and Playback Requirements

6. Audio Mixer Requirements

7. Glossary

8. Acknowledgements

 

Introduction

Voice Interface Objects include the components of a runtime deployment architecture created to support standalone speech applications or the speech component of multimodal interaction applications, which perform the following activities:

 

 

1. General Requirements

1.1 Introduction


‘Voice Interface Objects’ are a collection of software middleware components introduced between the Interaction Manager and mode-specific speech and audio interaction controls, as illustrated in Figure 1.

Voice Interface Objects thus constitute a container class which encapsulates all low-level audio interaction controls (both input and output).

1.2 General Interface Requirements

Any Voice Interface Object must support the following:

1. Object instantiation

2. Object activation

3. Object deactivation

4. Ability to bind data to the invoking object’s data structure.

Voice Interface Objects should support the delegation of simple interaction idioms. For example, the Interaction Manager should be able to delegate barge-in support to the Voice Interface Objects; this implies that a speech recognition engine must be able to route Voice Activity Detection events to the speech synthesis engine to stop the prompt.
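A minimal sketch of such delegation, assuming hypothetical Recognizer and Synthesizer interfaces (none of these names are defined by this specification):

```typescript
// Hypothetical interfaces; the names are illustrative only.
interface Synthesizer {
  stopPrompt(): void;
}

interface Recognizer {
  // Register a listener for Voice Activity Detection events.
  onVoiceActivity(listener: () => void): void;
}

// The Interaction Manager delegates barge-in handling once, at binding time,
// rather than mediating every Voice Activity Detection event itself.
function delegateBargeIn(recognizer: Recognizer, synthesizer: Synthesizer): void {
  recognizer.onVoiceActivity(() => synthesizer.stopPrompt());
}
```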

 

 

1.3 Requirements to Operate in an XML Document Framework

Voice Interface Objects must be able to:

- process events asynchronously

- provide DOM bindings

- operate without blocking other components defined in the page

- generate events to the parent context

- be represented declaratively in XML (a sketch follows this list)
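As an illustration of the asynchronous-event and parent-context requirements, the sketch below assumes a DOM-like host environment; the class name and event name are hypothetical, not part of this specification.

```typescript
// Hypothetical sketch only; the class name and event name are illustrative.
class VoiceFieldElement extends EventTarget {
  // Begin collection asynchronously so other components defined in the page keep running.
  async collect(): Promise<void> {
    const utterance = await this.listenOnce(); // non-blocking audio capture
    // Generate an event to the parent context carrying the collected value.
    this.dispatchEvent(new CustomEvent("voice-result", { detail: utterance }));
  }

  private listenOnce(): Promise<string> {
    // Placeholder for the underlying recognition call.
    return Promise.resolve("example utterance");
  }
}
```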

 

1.4 Interaction Task Requirements

 

Voice Interface Objects are a class of components that collect input from a user via voice and produce audio output. They may range from a simple recognition collector to a complex interaction dialog. The collected voice input can then be coordinated, through an interaction manager, with input from other modalities.

 

1.5 General Voice Input Collection Requirements

- able to support multiple languages

- associate a confidence score with data returned to parent context (as voice is not a deterministic input)

- able to return partial results to parent context

- able to return a set of potential results to the parent context

 

1.6 Dialog Input Collection Requirements

- adhere to the general voice input collection requirements

- able to perform basic tapered prompt dialogs (see the sketch after this list)

- able to conditionally traverse from one dialog context to another

- able to sequence multiple interactions in order to generate a single result
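A minimal sketch of a tapered prompt dialog, assuming a hypothetical ask primitive that plays a prompt and returns a recognition result (no such primitive is defined by this specification):

```typescript
// Hypothetical recognition primitive: plays a prompt and returns the recognized
// text, or null on a no-match or timeout.
declare function ask(prompt: string): Promise<string | null>;

// Tapered prompting: each retry uses a progressively more explicit prompt.
async function collectDate(): Promise<string | null> {
  const prompts = [
    "What date?",
    "Please say the date of your appointment.",
    "Say a date, for example 'March 3rd'.",
  ];
  for (const prompt of prompts) {
    const result = await ask(prompt);
    if (result !== null) {
      return result; // traverse to the next dialog context on success
    }
  }
  return null; // hand control back to the Interaction Manager after repeated failures
}
```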

 

1.7 Voice Interface Objects must be message driven. 

Messages sent to an object are used to make requests. Messages sent by an object indicate events.

2. Automatic Speech Recognition (ASR) Requirements

2.1 Overview

(VIO-A1): The VIO framework must support the ability to enable multiple speech recognition tasks (i.e., distinct pieces of recognition functionality) simultaneously from the same audio source.

 

Motivation:

 

This allows speech recognition objects to handle specific tasks, such as one recognizer for navigation commands and another for the current dialog task.
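One way this could be expressed, assuming a hypothetical createRecognizer factory that accepts a shared audio source (all names and grammar URIs below are illustrative):

```typescript
// Hypothetical API: both recognizers consume the same audio source in parallel.
interface AudioSource {} // opaque handle to a microphone or telephony channel
interface Recognizer {
  enableGrammar(uri: string): void;
  startListen(): void;
}
declare function createRecognizer(source: AudioSource): Recognizer;

function startParallelRecognition(source: AudioSource): void {
  const navigation = createRecognizer(source); // global navigation commands
  navigation.enableGrammar("http://example.com/grammars/navigation.grxml");

  const task = createRecognizer(source); // grammar for the current dialog task
  task.enableGrammar("http://example.com/grammars/date-picker.grxml");

  navigation.startListen();
  task.startListen();
}
```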

 

(VIO-A2): ASR Objects must have the ability to support multiple simultaneously active grammars.

 

Motivation:

 

At one level, this is syntactic sugar that saves the need to construct grammar rules combining several sets of grammars.

 

At another level, this provides an alternative to (VIO-A1), so that the same speech object can handle a wide variety of utterance types that would otherwise be handled by multiple speech objects, each dedicated to a given type of utterance.

 

(VIO-A3): ASR Objects must support the ability to associate processing directives (e.g., handler objects) with active grammars.

 

Motivation:

 

In this model, when a grammar is enabled, the author specifies which object or callback the result should be sent to. This allows authors to use separate dialog handlers for different purposes, e.g., navigation dialogs versus date pickers.
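A minimal sketch, assuming an enableGrammar signature that takes a result handler (the signature and names are illustrative, not normative):

```typescript
// Hypothetical result type and recognizer; the handler is specified when the grammar is enabled.
interface RecognitionResult {
  text: string;
  confidence: number;
}
interface Recognizer {
  enableGrammar(uri: string, onResult: (r: RecognitionResult) => void): void;
}

function configureHandlers(recognizer: Recognizer): void {
  // Results matching the navigation grammar go to the navigation dialog handler.
  recognizer.enableGrammar("navigation.grxml", (r) => handleNavigation(r.text));
  // Results matching the date grammar go to the date-picker handler.
  recognizer.enableGrammar("date.grxml", (r) => handleDatePick(r.text));
}

function handleNavigation(command: string): void {
  /* ... */
}
function handleDatePick(date: string): void {
  /* ... */
}
```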

 

(VIO-A4): ASR Objects must support the ability to take uncertainty into account when multiple grammars are simultaneously active, each associated with its own handler.

<note: Scott: let the interaction manager handle it?> 

<note: there is some concern over this being a strict requirement.> 

 

Motivation:

 

The simplicity of associating different handlers with different grammars is appealing, but the uncertainty inherent in speech recognition presents a complication: a given utterance might match two grammars, each of which has a different handler.

 

A simple approach is to pick the hypothesis with the highest confidence score, possibly weighted by the relative priorities of the respective handlers. Another approach is to involve an adjudicator function provided by the application developer as part of the interaction manager.
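The first approach might look like the following sketch (the types and the priority weighting are illustrative, one possible policy rather than a normative rule):

```typescript
// Each hypothesis carries the grammar it matched, its confidence, and the handler's priority.
interface Hypothesis {
  grammar: string;
  confidence: number; // 0.0 .. 1.0, as reported by the recognizer
  priority: number;   // relative weight assigned to the grammar's handler
  text: string;
  handler: (text: string) => void;
}

// Pick the hypothesis with the highest priority-weighted confidence score.
function adjudicate(hypotheses: Hypothesis[]): Hypothesis | undefined {
  return hypotheses.reduce<Hypothesis | undefined>(
    (best, h) =>
      best === undefined || h.confidence * h.priority > best.confidence * best.priority
        ? h
        : best,
    undefined
  );
}
```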

 

(VIO-A5):  ASR Objects must support the ability to push a grammar to a recognizer as a stream.

 

Motivation:

 

This might be useful for dynamically constructed grammars, and contrasts with the pull model in which a URI is passed to the recognizer so that it can retrieve the grammar from a server.

 

<note: Stephen: is this the same as remote vs inline grammar?  (SpeechSC permits a push of a grammar)>
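The contrast between the two models might be sketched as follows (the method names are hypothetical):

```typescript
// Hypothetical recognizer supporting both the pull and the push model.
interface Recognizer {
  // Pull model: the recognizer fetches the grammar from a server itself.
  loadGrammarFromURI(uri: string): Promise<void>;
  // Push model: the application streams a dynamically constructed grammar to the recognizer.
  loadGrammarFromStream(grammar: ReadableStream<Uint8Array>): Promise<void>;
}

// Example: push a grammar built on the fly from the user's contact list.
declare function buildContactGrammar(names: string[]): string; // returns SRGS XML text

async function pushDynamicGrammar(recognizer: Recognizer, contactNames: string[]): Promise<void> {
  const srgs = buildContactGrammar(contactNames);
  await recognizer.loadGrammarFromStream(new Blob([srgs]).stream());
}
```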

 

(VIO-A6):  ASR Objects must support the ability to control which events deactivate a grammar.

 

Motivation:

 

In a system-directed dialog, the grammar is usually deactivated when the user stops talking or a timeout occurs. It may also be deactivated when an error occurs, e.g., when out-of-vocabulary or malformed speech is detected.

 

In an open dialog, the user may be able to say something at any time, so the grammar remains active indefinitely until it is explicitly deactivated.

 

In a dictation system, the same holds: the grammar remains active until it is explicitly deactivated.
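One possible way to express this control, assuming a hypothetical enumeration of deactivating events (illustrative only):

```typescript
// Hypothetical: the author lists which events should deactivate the grammar.
type DeactivatingEvent = "end-of-speech" | "timeout" | "recognition-error" | "explicit";

interface GrammarActivation {
  deactivateOn: DeactivatingEvent[];
}

// System-directed dialog: deactivate when the user stops talking, times out, or errs.
const systemDirected: GrammarActivation = {
  deactivateOn: ["end-of-speech", "timeout", "recognition-error"],
};

// Open dialog or dictation: remain active until explicitly deactivated.
const openDialog: GrammarActivation = { deactivateOn: ["explicit"] };
```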

 

(VIO-A7): Voice Interface Objects must be able to perform more than one type of processing on the same audio stream at one time.

 

Motivation:

 

It should be possible to perform simultaneous recording and recognition or simultaneous recognition and verification.  This capability would be very useful for system tuning.

 

(VIO-A8): ASR objects should provide the following categories of recognition events (modeled in the sketch after this list):

- No audio source, e.g., the user forgot to plug in the microphone or the telephone connection has been broken.

- Too noisy, i.e., speech recognition cannot be expected to work.

- Time out, i.e., the user did not respond in a timely fashion.

- Started talking, an event signaling the time at which the user started to talk.

- Stopped talking, an event signaling the time at which the user stopped talking (or paused for too long).

- Bad recognition, i.e., the recognizer failed to match the utterance to the grammar.

- Partial recognition, i.e., an incomplete match.

- N-best recognition results.

- Dictation chunk, such as a sentence or other convenient chunk in a stream of such chunks when using a dictation grammar.

- Phonetic transcription.
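These categories could be modeled as a discriminated union, for example (the field names are illustrative, not normative):

```typescript
// Hypothetical event model for the categories listed above.
type RecognitionEvent =
  | { kind: "no-audio-source" }
  | { kind: "too-noisy" }
  | { kind: "timeout" }
  | { kind: "started-talking"; timestampMs: number }
  | { kind: "stopped-talking"; timestampMs: number }
  | { kind: "bad-recognition" }
  | { kind: "partial-recognition"; partialText: string }
  | { kind: "nbest-results"; hypotheses: { text: string; confidence: number }[] }
  | { kind: "dictation-chunk"; text: string }
  | { kind: "phonetic-transcription"; phonemes: string[] };
```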

 

Three types of Automatic Speech Recognizers are supported:

1. Recognizers constrained by Context-Free Grammars (CFG)

2. Recognizers constrained by Statistical Language Models (SLM)

3. Unconstrained dictation recognizers

 

Methods shall be provided to perform the following activities:


Notification messages shall be generated for the following events:

 

The following properties shall be supported:

 

2.2 CFG Recognizers

2.2.1 Methods

(VIO-A1): Initialize Recognizer - a method shall be provided to prepare the recognizer for operation at program startup.

 

(VIO-A2): Uninitialize Recognizer - a method shall be provided to return the recognizer to its preinitialized state.

 

(VIO-A3): Load Grammar - a method shall be provided to load a grammar identified by <name> from location <URI> into the recognizer's data space. 

 

(VIO-A4): Unload Grammar - a method shall be provided to unload a grammar identified by <name> from the recognizer's data space.

 

(VIO-A5): Enable Grammar - a method shall be provided to enable a grammar identified by <name> to be active during the next Start Listen.

 

(VIO-A6): Disable Grammar - a method shall be provided to prevent a grammar identified by <name> from being active during the next Start Listen.

 

(VIO-A7): Start Listen - a method shall be provided to initiate the recognition process. Any acoustic information received after start of speech is detected, and before end of speech or babble timeout is detected, is analyzed in the context of all currently loaded and enabled grammars.

 

(VIO-A8): Stop Listen - a method shall be provided to prematurely terminate recognizer listen activity.
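
Taken together, the methods above suggest a recognizer interface of roughly the following shape (a sketch only; the names and signatures are not normative):

```typescript
// Hypothetical interface covering the CFG recognizer methods described above.
interface CFGRecognizer {
  initialize(): Promise<void>;   // prepare the recognizer for operation at program startup
  uninitialize(): Promise<void>; // return the recognizer to its preinitialized state
  loadGrammar(name: string, uri: string): Promise<void>; // load <name> from <URI>
  unloadGrammar(name: string): Promise<void>;
  enableGrammar(name: string): void;  // active during the next Start Listen
  disableGrammar(name: string): void;
  startListen(): void;                // analyze audio against all enabled grammars
  stopListen(): void;                 // prematurely terminate listening
}
```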

 

2.2.2 ASR Messages

(VIO-A9): Start of Speech - a message shall be generated to signal detection of the leading edge of speech.

 

(VIO-A10): End of Speech - a message shall be generated to signal detection of silence, after Start of Speech, equal in duration to the value of the end-of-speech-duration property.

 

(VIO-A11): Recognition Result Ready - a message shall be generated to signal completion of analysis by the recognizer of acoustic information in the context of currently enabled grammars, signaling that the result of the analysis is available for processing by the entity invoking the recognizer.

 

(VIO-A12): Silence Detected - a message shall be generated to signal expiration of a period equal to the value of the start-of-speech-duration property without the detection of Start of Speech.

 

(VIO-A13): Babble Timeout - a message shall be generated to signal expiration of the interval specified by the babble-duration property.

 

2.2.3 ASR Properties

(VIO-A14): Start-of-Speech-Duration Property - a property shall be supported which records the time in milliseconds allowed, from the start of listening, for 'Start of Speech' to be detected before the 'Silence Detected' message is generated (i.e., acoustic energy detectable as speech must occur within this interval).

 

(VIO-A15): End-of-Speech-Duration Property - a property shall be supported which records the duration in milliseconds of silence (after start of speech) required for the recognizer to generate an 'End of Speech' message.

 

(VIO-A16): Babble-Duration Property - a property shall be supported which records the duration in milliseconds of continuous acoustic activity after which the 'Babble Timeout' message will be generated.
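The relationship between these properties and the messages in 2.2.2 might be configured as in the following sketch (the property names mirror the requirements above; the values are illustrative defaults, not normative):

```typescript
// Hypothetical property bag; all durations are in milliseconds.
interface ASRProperties {
  startOfSpeechDuration: number; // no Start of Speech within this interval => 'Silence Detected'
  endOfSpeechDuration: number;   // this much silence after speech begins   => 'End of Speech'
  babbleDuration: number;        // continuous speech for this long         => 'Babble Timeout'
}

const defaults: ASRProperties = {
  startOfSpeechDuration: 5000,
  endOfSpeechDuration: 1500,
  babbleDuration: 20000,
};
```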

2.3 SLM Recognizers

2.3.1 Methods

2.3.2 ASR Messages

2.3.3 ASR Properties

 

2.4 Dictation Recognizers

2.4.1 Methods

2.4.2 ASR Messages

2.4.3 ASR Properties

 

2.5 Language Identification

2.5.1 Methods

2.5.2 ASR Messages

2.5.3 ASR Properties

 

3. Text-To-Speech (TTS) Requirements

3.1 Overview

3.2 Methods

(VIO-T1): Request Prompt

<Note: I'm not sure what Dave Raggett means by ‘request prompt’>

 

(VIO-T2): Stop Talking

 

(VIO-T3): Insert in Prompt Queue

 

(VIO-T4): Play Prompt Queue

 

(VIO-T5): Flush Prompt Queue
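
The prompt-queue methods listed above might be sketched as the following interface (the names are hypothetical; the exact meaning of 'Request Prompt' is still an open question per the editorial note above):

```typescript
// Hypothetical TTS object covering the methods listed above.
interface TTSEngine {
  requestPrompt(text: string): void; // exact semantics still under discussion (see note above)
  stopTalking(): void;               // interrupt the prompt currently being rendered
  insertInPromptQueue(text: string, position?: number): void;
  playPromptQueue(): void;           // render queued prompts in order
  flushPromptQueue(): void;          // discard any prompts not yet rendered
}
```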

3.3 Messages

3.4 Properties

 

4. Speaker Verification (SVER) Requirements

4.1 Overview

The purpose of Speaker Verification is to pass back a decision as to whether or not a speaker is who he or she purports to be.

 

The purpose of Speaker Identification is to pass back identity information with a confidence value to parent context.

4.2 Methods

(VIO-S1): etc

4.3 Messages

4.4 Properties

 

5. Audio Recording and Playback Requirements

5.1 Overview

5.2 Methods

(VIO-R1): Start recording

 

(VIO-R2):  Stop recording

 

(VIO-R3):  Play recording

 

(VIO-R5):  Rewind recording

 

(VIO-R6):  Fast forward recording

5.3 Messages

5.4 Properties

 

6. Audio Mixer Requirements

6.1 Overview

6.2 Methods

(VIO-M1):

6.3 Messages

6.4 Properties

 

7. Glossary

Task -

 

Object –

 

Voice Interface Object -

 

Input — An interactive speech or multimodal implementation will use multiple input modes, such as audio, speech, handwriting, and keyboard input. The various modes of input will be described in Section 3.

 

Output — An interactive speech or multimodal implementation will use one or more modes of output, such as speech, text, graphics, audio files, and animation. The various modes of output will be described in Section 4.

 

Interaction manager — The interaction manager controls the sequence of information exchange between the human user and the application functions. It may support a variety of interaction styles, including:

- System-directed dialog — The system prompts the user by asking a question, and the user responds by answering the question.

- User-directed dialog — The user directs the computer to perform an action, and the computer responds by presenting the results of the action to the user.

- Mixed-initiative dialog — This is a mixture of system-directed and user-directed dialogs in which the human user and the system take turns "driving" the dialog.

 

8. Acknowledgements

This section is informative.

This document was jointly prepared by the members of the W3C Voice Browser Working Group and the Multimodal Interaction Working Group.

Appendices

Appendix A: