Tropo ASR/TTS Functionality Summary

Editor: Daniel C. Burnett, Voxeo
Authors: Many people at Voxeo
Creation Date: 3 March 2011

1. Introduction

This document contains a summarized version of the ASR and TTS capabilities available in Tropo by Voxeo. It is presented not as a complete proposal but rather in order to provide additional viewpoints for the W3C HTML Speech Incubator Group discussion and development of an API. For more information on the Tropo API, see http://www.tropo.com/docs/scripting/.

2. Methods

2.1 say() method


Says something to the user. Unlike ask, this function has no ability to wait for a response from the user.

On voice / phone sessions:

Say can play text strings using Text To Speech (TTS) or play URLs as audio files.

On SMS / IM sessions:

Say will send a text string to the user via instant message or SMS.

More information on say can be found in sections 3.1 (Playing audio files) and 3.3 (Manipulating Say with SSML).



2.1.1 Usage

say( text: String, {
    allowSignals: String or Array,
    onSignal: Function,
    voice: String } )


2.1.2 Parameters

Parameter        Data Type          Default         Required/Optional    Description
text             String             (undefined)     Required             In the case of a voice session, this can either be the text to be rendered by the Text to Speech Engine (may also be SSML), or a URL to an audio file to be played. In the case of a text messaging session, this will be the text to be sent to the user.



2.1.3 Map parameters

Parameter        Data Type          Default         Required/Optional    Description
allowSignals     String or Array    (any signal)    Optional             This parameter allows you to assign a signal to this function. Events with a matching signal name will "interrupt" the function (i.e., stop it from running). If it already ran and completed, your interrupt request will be ignored. If the function has not run yet, the interrupt will be queued until it does run.

By default, allowSignals will accept any signal as valid; if you define allowSignals as "", it defines the function as "uninterruptible". You can also use an array - the function will stop if it receives an interrupt signal matching any of the names in the array.
onSignal         Function           (none)          Optional             This specifies a callback function to run if the function is interrupted by a signal.
voice            String             wilma           Optional             Specifies the voice to be used when speaking text back to a user. Examples are:

  • maria - Castilian Spanish - Female
  • jorge - Castilian Spanish - Male
  • jeanne - French - Female
  • jean - French - Male
  • wilma - US English - Female
  • fred - US English - Male
  • mary - British English - Female
  • george - British English - Male



2.1.4 Events

none


2.1.5 Return values

none


2.1.6 Code samples
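
A minimal sketch of the usage in 2.1.1, calling say with and without the voice parameter. The say function defined at the top is a hypothetical stand-in that only records its arguments so the snippet runs outside Tropo; in a Tropo script the platform supplies say and the stub would be dropped. The sample phrases are illustrative.

```javascript
// Hypothetical stand-in for Tropo's say(), used only so this sketch runs
// outside the platform. It records each call instead of speaking.
var spoken = [];
function say(text, options) {
    options = options || {};
    spoken.push({ text: text, voice: options.voice || "wilma" });
}

// Speak with the default voice (wilma, US English).
say("Thank you for calling.");

// Speak Spanish text with one of the Castilian Spanish voices.
say("Gracias por llamar.", { voice: "maria" });

console.log(spoken);
```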

2.2 ask() method


Ask is essentially a say that requires input; it requests information from the caller and waits for a response.

For voice calls:

Audibly prompts the user for input. This can be in the form of synthesized speech or an audio file. Responses can be collected by speech or touch-tone keypad (DTMF).

For text (SMS/IM) sessions:

Prompts the user via text and the user can respond back by replying to the IM or SMS message.

See section 3.5, Asking a Question, for more information.


2.2.1 Usage

ask( text: String, {
    allowSignals: String or Array,
    attempts: Integer,
    bargein: Boolean,
    choices: String,
    minConfidence: Float,
    mode: String,
    onBadChoice: Function,
    onChoice: Function,
    onError: Function,
    onEvent: Function,
    onHangup: Function,
    onSignal: Function,
    onTimeout: Function,
    recognizer: String,
    terminator: String,
    timeout: Float,
    voice: String } )


2.2.2 Parameters

Parameter        Data Type          Default         Required/Optional    Description
text             String             (undefined)     Optional             In the case of a voice session, this can either be the text to be rendered by the Text to Speech Engine (may also be SSML), or a URL to an audio file to be played. In the case of a text messaging session, this will be the text to be sent to the user.



2.2.3 Map parameters

Parameter        Data Type          Default         Required/Optional    Description
allowSignals     String or Array    (any signal)    Optional             This parameter allows you to assign a signal to this function. Events with a matching signal name will "interrupt" the function (i.e., stop it from running). If it already ran and completed, your interrupt request will be ignored. If the function has not run yet, the interrupt will be queued until it does run.

By default, allowSignals will accept any signal as valid; if you define allowSignals as "", it defines the function as "uninterruptible". You can also use an array - the function will stop if it receives an interrupt signal matching any of the names in the array.
attempts         Integer            1               Optional             This defines the total number of times the user will hear the prompt before the ask ends in either a nomatch or noinput.
bargein          Boolean            true            Optional             The bargein attribute specifies whether or not the caller will be able to interrupt the TTS/audio output with a touch-tone keypress or voice utterance. A value of 'true' indicates that the user is allowed to interrupt, while a value of 'false' forces the caller to listen to the entire prompt before being allowed to give input to the application. If using Python, make sure to use True and False instead of true and false.
choices          String             none            Optional             The choices field defines a simple grammar that will be active while the user is prompted for input. For more information, see sections 3.5 (Asking a Question) and 3.5.6 (Simple Grammars).
minConfidence    Float              0.3             Optional             This is the minimum amount of confidence that the "recognizer" must have before matching a response to a choice. As an example, if your grammar defines the choices as red, blue and green, and someone says "rud", a particular confidence will be set identifying how likely "rud" was meant to be "red". This is expressed as a Float value between 0 and 1.
mode             String             any             Optional             The type of caller input allowed for voice calls. This can be 'dtmf' (touch-tone input), 'speech' or 'any'.
onBadChoice      Function           (undefined)     Optional             This registers an event handler that fires when the number of attempts has been exhausted without a valid response from the user.
onChoice         Function           (undefined)     Optional             This registers an event handler that fires when a valid response is provided by a user.
onError          Function           (undefined)     Optional             This registers an event handler that fires when a system error (a non-user error) occurs during input. See onBadChoice and onTimeout for information on how to handle user errors.
onEvent          Function           (undefined)     Optional             This registers an event handler that fires as a catch-all for all events.
onHangup         Function           (undefined)     Optional             This registers an event handler that fires when the user disconnects or hangs up.
onSignal         Function           (none)          Optional             This specifies a callback function to run if the function is interrupted by a signal.
onTimeout        Function           (undefined)     Optional             This registers an event handler that fires when the user doesn't respond to the prompt within a specified period of time.
recognizer       String             en-us           Optional             The language to listen for; example options are:

  • 'en-gb' for British English
  • 'en-us' for US English
  • 'es-es' for Castilian Spanish
  • 'fr-fr' for French
terminator       String             (none)          Optional             This is the touch-tone key (also known as "DTMF digit") that indicates the end of input. A common use of the terminator is the # key, e.g., "Please enter your five digit zip code, then press the pound key."
timeout          Float              30.0            Optional             The amount of time Tropo will wait--in seconds and after sending or playing the prompt--for the user to begin a response.
voice            String             wilma           Optional             Specifies the voice to be used when speaking text back to a user. Example voices are:

  • maria - Castilian Spanish - Female
  • jorge - Castilian Spanish - Male
  • jeanne - French - Female
  • jean - French - Male
  • wilma - US English - Female
  • fred - US English - Male
  • mary - British English - Female
  • george - British English - Male



2.2.4 Events

choice   error   event   hangup   timeout


2.2.5 Return values

event


2.2.6 Code samples

3. Additional Info

3.1 Playing audio files

Playing audio is just as easy as using Text To Speech (TTS) - just provide the say with a link to an accessible audio file and Tropo will play it back:

"Accessible audio file" means any web-accessible file: a file hosted on your server or from a hosting service.

You can also play multiple audio files in the same say (or ask):

You can also mix and match audio with text-to-speech:
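
As a rough sketch of how a mixed prompt might be broken into playable parts, the helper below splits a prompt string into audio and text-to-speech segments. It assumes that audio files are given as whitespace-delimited http or https URLs inside the prompt text; that convention, the splitPrompt name, and the example.com URL are assumptions made for illustration, not behavior stated in this document.

```javascript
// Hypothetical helper: split a prompt into TTS text and audio URL segments,
// assuming URLs are separated from surrounding text by whitespace.
function splitPrompt(prompt) {
    var segments = [];
    var words = [];
    function flushWords() {
        if (words.length > 0) {
            segments.push({ type: "tts", value: words.join(" ") });
            words = [];
        }
    }
    prompt.split(/\s+/).forEach(function (token) {
        if (/^https?:\/\//.test(token)) {
            flushWords();                        // close any pending text run
            segments.push({ type: "audio", value: token });
        } else if (token !== "") {
            words.push(token);
        }
    });
    flushWords();
    return segments;
}

var segments = splitPrompt(
    "Please hold. http://example.com/hold.wav Thanks for waiting.");
console.log(segments);
```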

3.2 Audio file formats

The supported sound formats (and their proper file extensions) are as follows:

You can use other formats like MP3, but they will be automatically downsampled and converted due to limitations in telephony standards, so it's always best to have your files in 8-bit, 8 kHz u-law format from the start.

During playback, audio is streamed directly from the source - the file isn't downloaded first and then played.

3.3 Manipulating Say with SSML

There are many cases when you need or just want to control the pitch, volume and intonation of your prompts and responses. To make this easy, Tropo natively supports a standard called the Speech Synthesis Markup Language (SSML).

SSML is an international standard from the W3C for controlling the pace, tone, pitch and all around sound of computer-generated voices. Here's a command that says something and then repeats it at a slower speed:
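
A hedged sketch of such a command: the helper below builds an SSML string that speaks a phrase once at the default rate and then again inside a prosody element with rate set to slow. The speak and prosody elements and the rate attribute come from the SSML standard; the slowRepeat and escapeXml names are made up for this example, and whether Tropo also expects an XML declaration on the string is not covered here. In a Tropo script the resulting string would be handed to say.

```javascript
// Escape the characters that would otherwise break the XML markup.
function escapeXml(text) {
    return text.replace(/&/g, "&amp;")
               .replace(/</g, "&lt;")
               .replace(/>/g, "&gt;");
}

// Build SSML that says `text` normally, then repeats it at a slower rate.
function slowRepeat(text) {
    var safe = escapeXml(text);
    return '<speak>' + safe +
           ' <prosody rate="slow">' + safe + '</prosody>' +
           '</speak>';
}

var ssml = slowRepeat("One potato, two potato.");
console.log(ssml);
```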

The previous example uses the rate attribute of the SSML prosody element to control the playback speed. Other attributes of the prosody element include pitch, contour, range, duration and volume.

3.3.1 say-as

In addition to controlling pitch, volume and intonation, there are also times when you need to control how the Text to Speech engine interprets text, especially numbers. The SSML say-as element allows you to define whether the text should be interpreted as currency, digits, number, date, time or phone. While most of the options are self-explanatory, it may help to note that digits will interpret the text as individual digits instead of one complete number ('1234' will be spoken as 'one, two, three, four') while number will interpret the text as a complete value ('1234' will be spoken as 'one thousand two hundred thirty four'). Here are code examples displaying the use of say-as:
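
The sketch below contrasts digits and number by wrapping the same text in a say-as element. The interpret-as attribute is the standard SSML attribute name; passing the type names from the paragraph above verbatim as its value is an assumption, since a given engine may expect a different or prefixed token. The sayAs helper is hypothetical.

```javascript
// Wrap text in an SSML say-as element with the given interpretation type.
// Assumption: the type names from the text are valid interpret-as values.
function sayAs(type, text) {
    return '<say-as interpret-as="' + type + '">' + text + '</say-as>';
}

// Intended reading: "one, two, three, four"
var asDigits = '<speak>Your PIN is ' + sayAs("digits", "1234") + '.</speak>';

// Intended reading: "one thousand two hundred thirty four"
var asNumber = '<speak>You have ' + sayAs("number", "1234") + ' points.</speak>';

console.log(asDigits);
console.log(asNumber);
```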

3.4 Events and signals

Tropo provides a REST API to allow for events, called signals, to be sent to functions. The first subsection explains how this works, and the following subsections give examples of allowing a single signal, multiple signals, and unnamed signals, concluding with subsections on how to specify a callback that will be executed when a signal is received and how event queuing works.

3.4.1 Signal requests

Signals are generated by making an HTTP/HTTPS GET or POST request.

A GET request is of the following form:

https://api.tropo.com/1.0/sessions/<session-id>/signals?action=signal&value=<myname>

where <session-id> is the 16-byte GUID session ID that Tropo gives you in currentCall.sessionId and <myname> is the signal name you want sent to that session.
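
As a small sketch, the GET form can be assembled from a session ID and a signal name as follows, with a single & between the query parameters. The signalUrl helper and the sample session ID are placeholders invented for this example; URL-encoding both values is a defensive choice rather than a documented requirement.

```javascript
// Build the GET URL that sends `signalName` to the session `sessionId`.
function signalUrl(sessionId, signalName) {
    return "https://api.tropo.com/1.0/sessions/" +
           encodeURIComponent(sessionId) +
           "/signals?action=signal&value=" +
           encodeURIComponent(signalName);
}

// Placeholder session ID; a real one comes from currentCall.sessionId.
var url = signalUrl("0123456789abcdef0123456789abcdef", "exit");
console.log(url);
```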

A POST request is of the following form:

https://api.tropo.com/1.0/sessions/<session-id>/signals

where <session-id> is the 16-byte GUID session ID that Tropo gives you in currentCall.sessionId and the header and body are one of the following forms:

where myname should be replaced with the signal name you want to send to the session.

For both POST and GET, the response body is as follows:

3.4.2 Interrupting Your Code -- One Signal

Say you want to play some hold music, then interrupt it later. In order to interrupt, you would give the say that's playing the hold music a signal using the allowSignals parameter. You can then make a web service call using that name and Tropo will stop running that function. If the function has already run and completed, your interrupt request will be ignored. If it has not run yet, it will be queued until the function runs.

This example uses "exit" for allowSignals:

3.4.3 Interrupting Your Code -- Multiple Signals

You can also use an array of signals - the function will stop if it receives an interrupt signal matching any of the names in the array.
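
The matching rules for allowSignals (a single name, an array of names, the default wildcard, and the empty string) can be sketched as one predicate. This is a hypothetical model written for illustration, not Tropo's implementation; the function name and the sample signal names are made up.

```javascript
// Return true if a method configured with `allowSignals` would be
// interrupted by the incoming `signal` name.
function isInterruptedBy(allowSignals, signal) {
    if (allowSignals === undefined) { return true; }   // default: any signal
    if (allowSignals === "") { return false; }         // uninterruptible
    if (Array.isArray(allowSignals)) {
        return allowSignals.indexOf(signal) !== -1;    // any name in the list
    }
    return allowSignals === signal;                    // single named signal
}

console.log(isInterruptedBy("exit", "exit"));
console.log(isInterruptedBy(["exit", "stopHold"], "stopHold"));
console.log(isInterruptedBy("", "exit"));
```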

3.4.4 Unnamed Signals

If you don't provide a function with a signal (or a list of signals), it will be interrupted by any signal sent to the API; the default value of allowSignals is essentially a wildcard. This allows you to send a signal to interrupt a Tropo function without telling Tropo ahead of time that you intend to interrupt the function. The say in this app can be interrupted by any signal:

However, if you specifically define allowSignals using "", this will be interpreted as "never interrupt" instead:

3.4.5 Signal-based callbacks

You can include an onSignal parameter that specifies a callback function to run if the function is interrupted. If it is included, the callback runs, the method ends and your script continues. If it's not present, the method simply ends and returns control back to your script. Here's an example:

3.4.6 Event queuing

The event queue is a first-in-first-out queue. When an interruptible Tropo method runs, it starts processing the queue, discarding events that don't match until it reaches one that does. It then stops, leaving the rest of the items on the queue. Events that arrive during the execution of a Tropo method are processed in the same way. This means if you have a number of interruptible events in an application, you should take care to send interrupts in the order that they appear in your application.

Consider the following application. This application uses the "conference" method defined in Tropo but not included in this document.

You believe the hold music to be already over, so you send only an "endconf" event. But the hold music is still playing, so the say function will receive the "endconf" event, see it doesn't match, and discard it. The conference will never be interrupted.

To be safe, you can send both the "exithold" event and the "endconf" event, in order. If the hold music is already over, the conference will reject and discard the "exithold" event and move on to the next event, "endconf". If it isn't over, the hold music will be interrupted, followed immediately by the interruption of the conference.
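
The queuing behavior described above can be modeled with a few lines of JavaScript. This is a simplified simulation written for illustration, not Tropo's implementation: runMethod is a hypothetical function, each method accepts exactly one signal name, and all signals are assumed to be on the queue before the method starts.

```javascript
// Simulate one interruptible method: take signals off the FIFO queue,
// discarding non-matching ones, and stop at the first match.
function runMethod(name, allowSignal, queue) {
    while (queue.length > 0) {
        var signal = queue.shift();
        if (signal === allowSignal) {
            return name + " interrupted by " + signal;
        }
        // Non-matching signal: it is discarded and the loop continues.
    }
    return name + " not interrupted";
}

// Case 1: only "endconf" is sent while the hold music is still playing.
var q1 = ["endconf"];
var case1 = [runMethod("say", "exithold", q1),
             runMethod("conference", "endconf", q1)];

// Case 2: both signals are sent, in application order.
var q2 = ["exithold", "endconf"];
var case2 = [runMethod("say", "exithold", q2),
             runMethod("conference", "endconf", q2)];

// Case 3: both are sent, but the hold music has already finished.
var q3 = ["exithold", "endconf"];
var case3 = runMethod("conference", "endconf", q3);

console.log(case1, case2, case3);
```

In case 1 the say consumes and discards "endconf", so nothing is left to interrupt the conference; cases 2 and 3 show why sending both signals in order is the safe choice.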

3.5 Asking a Question

This section explains and provides examples of how to use ask.

A typical ask has three steps:

  1. Provide Tropo the question you want the user to answer.
  2. Provide Tropo with a list of possible choices.
  3. Take action based on what choice the user selects.

3.5.1 Basic Example

Here's a basic example that asks the user their favorite color, repeats it back to him/her, and records the result in the log. Best part? It'll work on any channel - phone, text or IM:
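
A minimal sketch of that flow is shown here. The say, log and ask functions at the top are hypothetical stand-ins so the snippet runs outside Tropo (the ask stub simply pretends the user picked the first choice); in a Tropo script the platform provides them. Reading the answer from result.value follows the event return value described in 3.6.5.

```javascript
// Hypothetical stand-ins for the Tropo functions, for running off-platform.
var output = [];
function say(text) { output.push("say: " + text); }
function log(text) { output.push("log: " + text); }
function ask(text, options) {
    say(text);
    // Pretend the user answered with the first listed choice.
    var first = options.choices.split(",")[0].trim();
    return { name: "choice", value: first };
}

// The three steps: ask the question, list the choices, act on the answer.
var result = ask("What's your favorite color? Choose from red, blue or green.", {
    choices: "red, blue, green"
});
say("You said " + result.value);
log("They said " + result.value);

console.log(output);
```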

3.5.2 Asking for Digits

Tropo supports a number of simple ways to specify typical choices. For example, if you want to collect a single digit input from a user, you could do this:

Or, more simply, you can replace the numbered list with the [DIGITS] grammar:

Or if you want the user to enter their four digit pin code, just replace the ask prompt with

"Please enter your four digit pin"

and change the [DIGITS] grammar to:

[4 DIGITS]

The simple syntax Tropo uses to allow you to specify possible user inputs is called, simply put, "simple grammar". Simple grammar doesn't necessarily mean basic, however, as it can be fairly complex and comprehensive.

3.5.3 Advanced use of ask

A slightly more advanced version of the Tropo ask method has five steps:

  1. Provide Tropo the question you want the user to answer.
  2. Provide Tropo with a list of possible choices.
  3. Optional: Change the default settings Tropo uses for your "ask"
  4. Optional: Handle cases where the user does nothing or makes a bad choice.
  5. Take action based on what choice the user selects.

Tropo has several optional parameters you can set that control the behavior of an ask. For example, let's say you want to repeat your question when the user doesn't respond to it before the default 30 second timeout occurs. You can set the optional attempts parameter to ask the question up to three times before giving up:

3.5.4 Changing ask Timeouts

Tropo will by default wait up to thirty seconds for the user to respond. What if you want to wait 10 seconds between attempts instead of thirty? Just use the timeout parameter:

With the above, Tropo will still ask up to three times, but only wait 10 seconds between attempts.

3.5.5 Error handling for ask

What if the user never responds to your ask, responds with something other than the possible choices you've specified or responds with something Tropo doesn't understand? You'd probably like to tell the user they made a bad choice, and possibly provide more information so they can respond properly. To do this, just use the optional onTimeout and onBadChoice event handlers:
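
To show how attempts and the handlers fit together, the sketch below drives a small simulator with a scripted list of user responses. simulateAsk is a hypothetical stand-in for ask, not Tropo's implementation; it assumes that onTimeout and onBadChoice fire once per failed attempt (following the event.attempt example in 3.6.5.2) and that a null response means the user said nothing before the timeout.

```javascript
// Hypothetical simulator: `responses[i]` is what the user does on attempt
// i + 1 (null means no response before the timeout).
function simulateAsk(prompt, options, responses) {
    var choices = options.choices.split(",").map(function (c) {
        return c.trim();
    });
    var attempts = options.attempts || 1;
    var event = null;
    for (var attempt = 1; attempt <= attempts; attempt++) {
        var response = responses[attempt - 1];
        if (response === null || response === undefined) {
            event = { name: "timeout", attempt: attempt };
            if (options.onTimeout) { options.onTimeout(event); }
        } else if (choices.indexOf(response) !== -1) {
            event = { name: "choice", value: response, attempt: attempt };
            if (options.onChoice) { options.onChoice(event); }
            return event;                       // valid answer ends the ask
        } else {
            event = { name: "badChoice", attempt: attempt };
            if (options.onBadChoice) { options.onBadChoice(event); }
        }
    }
    return event;                               // attempts exhausted
}

var heard = [];
var result = simulateAsk("What's your favorite color?", {
    choices: "red, blue, green",
    attempts: 3,
    timeout: 10.0,
    onTimeout: function (event) { heard.push("Sorry, I didn't hear anything."); },
    onBadChoice: function (event) { heard.push("Please say red, blue or green."); },
    onChoice: function (event) { heard.push("You chose " + event.value + "."); }
}, [null, "purple", "blue"]);

console.log(heard, result);
```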

3.5.6 Simple Grammars

Grammar is just a fancy word for telling Tropo what to expect from the user; Simple Grammar is the term we use for the built-in default way of expressing input requirements in Tropo.

In the Asking for Digits section, we introduced the [DIGITS] grammar and used it to tell Tropo to expect 1 digit. You can also express a range of digits by using [4-5 DIGITS] instead (the number values are defined by you; it could be [10-20 DIGITS], [1-2 DIGITS], and so on). If your caller enters 4 digits instead of 5, Tropo will wait for a period of time (defined by the timeout value, which automatically defaults to 30.0 seconds, but can be set longer or shorter) before considering the input complete. If you want to allow your callers to press a key to tell Tropo they're done, just add the terminator parameter to your ask statement and set it to the key they should press.
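
The length rules for the digit grammars above can be sketched as a small checker. This is an illustrative model only: parseDigitsGrammar and matchesDigits are hypothetical helpers, and the code covers just the [DIGITS], [N DIGITS] and [N-M DIGITS] forms, ignoring timeouts and terminators.

```javascript
// Parse "[DIGITS]", "[4 DIGITS]" or "[4-5 DIGITS]" into a length range.
// Returns null when the string is not one of these forms.
function parseDigitsGrammar(grammar) {
    var m = /^\[(?:(\d+)(?:-(\d+))?\s+)?DIGITS\]$/.exec(grammar.trim());
    if (!m) { return null; }
    var min = m[1] ? parseInt(m[1], 10) : 1;   // bare [DIGITS] means 1 digit
    var max = m[2] ? parseInt(m[2], 10) : min; // no range: exactly `min`
    return { min: min, max: max };
}

// Check whether `input` is all digits and its length fits the grammar.
function matchesDigits(grammar, input) {
    var range = parseDigitsGrammar(grammar);
    if (!range) { return false; }
    return /^\d+$/.test(input) &&
           input.length >= range.min &&
           input.length <= range.max;
}

console.log(parseDigitsGrammar("[4-5 DIGITS]"));
console.log(matchesDigits("[4-5 DIGITS]", "1234"));
console.log(matchesDigits("[4 DIGITS]", "123"));
```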

Because there are two primary ways of interacting with users over the phone (keypad and voice) and by default Tropo will listen for input in both modes, we'll want to specify "keypad" input for our pin number request. This behavior can be controlled using the mode parameter. Possible values are "keypad", "speech" or "any":

So far we've covered how to collect numeric data, but Tropo is capable of so much more. You can use Tropo's simple grammar notation to recognize words and even entire phrases.

The following script works via the phone, text or IM. It's a company directory that starts off by welcoming the user, then asks the user who they're trying to reach; this can be the name of a person or department. Depending on the type selected (person or department), one of two web services is called to play back the contact information.

There's a lot going on in this short example so let's break it down.

First, we tell Tropo to speak the introductory prompt which asks the user who they'd like to reach. We pass in two parameters to the ask method: choices and onChoice. The choices parameter instructs Tropo to listen for a set of words; in this case, department and people names. The key thing here is the notation used to define the grammar:

department(support, engineering, sales), person(jose, jason, adam)

The words outside the parentheses are called concepts. Concepts provide a context when handling the user's response. The company directory grammar defines two concepts: department and person. When the caller says one of the items inside the parentheses (such as support, jose, etc.), an event is triggered that gives us access to both the spoken word and the concept to which it belongs.
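
To make the concept notation concrete, the sketch below parses a grammar string of this shape into a lookup from each word to its concept. parseConcepts is a hypothetical helper that handles only the concept(word, word, ...) form shown above; it is not how Tropo processes grammars internally.

```javascript
// Map every word inside the parentheses to the concept named outside them.
function parseConcepts(grammar) {
    var lookup = {};
    var pattern = /(\w+)\s*\(([^)]*)\)/g;
    var match;
    while ((match = pattern.exec(grammar)) !== null) {
        var concept = match[1];
        match[2].split(",").forEach(function (word) {
            var trimmed = word.trim();
            if (trimmed) { lookup[trimmed] = concept; }
        });
    }
    return lookup;
}

var lookup = parseConcepts(
    "department(support, engineering, sales), person(jose, jason, adam)");

console.log(lookup.support);
console.log(lookup.jose);
```

A handler could then branch on the concept to decide which of the two web services to call, with the spoken word passed along as the lookup key.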

3.5.7 Advanced Grammars

While Tropo's Simple Grammar is pretty awesome, it's not really suited for extremely large data sets.

For example, let's say you were writing a travel app and wanted to allow your users to speak their destination city or airport. There are hundreds, if not thousands of airports, and many more ways to actually say them. They could say "JFK", "John F. Kennedy", "NYC", "New York International", etc. These types of complex grammars are best suited for the Speech Recognition Grammar Specification (SRGS). The SRGS is a W3C standard way of controlling speech recognition engines. SRGS can take a variety of forms, with the most popular being Grammar XML (or GRXML for short).

To use GRXML from your Tropo Scripting application, simply provide the URL to an external file.

For more information on SRGS grammars, see http://www.w3.org/TR/speech-grammar/

3.6 Events

3.6.1 Choice event

This event is returned whenever a valid response is provided by a user, such as returning "john" when the choices provided are "john, jane".

3.6.1.1 Usage

choice(, {anonymous function: STRING})

3.6.1.2 Attributes

None

3.6.1.3 Code Samples

3.6.2 Error event

This event is returned whenever an unexpected, significant system error occurs, such as the ASR engine failing. This should be a very rare event; it's unlikely to be encountered with any regularity, if at all.

3.6.2.1 Usage

error(, {anonymous function: STRING})

3.6.2.2 Attributes

None

3.6.2.3 Code Samples

3.6.3 Hangup event

This event is returned when the user disconnects or "hangs up" the call.

3.6.3.1 Usage

hangup(, {anonymous function: STRING})

3.6.3.2 Attributes

None

3.6.3.3 Code Samples

3.6.4 Timeout event

This event is returned when the user did not respond within a specified period of time, defined by the 'timeout' parameter.

3.6.4.1 Usage

timeout(, {anonymous function: STRING})

3.6.4.2 Attributes

None

3.6.4.3 Code Samples

3.6.5 Event return value

Represents the result of a system operation.

3.6.5.1 Usage

event( attempt: String, {
    choice: String,
    name: String,
    recordURI: String,
    value: String } )

3.6.5.2 Parameters

Parameter        Data Type          Default         Required/Optional    Description
attempt          String             (none)          Optional             This allows you to set behavior for an individual attempt (the number of possible attempts is defined by the attempts parameter). In the following JavaScript example, attempts is defined as 3, so the prompt is played up to three times. For event.attempt 1 and event.attempt 2, different behavior has been defined (two different says); on the third attempt, if the caller still does not return valid input, the call will just disconnect.

ask("What's your favorite color?", {
    attempts:3,
    choices:"red, blue, green",
    onBadChoice:function(event) {
        switch(event.attempt) {
        case 1:
            say("We don't support that color. You can say red, blue or green.")
            break
        case 2:
            say("It's really simple, man. Just say red, blue or green.")
            break
        }
    }
})

choice           String             none            Optional             Please note that the event structure has additional information available in a "choice" object. Specifically, there is an "event.choice" object that itself has the following fields:
  • event.choice.concept - Only tags or concepts returned from recognition.
  • event.choice.interpretation - Full semantic interpretation of the results.
  • event.choice.utterance - What the caller actually input before interpretation.
  • event.choice.confidence - The ASR engine's confidence in the result
  • event.choice.xml - The raw NLSML result returned from the underlying MRCP engine.
name             String             none            Optional             Depending on the event that ends the method, the event.name attribute will be set to:

choice, record, timeout, badChoice, hangup, silenceTimeout, or error.
recordURI        String             (none)          Optional             This returns either the location of a recording when working with text (e.g. when used with log) or audibly returns what was actually recorded (e.g. when used with a say on a voice call), such as in this JavaScript example:

answer()
record("Please leave your message at the beep.", {
    beep:true, timeout:10, silenceTimeout:7, maxTime:60,
    onRecord: function(event) {
        log("Recording result = " + event.recordURI)
        say("You said " + event.recordURI)
    }
})
value            String             none            Optional             The event.value attribute will be set for choice and record events, as will event.recordURI as appropriate. The rules are:

  • If choice and record are on, return event.name = choice, event.value = grammar result, and event.recordURI = recordURI
  • If choice is on but not record, return event.name = choice, event.value = grammar result, and event.recordURI = null
  • If record is on but not choice, return event.name = record, event.value = recordURI, and event.recordURI = recordURI
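
The three rules above can be written as a small function. This is a sketch for illustration: buildEvent and its parameters are hypothetical names, the example URL is a placeholder, and the case where neither choice nor record is on is not covered by the rules, so the function returns null for it.

```javascript
// Compute the event fields from the rules for choice and record.
function buildEvent(choiceOn, recordOn, grammarResult, recordURI) {
    if (choiceOn) {
        return {
            name: "choice",
            value: grammarResult,
            recordURI: recordOn ? recordURI : null
        };
    }
    if (recordOn) {
        return { name: "record", value: recordURI, recordURI: recordURI };
    }
    return null; // neither is on: not defined by the rules above
}

var uri = "http://example.com/recording.wav"; // placeholder location
console.log(buildEvent(true, true, "red", uri));
console.log(buildEvent(true, false, "red", uri));
console.log(buildEvent(false, true, "red", uri));
```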

3.6.5.3 Code Samples