RE: [HTML Speech] Let's get started!

Here's a set of use cases, loosely grouped as "Recognition Scenarios", "TTS Scenarios", and "Service & Technology Concerns".

It's not exhaustive, and misses some of the cases posted in the last few days, but it covers a lot of what's necessary to effectively enable visual speech web apps.


*** RECOGNITION SCENARIOS ***

+ INTERACTING WITH MULTIPLE INPUT ELEMENTS

Many web applications incorporate a collection of input fields, generally expressed as forms, with text boxes to type into, lists to select from, and a "submit" button at the bottom.  For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" might fill in a web form with two input elements for the origin (place & date), two for the destination (place & time), one for the mode of transport (flight/bus/train), and a command (find) for the "submit" button.  This approach is valuable because the user only has to initiate speech recognition once to fill in the entire screen.  If they had to initiate speech for every input element (five times in this example), the application would be unusable.
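
As a rough sketch (not a real API), a single recognition result carrying slot-level semantics could be bound to several form fields at once.  The result shape, the slot names, and the "data-slot" attribute convention below are all hypothetical:

    // Hypothetical sketch: one recognition fills several form fields at once.
    // The result shape and slot names are illustrative, not a defined API.
    interface SlotResult {
      slots: Record<string, string>; // e.g. { origin: "New York", ... }
    }

    function bindSlotsToForm(result: SlotResult): void {
      for (const [slot, value] of Object.entries(result.slots)) {
        // Assume each input element carries a data-slot attribute matching the grammar's slot names.
        const field = document.querySelector<HTMLInputElement>(`input[data-slot="${slot}"]`);
        if (field) {
          field.value = value;
        }
      }
    }

    // One utterance ("find a flight from New York to San Francisco on Monday
    // morning returning Friday afternoon") yields a single result that fills
    // origin, destination, dates, and mode together.
    bindSlotsToForm({
      slots: {
        origin: "New York",
        destination: "San Francisco",
        departure: "Monday morning",
        return: "Friday afternoon",
        mode: "flight",
      },
    });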

+ INTERACTING WITHOUT VISIBLE INPUT ELEMENTS

Some speech applications are oriented around determining the user's intent before gathering any specific input, and hence their first interaction may have no visible input fields whatsoever, or may accept speech input that is far less constrained than the fields on the screen.  For example, the user may simply be presented with the text "how may I help you?" (maybe with some speech synthesis or an earcon), and then utter their request, which the application analyzes in order to route the user to an appropriate part of the application.  This isn't simply selection from a menu, because the list of options may be huge, and the number of ways each option could be expressed by the user is also huge.  In any case, the speech UI (grammar) is very different from whatever fields may or may not be displayed on the screen.


+ RE-RECOGNITION

Some sophisticated applications will re-use the same utterance in two or more recognitions.  For example, an application may ask "how may I help you?", to which the user responds "find me a round trip from New York to San Francisco on Monday morning, returning Friday afternoon".  An initial recognition against a broad language model may be sufficient to understand that the user wants the "flight search" portion of the app.  Rather than make the user repeat themselves, the application will simply re-use the existing utterance for a second recognition against the flight-search grammar.
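
To illustrate, here's a hedged sketch of what re-recognition might look like, assuming an API that lets the application keep a handle to the captured audio and recognize it again against a different grammar.  The recognize() function, result shape, and grammar URIs are all hypothetical:

    // Hypothetical sketch of re-recognition; none of these names are a defined API.
    interface RecoResult {
      text: string;
      intent?: string;
    }

    // Placeholder for a call to a recognition service; a real implementation
    // would send the audio and a grammar reference to the service.
    async function recognize(audio: Blob, grammarUri: string): Promise<RecoResult> {
      void audio;
      void grammarUri;
      return { text: "", intent: undefined };
    }

    async function handleUtterance(audio: Blob): Promise<void> {
      // First pass: broad language model, just to determine intent.
      const broad = await recognize(audio, "https://example.com/grammars/top-level.grxml");

      if (broad.intent === "flight-search") {
        // Second pass: re-use the same captured audio against the narrower
        // flight-search grammar, so the user doesn't have to repeat themselves.
        const detailed = await recognize(audio, "https://example.com/grammars/flight-search.grxml");
        console.log("Flight search request:", detailed.text);
      }
    }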


+ RECOGNITION WHILE RECORDING

Some applications want both the text representation of what the user said and the captured audio (e.g., a multimodal answering/message system that sends a text-based email or text message but also records a message).  This can also be used in a customer care web application, where the text recognition may attempt to solve the customer's issue (e.g., a FAQ lookup) but the recorded audio may be submitted for later agent processing, without requiring the user to repeat or retype the problem if the recognition was poor.
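
A minimal sketch of how an application might consume such a result, assuming (hypothetically) that the recognition result exposes both the transcript and the captured audio:

    // Illustrative only: a result carrying both the transcript and the raw audio.
    interface RecoWithAudio {
      transcript: string;
      confidence: number; // 0.0 - 1.0
      audio: Blob;        // the captured utterance, retained for later review
    }

    async function handleSupportRequest(result: RecoWithAudio): Promise<void> {
      if (result.confidence >= 0.6) {
        // Good enough to attempt an automated FAQ lookup using the text.
        console.log("Searching FAQ for:", result.transcript);
      }
      // Regardless of recognition quality, submit the recorded audio so an agent
      // can listen later without asking the user to repeat themselves.
      // The endpoint is hypothetical.
      await fetch("/support/ticket-audio", { method: "POST", body: result.audio });
    }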


+ CONTINUOUS RECOGNITION

Some applications will provide an experience where a user can speak for an extended period, and the application will process recognition events as the user speaks.  Some examples:
1. When dictating an email, the user will continue to utter sentences until they're done composing their email.  The application will provide continuous feedback to the user by displaying words within a brief period of the user uttering them.  The application continues listening and updating the screen until the user is done.  Sophisticated applications will also listen for command words used to add formatting, perform edits, or correct errors.  (See the sketch after these examples.)
2. Some form filling applications will also continue to listen as the user utters new information for the form.  For example "schedule a meeting with Michael on Friday at 8 AM... uh... 9:30 AM, for half an hour, at the donut shop... add Rick and Daphne too" could result in the form being filled as the user speaks, providing a tighter user experience.
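
A rough sketch of the dictation case, assuming a hypothetical recognizer that raises interim and final results as the user speaks:

    // Hypothetical sketch of continuous dictation; the ContinuousRecognizer
    // interface and its event names are illustrative, not a defined API.
    interface InterimResult {
      text: string;
      isFinal: boolean;
    }

    interface ContinuousRecognizer {
      onresult: (r: InterimResult) => void;
      start(): void;
      stop(): void;
    }

    function startDictation(reco: ContinuousRecognizer, output: HTMLElement): void {
      let committed = "";
      reco.onresult = (r) => {
        if (r.isFinal) {
          committed += r.text + " "; // the recognizer won't revise this chunk further
        }
        // Show committed text plus the current (still revisable) hypothesis,
        // so words appear within a brief period of being spoken.
        output.textContent = committed + (r.isFinal ? "" : r.text);
      };
      reco.start(); // keeps listening until the app calls stop()
    }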


+ VOICE ACTIVITY DETECTION

Automatic detection of speech/non-speech boundaries is needed for a number of valuable user experiences:

1. Press-once to talk.  The user manually interacts with the app to indicate that the app should start listening.  For example, they raise the device to their ear, press a button on the keypad, or touch a part of the screen.  When they're done talking, the app automatically performs the speech recognition without the user needing to touch the device again.  This is already a common type of UX for smartphone applications, and is easier to use than press-and-talk (like a walkie-talkie), or having to press a second time when done talking (which isn't as precise, and can either truncate the utterance, or include superfluous noise).

2. Hands-free dialogue, where the user can start and stop talking without any manual input to indicate when the application should be listening.  The application and/or browser needs to automatically detect when the user has started talking, so it can initiate SR.  This is particularly useful for in-car or 10-foot usage (e.g. living room).

3. Mixed initiative.  Some applications will prompt the user with clarifying questions.  Ideally the user can respond verbally, without any need for manual interaction.

For networked SR, VAD also helps prevent unnecessary transmission of silence or irrelevant background audio.
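
For the press-once-to-talk case above, here is a rough sketch, assuming hypothetical speech-start and speech-end events driven by VAD:

    // Hypothetical press-once-to-talk flow: one tap opens the mic, and the
    // platform's VAD decides when the utterance has ended. The Recognizer
    // interface and event names are illustrative.
    interface Recognizer {
      onspeechstart: () => void;        // VAD detected the user began talking
      onspeechend: () => void;          // VAD detected trailing silence
      onresult: (text: string) => void;
      startListening(): void;
      stopListening(): void;
    }

    function wirePressOnceToTalk(button: HTMLButtonElement, reco: Recognizer): void {
      button.addEventListener("click", () => {
        reco.startListening();                 // single user gesture opens the mic
        button.textContent = "Listening...";
      });
      reco.onspeechstart = () => { button.textContent = "Speaking..."; };
      reco.onspeechend = () => {
        reco.stopListening();                  // no second tap needed; VAD ends capture
        button.textContent = "Processing...";
      };
      reco.onresult = (text) => { button.textContent = text || "Tap to talk"; };
    }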


+ USER PERCEIVED LATENCY

The time between the user completing their utterance and the application providing a response needs to fall below an acceptable threshold to be usable.  For example, "find a flight from New York to San Francisco on Monday morning returning Friday afternoon" takes about 6 seconds to say, but the user still expects a response within a couple of seconds (generally somewhere between 500 and 3000 milliseconds, depending on the specific application and audience).  The design and tuning of an application has some influence over this, but the underlying platform is also responsible for a large portion.  In the case of applications/browsers that invoke speech recognition over a network, the platform needs to support (i) using a codec that can be transmitted in real time on the modest bandwidth of many cell networks, and (ii) transmitting the user's utterance in real time (e.g. in 100ms packets) rather than collecting the full utterance before transmitting any of it.  For applications that use short utterances of one or two command words on fast networks, this may not be critical.  But for apps where the utterances are non-trivial and the grammars can be recognized in real time or better, real-time streaming can all but eliminate user-perceived latency.
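
As an illustration of real-time streaming using the browser's getUserMedia, MediaRecorder, and WebSocket APIs (where available), here's a sketch that sends roughly 100ms chunks as they're captured rather than buffering the whole utterance.  The service endpoint and the Opus codec choice are assumptions, and codec support varies by browser:

    // Illustrative streaming capture: each ~100ms chunk is sent while the
    // user is still talking, instead of waiting for the full utterance.
    async function streamUtterance(serviceUrl: string): Promise<void> {
      const socket = new WebSocket(serviceUrl);
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });

      recorder.ondataavailable = (e) => {
        if (e.data.size > 0 && socket.readyState === WebSocket.OPEN) {
          socket.send(e.data); // transmit the chunk immediately
        }
      };

      socket.addEventListener("open", () => recorder.start(100)); // ~100ms timeslices
    }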


+ SPEECH RECOGNITION GUI

Multimodal speech recognition apps are typically accompanied by a GUI experience to (i) provide a means to invoke SR; and (ii) indicate the progress of recognition through various states (listening to the user speak; waiting for the recognition result; displaying errors; displaying alternates; etc.).

There are probably two general cases:

1. Polished applications generally have their own GUI design for the speech experience.  This will usually include a clickable graphic to invoke speech recognition, and graphics to indicate the progress of the recognition through various states.

2. Many applications, at least in their initial development, and in some cases the finished product, will not implement their own GUI for controlling speech recognition.  These applications will rely on the browser to implement a default control to begin speech recognition, such as a GUI button on the screen or a physical button on the device, keyboard or microphone.  They will also rely on a default GUI to indicate the state of recognition (listening, waiting, error, etc).


+ EXPRESSION & ANALYSIS OF RECOGNITION RESULTS

For any given utterance, a speech recognizer will generally produce a range of different hypotheses with varying degrees of confidence, and suggest the most likely one to the application.  Applications try to avoid presenting junk to the user by enforcing a minimum confidence threshold, and by providing any alternative hypotheses that exceed the threshold in case the best guess is wrong.  To do this, applications will need an easy way to filter by confidence threshold, and to receive a list of the N best results (N will vary from one app to another).

Sophisticated applications may need access to more detailed information such as the confidence and alternatives for individual words or phrases.  This would be useful for multi-slot form-filling or dictation.
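
A minimal sketch of confidence filtering and N-best handling; the result shape is hypothetical, and the threshold and N would vary from one application to another:

    // Illustrative N-best handling; the Hypothesis shape is not a defined API.
    interface Hypothesis {
      text: string;
      confidence: number; // 0.0 - 1.0
    }

    function selectCandidates(
      nbest: Hypothesis[],
      minConfidence = 0.5,
      maxAlternates = 3,
    ): { best?: Hypothesis; alternates: Hypothesis[] } {
      const accepted = nbest
        .filter((h) => h.confidence >= minConfidence)   // drop hypotheses below the threshold
        .sort((a, b) => b.confidence - a.confidence)
        .slice(0, maxAlternates);
      // Offer the top hypothesis, plus alternates in case the best guess is wrong.
      return { best: accepted[0], alternates: accepted.slice(1) };
    }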

+ DOMAIN SPECIFIC GRAMMARS

A web developer may want a recognition against one or more author-specified grammars.  An application may, for instance, have a large grammar covering the domain in question (all North American professional sports teams in MLB, NFL, NBA, MLS) but also have a user-specified personal grammar covering other possible values (local high school sports teams), and may want the same input element to recognize against both of these grammars.
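
A rough sketch of how a recognition request might reference several weighted grammars; the request shape, weights, and URIs are all illustrative:

    // Hypothetical sketch: one input element recognized against several
    // author-specified grammars at once.
    interface GrammarRef {
      src: string;    // e.g. an SRGS document hosted by the application
      weight: number;
    }

    const sportsGrammars: GrammarRef[] = [
      // Broad domain grammar: all North American professional teams.
      { src: "https://example.com/grammars/pro-sports-teams.grxml", weight: 1.0 },
      // User-specific grammar: local high school teams this user follows.
      { src: "https://example.com/grammars/user-local-teams.grxml", weight: 0.5 },
    ];

    // A recognition request would carry both grammars, so the same field can
    // accept values from either source.
    console.log(JSON.stringify({ grammars: sportsGrammars }, null, 2));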


+ WEB AUTHOR ABLE TO WRITE ONCE PER APPLICATION
(not once per user agent per application, nor once per user agent per device type per application)

High-quality speech applications often involve a lot of tuning of recognition parameters and grammars to work well with a particular recognition technology.  A web author may wish to tune her application's speech recognition against a single technology stack, rather than tune and special-case grammars and parameters for each user agent.  There is already enough browser detection in the web development world to deal with accidental incompatibilities and legacy implementations, without speech requiring it by design just to achieve quality recognition.


*** TTS/MEDIA SCENARIOS ***


+ SYNCHRONIZED SYNTHESIS

When speech synthesis is used in a multimodal application, it often needs to be synchronized with the display.  For example:
1. When reading a list of items, each item is selected on the screen as it's read.
2. When reading text, each phrase is highlighted as it's read (see the sketch below).
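
A sketch of the highlighting case, assuming a hypothetical synthesizer that raises a callback for each SSML-style mark as it's reached:

    // Hypothetical sketch of display synchronized with synthesis; the
    // Synthesizer interface and "onmark" event are illustrative (conceptually
    // similar to SSML <mark> callbacks).
    interface Synthesizer {
      onmark: (markName: string) => void;
      speakSsml(ssml: string): void;
    }

    function readListWithHighlight(tts: Synthesizer, items: HTMLElement[]): void {
      // Give each list item a mark so the display can follow the audio.
      const ssml =
        "<speak>" +
        items.map((el, i) => `<mark name="item-${i}"/>${el.textContent}`).join(" ") +
        "</speak>";

      tts.onmark = (markName) => {
        const index = Number(markName.replace("item-", ""));
        items.forEach((el, i) => el.classList.toggle("highlighted", i === index));
      };

      tts.speakSsml(ssml);
    }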


+ BARGE-IN

Applications need the ability to stop output (text-to-speech or media) in response to events (the user starting to speak, a recognition occurring, other selections or browser interactions, etc.) so that the user experience is acceptable and the web application doesn't appear confused or deaf to user input.
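
A minimal sketch of the wiring, with both interfaces hypothetical:

    // Illustrative barge-in: stop output as soon as incoming speech is detected.
    interface Player {
      stop(): void; // halts TTS or media playback immediately
    }

    interface Listener {
      onspeechstart: () => void;
    }

    function enableBargeIn(player: Player, listener: Listener): void {
      listener.onspeechstart = () => {
        // The user has started talking; cut the prompt so the app
        // doesn't appear deaf to the interruption.
        player.stop();
      };
    }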


+ FALSE BARGE-IN

"Barge-in" aids the usability of an application by allowing the user provide spoken input even while the application is playing media/TTS.  However, applications that both speak (or play media) and listen at the same time can potentially interfere with their own speech recognition.  In telephony, this is less of a problem due to the design of the handset, and built-in echo-cancelling technology.  However, with broad variety of HTML-capable devices, situations that involve open-mic and open-speaker will be potentially more common.   To help developers cope with this, it may be useful to either specify a minimum barge-in capability that all browsers should meet, or make it easier for developers to discover when barge-in may be an issue.


+ PLAYBACK CONTROL

When listening to a synthesized passage, some apps will need to be able to replay the last sentence or paragraph; or skip backwards or forwards.  This isn't simply a tape-player analogy, since the user's navigational reference is by discrete chunks of spoken language, not by the abstract passage of time.
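
A rough sketch of chunk-based navigation, assuming a hypothetical synthesizer interface; the point is that the user moves by sentence, not by seconds on a timeline:

    // Hypothetical sketch of playback control over discrete chunks of speech.
    interface ChunkedSynthesizer {
      speak(text: string): void;
      cancel(): void;
    }

    class PassageReader {
      private index = 0;
      constructor(private tts: ChunkedSynthesizer, private sentences: string[]) {}

      playCurrent(): void {
        this.tts.cancel();                 // stop whatever is currently being spoken
        this.tts.speak(this.sentences[this.index]);
      }
      replay(): void { this.playCurrent(); }   // "say that again"
      back(): void {
        this.index = Math.max(0, this.index - 1);
        this.playCurrent();
      }
      forward(): void {
        this.index = Math.min(this.sentences.length - 1, this.index + 1);
        this.playCurrent();
      }
    }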


*** SERVICE & TECHNOLOGY CONCERNS ***


+ USER PRIVACY & SECURITY

The potential for any HTML page to listen to the user and extract meaning from the user's utterances raises privacy and security concerns that will need to be addressed:

1. Many users are sensitive about who or what is listening to them, and will not tolerate an application that listens to the user without the user's knowledge.  A browser needs to provide clear indication to the user whenever it is using a microphone to listen to the user.

2. Some users will want to explicitly grant permission for the browser, or an application, to listen to them.  Whether this is a setting that is global, applies to a subset of applications/domains, etc, depends somewhat on the security & privacy expectations of the browser's customers.

3. The user also needs to be able to trust and verify that their utterance is processed by the application that's on the screen (or its backend servers), or at least by a service the user trusts.

In some countries, these users' expectations will be reinforced by legal rights.


+ OPEN AVAILABILITY OF SERVICES

There is a wide range of speech technology available, and innovation continues rapidly.  No single technology vendor provides technology that suits all scenarios for all applications, and vendors will continue to invent novel and useful technology for new scenarios.  Furthermore, even vendors with overlapping capabilities will have sufficiently different performance characteristics that any particular application will be tuned to perform best with a specific vendor's technology.  For an application to work the same across a range of browsers and devices, it needs to have access to the same backing speech services it was tailored for.  Thus any browser should be able to work with any vendor's speech services, provided the service meets specific open protocol requirements.


+ STANDARDIZED INPUT & OUTPUT FORMATS & CONVENTIONS

No developer likes to be locked into a particular vendor's implementation.  In some cases this will be unavoidable due to differentiation in capabilities between vendors.  But general concepts like grammars, TTS and media composition, and recognition results should use standard formats (e.g. SRGS, SSML, SMIL, EMMA).


+ SPECIFICATION OF SERVICE CAPABILITIES & HINTS

The speech web services provided by a vendor may have a wide range of capabilities.  For example (a sketch combining several of these appears after the list):

1. An application will use particular language models.  This could be a specific set of context-free grammars (such as SRGS documents hosted on the application's web servers) and their corresponding weights, or a predefined model provided by the service, such as "web-search", "message-dictation", etc.
2.  Some applications may also specify other requirements, such as the language spoken by the user, or the desired acoustic model.
3. Some applications will provide contextual information that narrows the scope of the recognition and improves the likelihood of a relevant result (such as GPS location, the names of people the user knows, which page of an app the user is on, the user's last N actions, etc.).
4. Some users will have specific profile information that improves their overall speech recognition experience, such as personalized lexicons, acoustic model adaptation, etc.
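
Here's a sketch of how such capabilities and hints might be combined in a single recognition request.  Every field name below is hypothetical, not a defined protocol:

    // Illustrative request payload combining grammars, language, and hints.
    interface RecognitionRequest {
      grammars?: { src: string; weight?: number }[];      // application-hosted SRGS documents
      builtinModel?: "web-search" | "message-dictation";  // predefined service models
      language?: string;                                  // e.g. "en-US"
      context?: {
        location?: { lat: number; lon: number };          // e.g. GPS position
        contacts?: string[];                              // names the user is likely to say
        recentActions?: string[];
      };
      userProfileId?: string; // keys personalized lexicons / adapted acoustic models
    }

    const request: RecognitionRequest = {
      grammars: [{ src: "https://example.com/grammars/scheduling.grxml", weight: 1.0 }],
      language: "en-US",
      context: {
        contacts: ["Michael", "Rick", "Daphne"],
        recentActions: ["opened-calendar"],
      },
      userProfileId: "user-12345",
    };

    console.log(JSON.stringify(request, null, 2));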
