RE: [HTML Speech] Let's get started!

Here are some ideas I had about use cases for consideration in the XG.

A. Use cases that motivate a requirement to allow user customization of
speech recognition (recognizer and parameters).
Use case 1: user-selected recognizer
a. User has a speech disability, or is not a native speaker of the expected
language of the browser's recognizer, and consequently a speaker-independent
recognizer does not work well for them. They have a local speaker-dependent
recognizer that they would like to use with speech-enabled web applications.

b. The browser's set of available languages does not include the user's
preferred language, or the browser's recognizer for the user's preferred
language does not work very well. The user would like to select another
recognizer (local or server-based) that they know works well.

Use case 2: user-controlled speech parameters
User has difficulty speaking quickly enough for the existing timeouts
because of a speech, reading, or cognitive disability and would like to
lengthen the speech timeout. For example, I've heard anecdotally that
speech timeouts are extremely stressful for people who stutter and actually
make their stuttering worse.
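
To make the customization idea concrete, here is a rough sketch of what a
script-level hook might look like. Every name here (SpeechInputSettings,
recognizerURI, endSilenceTimeoutMs, and so on) is invented purely for
illustration; it is not a proposal for actual syntax.

  // Illustrative only: the page supplies defaults, but the user agent (or
  // the user, through browser settings) may override both the recognizer
  // and the timing parameters before recognition starts.
  interface SpeechInputSettings {
    recognizerURI?: string;       // e.g. a local speaker-dependent engine
    language?: string;            // e.g. "pt-BR"
    endSilenceTimeoutMs?: number; // how long to wait after the user stops
    maxSpeechTimeoutMs?: number;  // how long the user may keep speaking
  }

  // Author-supplied defaults for this field...
  const authorDefaults: SpeechInputSettings = {
    language: "en-US",
    endSilenceTimeoutMs: 1000,
  };

  // ...merged with user preferences, which win on conflict, so that a
  // person who stutters can lengthen the timeouts for every site they use.
  function effectiveSettings(userPrefs: SpeechInputSettings): SpeechInputSettings {
    return { ...authorDefaults, ...userPrefs };
  }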

B. Use case that motivates a requirement to make it easy to integrate input
from different modalities.
Use case: User is using a mobile friend-finding application and says, "Is
Mary Smith anywhere around here?" To answer this question, the application
should combine information from geolocation (to understand "here"), speech
recognition, and potentially even speaker verification information to
ensure that Mary Smith has actually authorized the user to know where she
is. New modalities are continually becoming available, so it would be
difficult to provide for integration on a case-by-case basis.
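
As a rough illustration of the kind of combination I mean, assuming the
recognizer can hand back a structured result: the result shape and the
findFriend function below are made up for the example; only the Geolocation
API call is real.

  // Hypothetical structured result for "Is Mary Smith anywhere around here?"
  interface FriendQueryResult {
    intent: "locate-friend";
    friendName: string;     // "Mary Smith"
    placeReference: "here"; // needs another modality to resolve
  }

  function handleResult(result: FriendQueryResult): void {
    // Resolve "here" with the Geolocation API, then combine the two
    // modalities into a single query to the application's own service,
    // which would also check that Mary Smith has authorized this user
    // (possibly backed by speaker verification).
    navigator.geolocation.getCurrentPosition((position) => {
      findFriend(result.friendName,
                 position.coords.latitude,
                 position.coords.longitude);
    });
  }

  // Application-specific back-end call, left abstract here.
  declare function findFriend(name: string, lat: number, lon: number): void;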

C. Use case that motivates a requirement to allow an author to specify an
application-specific statistical language model (SLM).
Use case: The user is looking at a customer service/support website and asks,
"There's a red flashing light on the front of my printer and the printing is
very faint. I think the model is XY 123 or something." This kind of SLM-type
utterance would be difficult to support with a grammar, but a general
dictation model would not be able to supply application-specific information
like "model: XY 123, quality: faint, front-panel-light: red", which you could
get from an SLM with embedded grammars. The author should be able to specify
an SLM to be used for this page. This would probably also require allowing
the author to specify a recognizer, because there is no SLM standard.
I realize that this is in conflict with item A above because the user's
recognizer preference may be different from the author's preference, but I
think this is worth discussing.
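
To sketch what the author might write and what the recognizer might hand
back (every name below is invented, and there is no standard SLM format,
which is exactly the problem):

  // Hypothetical: the author points the recognizer at an application-specific
  // SLM with embedded grammars for model numbers, colors, and so on. A
  // recognizer has to be named too, because SLM formats are vendor-specific.
  const request = {
    languageModelURI: "https://support.example.com/printer-troubleshooting.slm",
    recognizerURI: "https://asr.example.com/recognize",
  };

  // Hypothetical semantic interpretation of "There's a red flashing light on
  // the front of my printer and the printing is very faint. I think the
  // model is XY 123 or something."
  const interpretation = {
    transcript: "there's a red flashing light on the front of my printer ...",
    slots: {
      model: "XY 123",
      quality: "faint",
      "front-panel-light": "red",
    },
  };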

D. Use cases that motivate a requirement to always make the use of speech
optional.
Use case 1: the user can't speak, can't speak the language well enough to be
recognized, the speech recognizer just doesn't work well for them, or they
like typing. 
Use case 2: the user is in an environment where speech is inappropriate,
like a meeting, or they want to communicate something private, or it's just
noisy.

E. Use case that motivates a requirement for the standard to support
completely hands-free operation.
This would mean that there should be a way to speech-enable everything that
you would do with a mouse, a touchscreen, or by typing. 
Use case: the user doesn't have the use of their hands, either temporarily
or permanently, or using their hands is difficult. For example, someone is
repairing a machine; their hands are holding tools and are dirty, but they
want to browse an HTML manual for the machine.
I realize there are a lot of difficulties in completely hands-free
operation, but I wanted to put it out for discussion. It would be good to
explore how close we can come. 

F. Use case that motivates a requirement to make the standard easy to
extend.
If recognizers support new capabilities like language detection or gender
detection, it should be easy to add the results of those new capabilities to
the speech recognition result, without requiring a new version of the
standard.
Use case: The user opens an English shopping website and says, "Buenos días."
The recognizer uses language detection to determine that the person is
speaking Spanish; this information is sent back to the server, and the user
is switched to a Spanish version of the website. There should be an easy way
for the recognizer to convey in its result the fact that the user is speaking
Spanish. Actually, this use case also brings up another possible requirement:
it should be possible to listen for any of several languages in the same
input.
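
One way to picture the extensibility requirement (again, the field names are
made up): the standard defines a fixed core result, plus an open-ended
extension area that a recognizer can populate with things like detected
language without waiting for a new version of the standard.

  // Core fields the standard would define...
  interface RecognitionResult {
    transcript: string;
    confidence: number;
    // ...plus engine-specific additions the standard does not enumerate.
    extensions?: { [name: string]: unknown };
  }

  // Hypothetical result for a user who answered "Buenos días":
  const result: RecognitionResult = {
    transcript: "buenos días",
    confidence: 0.82,
    extensions: {
      "detected-language": "es", // from the recognizer's language detection
    },
  };

  if (result.extensions?.["detected-language"] === "es") {
    window.location.assign("/es/"); // switch to the Spanish version of the site
  }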

Received on Thursday, 9 September 2010 20:33:26 UTC