- From: Deborah Dahl <dahl@conversational-technologies.com>
- Date: Thu, 9 Sep 2010 16:32:49 -0400
- To: <public-xg-htmlspeech@w3.org>
Here are some ideas I had about use cases for consideration in the XG.

A. Use cases that motivate a requirement to allow user customization of speech recognition (recognizer and parameters).

Use case 1: user-selected recognizer

a. The user has a speech disability or is not a native speaker of the expected language of the browser's recognizer, and consequently a speaker-independent recognizer does not work well. They have a local speaker-dependent recognizer that they would like to use with speech-enabled web applications.

b. The browser's set of available languages does not include the user's preferred language, or the browser's recognizer for the user's preferred language does not work very well. The user would like to select another recognizer (local or server-based) that they know works well.

Use case 2: user-controlled speech parameters

The user has difficulty speaking quickly enough for the existing timeouts because of a speech, reading, or cognitive disability and would like to lengthen the speech timeout. For example, I've heard anecdotally that speech timeouts are extremely stressful for people who stutter and actually make their stuttering worse.

B. Use case that motivates a requirement to make it easy to integrate input from different modalities.

Use case: The user is using a mobile friend-finding application and says, "Is Mary Smith anywhere around here?" To answer this question the application should combine information from geolocation (to understand "here"), speech recognition, and potentially even speaker verification, to ensure that Mary Smith has actually authorized the user to know where she is. New modalities are continually becoming available, so it would be difficult to provide for integration on a case-by-case basis.

C. Use case that motivates a requirement to allow an author to specify an application-specific statistical language model (SLM).

Use case: The user is looking at a customer service/support website and asks, "There's a red flashing light on the front of my printer and the printing is very faint. I think the model is XY 123 or something." This kind of SLM-type utterance would be difficult to support with a grammar, but a general dictation model would not be able to supply application-specific information like "model: XY 123, quality: faint, front-panel-light: red", which you could get from an SLM with embedded grammars. The author should be able to specify an SLM to be used for this page. This would probably also require allowing the author to specify a recognizer, because there is no SLM standard. I realize that this is in conflict with item A above, because the user's recognizer preference may differ from the author's, but I think this is worth discussing.

D. Use cases that motivate a requirement to always make the use of speech optional.

Use case 1: the user can't speak, can't speak the language well enough to be recognized, the speech recognizer just doesn't work well for them, or they prefer typing.

Use case 2: the user is in an environment where speech is inappropriate, like a meeting, or they want to communicate something private, or it's just noisy.

E. Use case that motivates a requirement that the standard support completely hands-free operation. This would mean that there should be a way to speech-enable everything that you would do with a mouse, a touchscreen, or by typing.

Use case: the user doesn't have the use of their hands, either temporarily or permanently, or using their hands is difficult. For example, someone is repairing a machine; their hands are holding tools and are dirty, but they want to browse an HTML manual for the machine. I realize there are a lot of difficulties in completely hands-free operation, but I wanted to put it out for discussion. It would be good to explore how close we can come.

F. Use case that motivates a requirement to make the standard easy to extend. If recognizers support new capabilities like language detection or gender detection, it should be easy to add the results of those capabilities to the speech recognition result without requiring a new version of the standard.

Use case: The user opens an English shopping website and says "Buenos días". The recognizer uses language detection to determine that the person is speaking Spanish, this information is sent back to the server, and the user is switched to a Spanish version of the website. There should be an easy way for the recognizer to convey in its result that the user is speaking Spanish (a hypothetical result format is sketched below). This use case also brings up another possible requirement: it should be possible to listen for any of several languages in the same input.
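To make F a little more concrete, here is a minimal sketch of what an extensible recognition result could look like, written as TypeScript-style interfaces. These names and shapes are purely illustrative assumptions on my part, not part of any existing standard or recognizer API; the point is only that a recognizer should be able to attach new kinds of information (here, detected language) without a new version of the result format.

```typescript
// Illustrative only: hypothetical shape for an extensible recognition result.
// None of these names come from an existing standard; they sketch the idea that
// a recognizer can attach new capability results (language detection, gender
// detection, speaker verification, ...) without changing the core format.

interface RecognitionAlternative {
  transcript: string;                  // recognized text, e.g. "Buenos días"
  confidence: number;                  // 0..1
  semantics?: Record<string, string>;  // e.g. slots from an SLM with embedded grammars (use case C)
}

interface ExtensibleRecognitionResult {
  alternatives: RecognitionAlternative[];
  // Open-ended extension area: recognizers that support extra capabilities add
  // entries here; applications that don't recognize a key simply ignore it.
  extensions?: Record<string, unknown>;
}

// Example result for the shopping-site use case in F:
const result: ExtensibleRecognitionResult = {
  alternatives: [{ transcript: "Buenos días", confidence: 0.92 }],
  extensions: {
    "detected-language": "es",     // language-detection outcome
    "language-confidence": 0.97,
  },
};

// The page script could forward the detected language to the server so the
// user can be switched to the Spanish version of the site.
console.log(result.extensions?.["detected-language"]); // "es"
```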