- From: Andrew Ardill <andrew.ardill@gmail.com>
- Date: Thu, 17 Jan 2013 10:59:18 +1100
- To: public-speech-api@w3.org
- Message-ID: <CAH5451kGX5yO2sCrC9YVC4BwJGk-m2oeMwNTcNtJKYv++LAdYg@mail.gmail.com>
Hi all,

I am unable to determine definitively whether it is possible to incorporate multimodal inputs into the speech recognition pipeline. I've read through the Web Speech API Specification document at [0], and it seems as though one is able to see interim results, but that there is no capacity to affect the internal model that produces those interim results.

As outlined in the paper *Beyond Attention: The Role of Deictic Gesture in Intention Recognition in Multimodal Conversational Interfaces* by Shaolin Qu and Joyce Y. Chai from MSU [1], it is possible to use gesture information "in two different processing stages: speech recognition stage and language understanding stage", and doing so "improves intention recognition".

I don't believe that the Web Speech API should necessarily incorporate gesture recognition itself; however, at the moment it seems impossible for an application to feed such information into recognition at all. It would be great if, when a list of candidate results is generated (from which the interim and final results must be selected), an event were triggered that allowed the application to adjust this list programmatically (a sketch of what an application can do today, purely after the fact, follows below the references). The main concern would be just how the internal model is exposed, and what capacity the application would have to adjust it. The linked paper outlines a couple of points where the authors incorporate gesture information, so that might be a good place to start.

Apart from gesture information, it would be possible to use other information to assist speech recognition. For example, the current state of the application could be used to restrict the expected intent of the user. This might be possible, and more appropriate, to achieve by adjusting the available grammars (a second sketch below touches on this); I don't have enough experience to say for sure.

If there is an appropriate way to incorporate multimodal (or any other external) information into the speech recognition pipeline, I would love to hear about it. If an interface could be provided for programmatic 'tweaking' of the internal recognition model, that would be great too.

Thanks for your work so far; the Web Speech API is great and I look forward to seeing it used in the wild.

Regards,

Andrew Ardill

[0] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
[1] http://www.cse.msu.edu/~jchai/Papers/IUI08-Intent.pdf
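To make the gap concrete, here is a minimal sketch of what an application can do today: re-scoring the n-best list after the recognizer has already produced it. It assumes the prefixed Chrome implementation, and the `gestureWeight()` function is hypothetical, standing in for whatever score an application derives from deictic gesture or other context. Everything here happens too late to influence the recognizer's internal model, which is exactly the limitation described above.

```typescript
// Sketch: collect the recognizer's alternatives and re-rank them with an
// application-supplied, gesture-derived weight (post hoc only).
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionCtor();
recognition.interimResults = true; // deliver interim results as they arrive
recognition.maxAlternatives = 10;  // request an n-best list, not just the top hypothesis

// Hypothetical: weight the application assigns to a transcript given the
// current gesture state (e.g. boost deictic phrases while the user points).
function gestureWeight(transcript: string): number {
  return /\b(this|that|here|there)\b/i.test(transcript) ? 1.5 : 1.0;
}

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];

    // The alternatives the recognizer has already committed to.
    const alternatives: any[] = [];
    for (let j = 0; j < result.length; j++) {
      alternatives.push(result[j]);
    }

    // Re-rank by combining recognizer confidence with the gesture weight.
    alternatives.sort((a, b) =>
      b.confidence * gestureWeight(b.transcript) -
      a.confidence * gestureWeight(a.transcript));

    console.log(result.isFinal ? 'final:' : 'interim:', alternatives[0].transcript);
  }
};

recognition.start();
```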
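For the application-state case, the spec does already expose a grammars attribute, so a rough sketch (continuing the snippet above; the JSGF string and the 0.8 weight are purely illustrative) might look like:

```typescript
// Sketch: bias recognition toward commands that make sense in the
// application's current state by supplying a weighted grammar.
const SpeechGrammarListCtor =
  (window as any).SpeechGrammarList || (window as any).webkitSpeechGrammarList;

const grammars = new SpeechGrammarListCtor();
const editingCommands =
  '#JSGF V1.0; grammar commands; public <command> = undo | redo | delete | select all ;';
grammars.addFromString(editingCommands, 0.8); // optional weight in [0.0, 1.0]

recognition.grammars = grammars; // reuses the recognition object from the sketch above
```

Even so, a weighted grammar only biases what the recognizer listens for; it still offers no hook into the candidate list itself, which is the interface I am asking about.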
Received on Thursday, 17 January 2013 03:24:25 UTC