Incorporating multimodal inputs in speech recognition

Hi all,

I am unable to determine definitively whether it is possible to incorporate
multimodal inputs into the speech recognition pipeline.

I've read through the Web Speech API specification at [0], and it seems
that an application can see interim results but has no capacity to affect
the internal model that produces them.
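
To illustrate, here is a minimal TypeScript sketch of what the current
specification allows (using the prefixed constructor Chrome ships, with
event typings loosened to `any` for brevity): we can observe each
hypothesis and its confidence, but nothing flows back into the recogniser.

    const recognition = new (window as any).webkitSpeechRecognition();
    recognition.interimResults = true; // deliver hypotheses as they form
    recognition.maxAlternatives = 5;   // request an n-best list

    recognition.onresult = (event: any) => {
      for (let i = event.resultIndex; i < event.results.length; i++) {
        const result = event.results[i];
        for (let j = 0; j < result.length; j++) {
          // Each alternative and its confidence is readable...
          console.log(result.isFinal ? 'final' : 'interim',
                      result[j].transcript, result[j].confidence);
        }
        // ...but there is no way to feed a score back, so external
        // (e.g. gesture) evidence cannot bias the next hypothesis.
      }
    };
    recognition.start();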

As outlined in the paper *Beyond Attention: The Role of Deictic Gesture in
Intention Recognition in Multimodal Conversational Interfaces* by Shaolin
Qu and Joyce Y. Chai from MSU [1], gesture information can be used "in two
different processing stages: speech recognition stage and language
understanding stage", and doing so "improves intention recognition".

I don't believe the Web Speech API should necessarily incorporate gesture
recognition itself; however, at the moment it seems impossible for an
application to do this at all.

It would be great if, when the list of candidate results is generated
(from which the interim and final results must be selected), an event were
fired that allowed the application to adjust that list programmatically.
The main concern would be just how the internal model is exposed, and what
capacity the application would have to adjust it. The linked paper
identifies a couple of points at which gesture information can be
incorporated, so that might be a good place to start; a sketch of the kind
of hook I have in mind follows.
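
To be clear, everything below is hypothetical: there is no 'candidates'
event in the specification, the mutable `weight` field is invented, and
gestureEvidence() stands in for whatever multimodal signal the application
has (e.g. a deictic gesture towards an on-screen object, as in Qu and
Chai).

    // Continuing with the `recognition` object from the earlier sketch.
    recognition.oncandidates = (event: any) => {
      for (const candidate of event.candidates) {
        // Boost hypotheses consistent with what the user is pointing at.
        candidate.weight *= 1 + gestureEvidence(candidate.transcript);
      }
      // The recogniser would then select interim/final results from the
      // re-weighted list rather than from acoustic/language scores alone.
    };

    function gestureEvidence(transcript: string): number {
      // Application-specific: return a score in [0, 1] reflecting how
      // well the transcript matches the object being gestured at.
      return 0;
    }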

Apart from gesture information, other signals could be used to assist
speech recognition. For example, the current state of the application
could be used to restrict the expected intent of the user. It might be
possible, and more appropriate, to achieve this by adjusting the available
grammars; I don't have enough experience to say for sure.
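
If grammars are the right mechanism, something like the following might
already work with the SpeechGrammarList interface in the spec (the
'checkout' state and its command set are made up, and I can't confirm how
faithfully engines honour grammar weights):

    // Swap in a JSGF grammar covering only the commands that make sense
    // in the application's current state (here, a hypothetical checkout
    // screen). Set before calling recognition.start().
    const grammars = new (window as any).webkitSpeechGrammarList();
    grammars.addFromString(
      '#JSGF V1.0; grammar checkout; public <command> = pay | cancel | back ;',
      1.0); // weight in [0, 1]
    recognition.grammars = grammars;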

If there is already an appropriate way to incorporate multimodal (or any
other external) information into the speech recognition pipeline, I would
love to hear about it. If an interface could be provided for programmatic
'tweaking' of the internal recognition model, that would be great too.

Thanks for your work so far; the Web Speech API is great, and I look
forward to seeing it used in the wild.

Regards,

Andrew Ardill

[0] https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html
[1] http://www.cse.msu.edu/~jchai/Papers/IUI08-Intent.pdf
