- From: Peter Beverloo <beverloo@google.com>
- Date: Wed, 13 Jun 2012 15:49:56 +0100
- To: public-speech-api@w3.org
- Message-ID: <CALt3x6nPDnDni5A-btan38acX1i6T=e_y-VxDgeWh=_6MdVcOg@mail.gmail.com>
Currently, the SpeechRecognition[1] interface defines three methods to start, stop or abort speech recognition, the source of which will be an audio input device as controlled by the user agent. Similarly, the TextToSpeech (TTS) interface defines play, pause and stop, which will output the generated speech to an output device, again as controlled by the user agent.

There are various other media and interaction APIs in development right now, and I believe it would be good for the Speech API to integrate more tightly with them. In this e-mail, I'd like to focus on some additional features for integration with WebRTC and the Web Audio API.

** WebRTC
<http://dev.w3.org/2011/webrtc/editor/webrtc.html>

WebRTC provides the ability to interact with the user's microphone and camera through the getUserMedia() method. As such, an important use-case is audio (and video) chatting between two or more people. Audio is available through a MediaStream object, which can be re-used to power, for example, an <audio> element, or transmitted to other people through a peer-to-peer connection. It can also feed into the Web Audio API through an AudioContext's createMediaStreamSource() method.

** Web Audio API
<https://dvcs.w3.org/hg/audio/raw-file/tip/webaudio/specification.html>

The Web Audio API provides the ability to process, analyze, synthesize and modify audio through JavaScript. It can get its input from media files through XMLHttpRequest, from media elements such as <audio> and <video>, and from any other system, including WebRTC, that is able to provide an audio-based MediaStream.

Since speech recognition and synthesis do not have to be limited to live input from and output to the user, I'd like to present two new use-cases.

1) Transcripts for (live) communication.

While the specification does not mandate a maximum duration of a speech input stream, this suggestion is most appropriate for implementations utilizing a local recognizer.
Allowing MediaStreams to be used as an input for a SpeechRecognition object, for example through a new "inputStream" property as an alternative to the start, stop and abort methods, would enable authors to supply external input to be recognized. This may include, but is not limited to, prerecorded audio files and WebRTC live streams, both from local and remote parties.

2) Storing and processing text-to-speech fragments.

Rather than mandating immediate output of the synthesized audio stream, we should consider introducing an "outputStream" property on a TextToSpeech object which provides a MediaStream object. This would allow the synthesized stream to be played through the <audio> element, processed through the Web Audio API, or even stored locally for caching, in case the user is using a device which is not always connected to the internet (and when no local synthesizer is available). Furthermore, this would allow websites to store the synthesized audio in a wave file and save it on the server, allowing it to be re-used for user agents or other clients which do not provide an implementation.

The Web platform gains its power from the ability to combine technologies, and I think it would be great to see the Speech API playing a role in that.

Best,
Peter

[1] http://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#speechreco-section
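
To make the two suggested properties more concrete, here is a rough TypeScript sketch of how "inputStream" and "outputStream" might look to an author. Everything in it is an assumption for illustration only: AudioStreamLike is a stand-in for a MediaStream so the sketch runs without a browser, and SketchSpeechRecognition, SketchTextToSpeech and recognize() are hypothetical names that do not appear in the Speech API draft. No real recognition or synthesis takes place.

```typescript
// Hypothetical stand-in for a WebRTC MediaStream (not the real interface).
interface AudioStreamLike {
  readonly id: string;
  readonly source: "microphone" | "remote-peer" | "file" | "synthesizer";
}

// Hypothetical recognizer that accepts external input.
class SketchSpeechRecognition {
  // Suggested property: when set, audio is taken from this stream instead
  // of the default input device chosen by the user agent.
  inputStream: AudioStreamLike | null = null;
  onresult: ((transcript: string) => void) | null = null;

  // Placeholder for actual recognition: it only reports where the audio
  // would have come from.
  recognize(): void {
    const origin = this.inputStream ? this.inputStream.source : "default-device";
    if (this.onresult) {
      this.onresult("[transcript of audio from " + origin + "]");
    }
  }
}

// Hypothetical synthesizer that exposes its output as a stream.
class SketchTextToSpeech {
  readonly text: string;

  constructor(text: string) {
    this.text = text;
  }

  // Suggested property: the synthesized audio as a stream, rather than
  // immediate playback on an output device.
  get outputStream(): AudioStreamLike {
    return { id: "tts-" + this.text.length, source: "synthesizer" };
  }
}

// Use-case 1: transcribe a remote party instead of the local microphone.
const remoteParty: AudioStreamLike = { id: "peer-1", source: "remote-peer" };
const recognition = new SketchSpeechRecognition();
const transcripts: string[] = [];
recognition.onresult = (t) => transcripts.push(t);
recognition.inputStream = remoteParty;
recognition.recognize();

// Use-case 2: obtain synthesized audio as a stream for later processing.
const tts = new SketchTextToSpeech("Hello");
const synthesized = tts.outputStream;
```

In a browser, the stream for use-case 1 would come from getUserMedia() or a peer connection, and the stream from use-case 2 could be handed to an <audio> element or to createMediaStreamSource(); the sketch leaves those steps out to stay self-contained.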
Received on Wednesday, 13 June 2012 14:50:33 UTC