- From: Brian Chirls <brian.chirls@gmail.com>
- Date: Tue, 19 Jun 2018 14:06:11 -0400
- To: public-speech-api@w3.org
- Message-ID: <CAEWr9F8EDHKYz5=Zg7n1FLo1sg+7DxNAYONuNPAi1R7QCD7=nA@mail.gmail.com>
I'd like to re-open this thread with some more information that I hope will be helpful.

First, the workaround suggested by guest271314 is a noble effort, but I can't get it to work, even in Chrome. It doesn't seem possible to record from an output device, and the error I'm getting indicates that this is by design.

I've also come upon a few more use cases that aren't so easily addressed outside the browser. These mostly involve using speech output in immersive or rich-media applications. Adding this feature would enable the following:

- Running post-processing filters, including panner nodes and reverb, to suggest a location in space (see the first sketch in the P.S. below). This is increasingly relevant as the WebXR API becomes more widely available, and it would add context to the speech audio, especially when vision is not available for whatever reason.

- Audio analysis for visualizing speech output, e.g. animating a character's face while they are speaking (second sketch below). Imagine a multi-user chat environment like Mozilla Hubs: this would allow someone who is situationally or permanently unable to speak to type and still be represented among other speaking participants. Or, if a user has their volume off or speakers disconnected, an animation could indicate that there is speech audio to be heard.

- More precise timing control. Speech synthesis is asynchronous, and there is no way to determine in advance when the speech will start or how long an utterance will take to finish. If speaking an utterance could return an audio buffer, applications could synchronize speech with a video or another audio track (third sketch below). Imagine run-time translation of a subtitle/caption track, which is then spoken and synchronized accordingly.

- Recording for output. With the maturity of the Web Audio API and the increasing availability of MediaRecorder, offline storage, and caching, there is an opportunity to build full professional audio editing applications in the browser, and there is no shortage of demos and experiments to that effect. The ability to add generated speech to authored audio/video output would be valuable for audio labeling of content (fourth sketch below).

Buffered speech output would also be necessary to make use of offline audio contexts for faster processing in the above applications.

I'm aware that there are a number of cloud speech services, but they don't address the above applications, for the following reasons:

- These APIs tend to have extremely tight limitations. For example, Google's solution has a limit of 300 requests per minute across an entire API key, which runs out very quickly if you want to make more than a few requests per user, or if you have more than a hundred or so users at once on an entire application. IBM Watson's limits are similar.

- They don't work without an internet connection.

- Even with an internet connection, latency can get quite high on mobile networks.

- These cloud APIs are likely to be deprecated in future versions or discontinued altogether, breaking web apps after a few years unless they are actively maintained.

Client-side JavaScript speech libraries don't even begin to match the quality of the Web Speech API, and they're pretty big: meSpeak.js, for example, comes in at around 3MB of scripts and data files.

Thanks
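P.S. To make these use cases more concrete, here are a few rough sketches. They all assume a hypothetical speechSynthesis.speakToBuffer() that resolves with a Web Audio AudioBuffer. No such method exists today; its exact shape is precisely what's up for discussion here.

First, spatializing speech for a WebXR scene with a panner and a convolver. impulseResponseBuffer is assumed to be decoded elsewhere (e.g. from a recorded room impulse response), and the code runs inside an async function:

    // Hypothetical API: resolve an utterance to an AudioBuffer instead
    // of playing it directly. speakToBuffer() does not exist today.
    const ctx = new AudioContext();
    const utterance = new SpeechSynthesisUtterance('Over here!');
    const speechBuffer = await speechSynthesis.speakToBuffer(utterance);

    const source = ctx.createBufferSource();
    source.buffer = speechBuffer;

    // Position the voice in 3D space, e.g. at an avatar's location.
    const panner = ctx.createPanner();
    panner.panningModel = 'HRTF';
    panner.positionX.value = 2;
    panner.positionZ.value = -1;

    // Room reverb via convolution with a pre-loaded impulse response.
    const reverb = ctx.createConvolver();
    reverb.buffer = impulseResponseBuffer;

    source.connect(panner).connect(reverb).connect(ctx.destination);
    source.start();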
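Second, driving a character's mouth from the speech signal with an AnalyserNode. This picks up from the source node above; avatar.setMouthOpenness() is a stand-in for whatever the application's scene graph actually provides:

    // Tap the speech signal for analysis while still playing it out.
    const analyser = ctx.createAnalyser();
    analyser.fftSize = 256;
    source.connect(analyser);
    analyser.connect(ctx.destination);

    const bins = new Uint8Array(analyser.frequencyBinCount);
    function animateMouth() {
      analyser.getByteFrequencyData(bins);
      // Average energy across frequency bins, normalized to 0..1.
      const level = bins.reduce((sum, v) => sum + v, 0) / bins.length / 255;
      avatar.setMouthOpenness(level); // stand-in for the app's own scene API
      requestAnimationFrame(animateMouth);
    }
    animateMouth();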
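Third, the timing case: speaking a run-time-translated caption cue in sync with a video. Because the buffer's duration and contents are known before playback, the utterance can be scheduled on the AudioContext clock rather than fired blind (translate() is a stand-in for whatever translation step the app uses):

    // Speak a translated caption cue, aligned to the video timeline.
    const cue = { startTime: 12.5, text: 'Hola' }; // from a caption track
    const utt = new SpeechSynthesisUtterance(translate(cue.text));
    const buffer = await speechSynthesis.speakToBuffer(utt); // hypothetical

    const src = ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(ctx.destination);

    // Schedule the utterance to land exactly on the cue; impossible
    // today because speechSynthesis.speak() gives no advance timing.
    const delay = cue.startTime - video.currentTime;
    src.start(ctx.currentTime + Math.max(0, delay));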
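Finally, the recording/offline case: rendering the speech through an OfflineAudioContext faster than real time, then capturing it with MediaRecorder for export (saveBlob() is a stand-in for the app's storage step):

    // Render speech (plus any effects chain) faster than real time.
    const offline = new OfflineAudioContext(
      1, speechBuffer.length, speechBuffer.sampleRate);
    const offlineSrc = offline.createBufferSource();
    offlineSrc.buffer = speechBuffer;
    offlineSrc.connect(offline.destination);
    offlineSrc.start();
    const rendered = await offline.startRendering();

    // Capture the rendered result for download or offline caching.
    const dest = ctx.createMediaStreamDestination();
    const playback = ctx.createBufferSource();
    playback.buffer = rendered;
    playback.connect(dest);
    const recorder = new MediaRecorder(dest.stream);
    recorder.ondataavailable = (e) => saveBlob(e.data); // stand-in
    recorder.start();
    playback.start();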
Received on Tuesday, 19 June 2018 18:24:33 UTC