- From: Brian Chirls <brian.chirls@gmail.com>
- Date: Tue, 19 Jun 2018 14:06:11 -0400
- To: public-speech-api@w3.org
- Message-ID: <CAEWr9F8EDHKYz5=Zg7n1FLo1sg+7DxNAYONuNPAi1R7QCD7=nA@mail.gmail.com>
I'd like to re-open this thread with some more information that I hope will be helpful.

First, the workaround suggested by guest271314 is a noble effort, but I can't get it to work, even in Chrome. It doesn't seem possible to record from an output device, and the error I'm getting indicates that this is by design.

I've also come upon a few more use cases that aren't so easily addressed outside the browser. These mostly involve using speech output in immersive or rich-media applications. Adding this feature would enable the following:

- Running post-processing filters, including panner nodes and reverb, to suggest a location in space (see the first sketch in the P.S. below). This is increasingly relevant as the WebXR API becomes more widely available, and it would add context to the speech audio, especially when vision is not available for whatever reason.

- Audio analysis for visualizing speech output, e.g. animating a character's face while they are speaking (second sketch below). Imagine a multi-user chat environment like Mozilla Hubs: this would allow someone who is situationally or permanently unable to speak to type and still be represented among other speaking participants. Or, if a user has their volume off or speakers disconnected, an animation could indicate that there is speech audio to be heard.

- More precise timing control. Speech synthesis is asynchronous, and there is no way to determine in advance when the speech will start or how long an utterance will take to finish. If speaking an utterance could return an audio buffer, applications could synchronize speech with a video or another audio track (third sketch below). Imagine run-time translation of a subtitle/caption track, which is then spoken and synchronized accordingly.

- Recording for output. With the maturity of the Web Audio API and the increasing availability of MediaRecorder, offline storage, and caching, there is an opportunity to build full professional audio editing applications in the browser, and there is no shortage of demos and experiments to that effect. The ability to add generated speech to authored audio/video output would be valuable for audio labeling of content (fourth sketch below).

Buffered speech output would also be necessary to make use of offline audio contexts for faster processing in the above applications.

I'm aware that there are a number of cloud speech services, but they don't address the above applications, for the following reasons:

- These APIs tend to have extremely tight limitations. For example, Google's solution has a limit of 300 requests per minute across an entire API key, which runs out very quickly if you want to make more than a few requests per user, or if you have more than a hundred or so users at once on an entire application. IBM Watson's limits are similar.

- They don't work without an internet connection.

- Even with an internet connection, latency can get quite high on mobile networks.

- These cloud APIs are likely to be deprecated in future versions or discontinued altogether, breaking web apps after a few years unless they are actively maintained.

Client-side JavaScript speech libraries don't even begin to match the quality of the Web Speech API, and they're pretty big: meSpeak.js, for example, comes in at around 3MB of scripts and data files.

Thanks
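P.S. To make these use cases more concrete, here are a few rough sketches. They all assume a hypothetical speechSynthesis.speakToBuffer() that resolves with a Web Audio AudioBuffer. No such method exists today; its exact shape is precisely what's up for discussion here.

First, spatializing speech for a WebXR scene with a panner and a convolver. impulseResponseBuffer is assumed to be decoded elsewhere (e.g. from a recorded room impulse response), and the code runs inside an async function:

    // Hypothetical API: resolve an utterance to an AudioBuffer instead
    // of playing it directly. speakToBuffer() does not exist today.
    const ctx = new AudioContext();
    const utterance = new SpeechSynthesisUtterance('Over here!');
    const speechBuffer = await speechSynthesis.speakToBuffer(utterance);

    const source = ctx.createBufferSource();
    source.buffer = speechBuffer;

    // Position the voice in 3D space, e.g. at an avatar's location.
    const panner = ctx.createPanner();
    panner.panningModel = 'HRTF';
    panner.positionX.value = 2;
    panner.positionZ.value = -1;

    // Room reverb via convolution with a pre-loaded impulse response.
    const reverb = ctx.createConvolver();
    reverb.buffer = impulseResponseBuffer;

    source.connect(panner).connect(reverb).connect(ctx.destination);
    source.start();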
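Second, driving a character's mouth from the speech signal with an AnalyserNode. This picks up from the source node above; avatar.setMouthOpenness() is a stand-in for whatever the application's scene graph actually provides:

    // Tap the speech signal for analysis while still playing it out.
    const analyser = ctx.createAnalyser();
    analyser.fftSize = 256;
    source.connect(analyser);
    analyser.connect(ctx.destination);

    const bins = new Uint8Array(analyser.frequencyBinCount);
    function animateMouth() {
      analyser.getByteFrequencyData(bins);
      // Average energy across frequency bins, normalized to 0..1.
      const level = bins.reduce((sum, v) => sum + v, 0) / bins.length / 255;
      avatar.setMouthOpenness(level); // stand-in for the app's own scene API
      requestAnimationFrame(animateMouth);
    }
    animateMouth();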
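Third, the timing case: speaking a run-time-translated caption cue in sync with a video. Because the buffer's duration and contents are known before playback, the utterance can be scheduled on the AudioContext clock rather than fired blind (translate() is a stand-in for whatever translation step the app uses):

    // Speak a translated caption cue, aligned to the video timeline.
    const cue = { startTime: 12.5, text: 'Hola' }; // from a caption track
    const utt = new SpeechSynthesisUtterance(translate(cue.text));
    const buffer = await speechSynthesis.speakToBuffer(utt); // hypothetical

    const src = ctx.createBufferSource();
    src.buffer = buffer;
    src.connect(ctx.destination);

    // Schedule the utterance to land exactly on the cue; impossible
    // today because speechSynthesis.speak() gives no advance timing.
    const delay = cue.startTime - video.currentTime;
    src.start(ctx.currentTime + Math.max(0, delay));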
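Finally, the recording/offline case: rendering the speech through an OfflineAudioContext faster than real time, then capturing it with MediaRecorder for export (saveBlob() is a stand-in for the app's storage step):

    // Render speech (plus any effects chain) faster than real time.
    const offline = new OfflineAudioContext(
      1, speechBuffer.length, speechBuffer.sampleRate);
    const offlineSrc = offline.createBufferSource();
    offlineSrc.buffer = speechBuffer;
    offlineSrc.connect(offline.destination);
    offlineSrc.start();
    const rendered = await offline.startRendering();

    // Capture the rendered result for download or offline caching.
    const dest = ctx.createMediaStreamDestination();
    const playback = ctx.createBufferSource();
    playback.buffer = rendered;
    playback.connect(dest);
    const recorder = new MediaRecorder(dest.stream);
    recorder.ondataavailable = (e) => saveBlob(e.data); // stand-in
    recorder.start();
    playback.start();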
Received on Tuesday, 19 June 2018 18:24:33 UTC