Re: [mediacapture-main] New audio acquisition constraints (#671)

Well, the authors of the specification explicitly stated that Media Capture and Streams has nothing to do with TTS/STT at "Support capturing audio output from sound card" #629 https://github.com/w3c/mediacapture-main/issues/629#issuecomment-572624317

> Jan-Ivar said:
> 
> "I don't think mediacapture-main should specify generic capture of audio output from a browser or document without the context of a compelling user problem that it best solves.
> 
> Especially not to work around another web API failing to provide adequate access¹ to audio it generates, to solve a use case that seems reasonable in that spec's domain.
> 
> Instead, I'd ask that API to solve it without workarounds. Feel free to refer to mozilla/standards-positions#170 (comment).
> 
> The audio capture support in getDisplayMedia is not a good fit, as it's A) aimed at the screen-sharing use-case B) optional, and C) solely complementary to video."
> 
> [BA] Closing this issue.

and at "Extending Media Capture and Streams with MediaStreamTrack kind TTS" #654 https://github.com/w3c/mediacapture-main/issues/654#issuecomment-575203694

> Production of a MediaStreamTrack from TTS should be an extension spec for the Text-To-Speech API, not a feature of the MediaStreamTrack API.

Since all of a sudden Media Capture and Streams _is_ incorporating TTS/STT language into the specification, those closed issues should be summarily re-opened to avoid any hint of, at a minimum, ambiguity and inconsistency: a feature request gets closed, then a comparable feature request is submitted by the same individual who closed the previous issue, once their parent corporation is the source of the feature request.

Adding a category label does not, by itself, do anything in particular.

The purpose of "Support capturing audio output from sound card" #629 is specifically to handle STT input where the sound is captured directly from the sound card instead of the microphone. The point of that algorithm comes precisely from what experimenting with TTS/STT has shown:

1) The input to STT should be normalized, for several reasons. No two live human voice inputs will produce the same `MediaStreamTrack`, due to the potential for packet loss when Opus is used. That means, in general, no two recordings will have the same file length, except for AV1 or H264 codec usage, which will produce the same output for the same input. One way to normalize input is to pipe voice through TTS, so that the input to STT is always the output of TTS. Then, perhaps, the algorithms used to parse speech input will at least have the same input as they "learn"; e.g., consider https://bugzilla.mozilla.org/show_bug.cgi?id=1604994, where the input was output from `espeak-ng`, meaning the input was always the same. Even then the output from STT was not. With a human voice, even where "echo cancellation" is used, the variance between audio inputs can obviously have a very wide range, whereas transforming actual human voice input, or pre-recorded input, through a TTS algorithm that normalizes the input to the STT will decrease the "learning curve" of the algorithm used to parse speech input. This is, of course, only if the user has any control over the STT parsing algorithm; if not, they are subject to the whims of the potentially narrow lens and lexicon of whoever is writing the STT algorithm. For example, when `webkitSpeechRecognition` was last tested, "the algorithm" censored certain words https://bugs.chromium.org/p/chromium/issues/detail?id=804812, in which case it does not matter what category or label is placed onto the `MediaStreamTrack`, as "the algorithm" (some human who intentionally wrote the code) simply removes your intent from the output. Capturing directly from the sound card means zero need for echo cancellation, as the user controls the entire audio, whether WAV or another codec is used. If necessary the file can be converted to a `MediaStreamTrack` for transmission to a local or remote STT program or service (see the first sketch after this list). "Optimization" of the input audio track is meaningless when the service censors content on purpose. Normalizing human voice input to be TTS output can be one form of "optimization".

2) The user has zero control over TTS output at browsers when using `speechSynthesis.speak()`. In particular, there is no way to know exactly when the TTS output stream ends, due to the potential for `<break>`, `<break time="5000">`, and `<break time="10000" strength="medium">` elements in SSML input https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/break/parseSSMLBreakElementStrengthAndTimeAttributes.js, where testing for silence using `AudioContext` can give false positives as to when output ends; an `ended` event would not be fired at an SSML `<break>` element, only at end of file. This can be greatly improved by specifying a `kind` `"TTS"` which outputs audio from a local or remote TTS program or service as a `MediaStreamTrack` that ends, and actually fires the `ended` event, when EOF is reached. There is also no way to record TTS output by way of any API or specification. While workarounds are possible https://github.com/guest271314/SpeechSynthesisRecorder, particularly at Nightly and Firefox, which expose `Monitor of <device>` as a capture device (see the second sketch after this list), there is no standardized way to capture speech synthesis output. This can be solved by extending `MediaStreamTrack` with `kind` `"TTS"` in this specification to address that coverage missing from all specifications.
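For illustration of the conversion mentioned in 1), here is a minimal sketch (not part of any specification, and not the code used in the linked issues) of turning a pre-recorded audio file into a `MediaStreamTrack` with Web Audio; the file name `captured.wav` is a hypothetical placeholder:

```js
// Minimal sketch: decode a pre-recorded (e.g., sound-card-captured or
// TTS-normalized) audio file and expose it as a MediaStreamTrack that can be
// sent to a local or remote STT program or service.
async function fileToMediaStreamTrack(url = 'captured.wav' /* hypothetical file */) {
  const ac = new AudioContext();
  const data = await (await fetch(url)).arrayBuffer();
  const buffer = await ac.decodeAudioData(data);
  // Route the decoded audio into a MediaStream destination.
  const destination = ac.createMediaStreamDestination();
  const source = ac.createBufferSource();
  source.buffer = buffer;
  source.connect(destination);
  source.start();
  const [track] = destination.stream.getAudioTracks();
  // Stop the track once the decoded buffer has finished playing.
  source.onended = () => track.stop();
  return track;
}
```

The resulting track could, for example, be added to an `RTCPeerConnection` for transmission to a remote STT service.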
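And a minimal sketch of the capture workaround referenced in 2), assuming a browser and platform that expose `Monitor of <device>` as an audio input device (per the above, e.g., Firefox or Nightly on Linux); the function name is illustrative and nothing here is specified behavior:

```js
// Minimal sketch: capture speechSynthesis.speak() output from a
// "Monitor of <device>" audio input, where the platform exposes one.
async function captureSpeechSynthesisOutput() {
  // Request permission once so device labels are populated, then release.
  const probe = await navigator.mediaDevices.getUserMedia({ audio: true });
  probe.getTracks().forEach((t) => t.stop());
  const devices = await navigator.mediaDevices.enumerateDevices();
  const monitor = devices.find(
    ({ kind, label }) => kind === 'audioinput' && /^Monitor of/.test(label)
  );
  if (!monitor) {
    throw new Error('No "Monitor of <device>" capture device exposed');
  }
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { deviceId: { exact: monitor.deviceId } }
  });
  const [track] = stream.getAudioTracks();
  // Today track.kind is "audio" and the track never ends on its own; under
  // the kind "TTS" proposal in 2) the track would fire "ended" at EOF of the
  // TTS output, instead of relying on silence detection with AudioContext.
  return track;
}
```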

Can you clearly define and describe what is meant by

> a stream with optimizations for speech

?

What specific steps are being proposed to achieve "optimizations for speech"?

-- 
GitHub Notification of comment by guest271314
Please view or discuss this issue at https://github.com/w3c/mediacapture-main/issues/671#issuecomment-605365112 using your GitHub account

Received on Saturday, 28 March 2020 00:17:49 UTC