W3C home > Mailing lists > Public > public-webrtc@w3.org > January 2020

[mediacapture-main] Extending Media Capture and Streams with MediaStreamTrack kind TTS (#654)

From: guest271314 via GitHub <sysbot+gh@w3.org>
Date: Thu, 09 Jan 2020 18:13:56 +0000
To: public-webrtc@w3.org
Message-ID: <issues.opened-547639275-1578593633-sysbot+gh@w3.org>
guest271314 has just created a new issue for https://github.com/w3c/mediacapture-main:

== Extending Media Capture and Streams with MediaStreamTrack kind TTS ==
**Extending Media Capture and Streams with MediaStreamTrack kind TTS or: What is the canonical procedure to programmatically create a virtual media device where the source is a local file or piped output from a local application?**

The current iteration of the specification includes the language

https://w3c.github.io/mediacapture-main/#dfn-source

> _**source**_
> A source is the "thing" providing the source of a media stream track. The source is the broadcaster of the media itself. A source can be a physical webcam, microphone, local video or audio file from the user's hard drive, network resource, or static image.

and 

https://w3c.github.io/mediacapture-main/#extensibility

> Extensibility

In pertinent part

> The purpose of this section is to provide guidance to creators of such extensions.

and 

https://w3c.github.io/mediacapture-main/#defining-a-new-media-type-beyond-the-existing-audio-and-video-types

> 16.1 Defining a new media type (beyond the existing Audio and Video types)

The list items under the above section are incorporated by reference herein.

**Problem**

Web Speech API (W3C) is dead. 

The model is based on communication with `speech-dispatcher` to output audio to the sound card, where the user (consumer) has no control over the output.

This proposal is simple: 

Extend `MediaStreamTrack` to include a `kind` `TTS` where the `source` is the output of a local TTS (Text To Speech; speech synthesis) engine.

The model is also simple: assuming there is a local `.txt` or `.xml` document, the input text is read by the TTS application from the local file. The output is a `MediaStream` containing a single `MediaStreamTrack` of `kind` and `label` `TTS`.

The `source` file is read and output as a `MediaStreamTrack` within a `MediaStream` after the `getUserMedia()` prompt.

When the file read reaches `EOF`, the `MediaStreamTrack` of `kind` `TTS` automatically stops, similar to [MediaRecorder: Implements spontaneous stopping.](https://github.com/web-platform-tests/wpt/pull/20865)
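The spontaneous-stop behavior can be sketched with a small helper; `trackEnded()` is a hypothetical name used here for illustration, not a specified API:

```javascript
// Resolve when a MediaStreamTrack (or any EventTarget firing an
// "ended" event) signals that its source reached EOF.
// trackEnded() is a hypothetical helper, not part of any specification.
function trackEnded(track) {
  return new Promise(resolve => {
    track.addEventListener('ended', () => resolve(track), { once: true });
  });
}
```

In the proposed model, a track of `kind` `TTS` would fire `ended` on its own once the engine finishes reading the input file, so `trackEnded(track)` would resolve without the page ever calling `MediaStreamTrack.stop()`.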

Similar functionality already exists for testing; in brief:

- [fake_audio_output_stream.cc](https://chromium.googlesource.com/chromium/src/+/4cdbc38ac425f5f66467c1290f11aa0e7e98c6a3/media/audio/fake_audio_output_stream.cc)
- [fake_audio_manager.cc](https://chromium.googlesource.com/chromium/src/+/4cdbc38ac425f5f66467c1290f11aa0e7e98c6a3/media/audio/fake_audio_manager.cc)

For example

```
# launch
chromium-browser --allow-file-access-from-files --autoplay-policy=no-user-gesture-required --use-fake-device-for-media-stream --use-fake-ui-for-media-stream --use-file-for-fake-audio-capture=$HOME/test.wav%noloop --user-data-dir=$HOME/test 'file:///home/user/testUseFileForFakeAudioCaptureChromium.html'
```

```
// at the main thread
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(mediaStream => {
    const ac = new AudioContext();
    const source = ac.createMediaStreamSource(mediaStream);
    source.connect(ac.destination);
  });
```

One problem with using that testing code in production to meet the requirement of outputting the result of TTS is that there is no way to determine `EOF` without getting the `duration` of the file before playback at the `MediaStream`. Since SSML can include `<break time="5000ms"/>`, analyzing the audio output stream for silence can lead to prematurely executing `MediaStreamTrack.stop()` to end the track.
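As a sketch of the `duration` workaround, assuming the input is a PCM WAV file with the canonical 44-byte header (byte rate at offset 28, `data` chunk size at offset 40, both little-endian), the duration can be computed from the header before playback; the function name is illustrative:

```javascript
// Compute the duration in seconds of a canonical PCM WAV file from
// its 44-byte header: byte rate at offset 28, "data" chunk size at
// offset 40 (both little-endian 32-bit unsigned integers).
function wavDuration(buffer) {
  const view = new DataView(buffer);
  const byteRate = view.getUint32(28, true);  // bytes per second
  const dataSize = view.getUint32(40, true);  // size of PCM data chunk
  return dataSize / byteRate;
}
```

Knowing the duration up front, the page could schedule `MediaStreamTrack.stop()` after that interval instead of guessing from silence, at the cost of statting and parsing the file first.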

When called multiple times in succession, even after statting the file twice to get the `duration`, no sound is output after two to three calls.

MacOS also has an issue with that flag [`--use-file-for-fake-audio-capture` doesn't work on Chrome](https://github.com/cypress-io/cypress/issues/5592).

We can write the input `.txt` or `.xml` file to the local filesystem using the File API or Native File System, therefore input is not an issue.

**Why Media Capture and Streams and not Web Speech API?**

W3C Web Speech API is dead.

W3C Web Speech API was not initially written to provide such functionality, even though the underlying speech synthesis application installed on the local machine might have such functionality.

Even if Web Speech API does become un-dead and moves to, or provides, `MediaStream` and `MediaStreamTrack` as output options, some form of collaboration with, and reliance on, this governing specification will be required. Thus it is reasonable to simply begin from the Media Capture and Streams specification re "Extensibility" and work backwards, or rather, work from both ends towards the middle. Attempting to perform either modification in isolation might prove inadequate.

If there is any objection to the claim that W3C Web Speech API is dead, re the suggestion to deal with speech synthesis in the Web Speech API specification, then that objection must explain why Web Speech API has not implemented an SSML parsing flag when the patch has been available for some time (https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18), and why, instead of actually using the Web Speech API, ChromiumOS authors decided to use `wasm` and `espeak-ng` to implement TTS (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome), essentially abandoning Web Speech API usage.

--

An alternative approach to solve the use case is for the specification to compose the formal steps necessary to create a virtual media device that `getUserMedia()` provides access to as a "microphone" (because the device is virtual, we can assign said device as a microphone, which should be listed at the `getUserMedia()` prompt and at `enumerateDevices()`), e.g., https://stackoverflow.com/a/40783725

```
diff --git a/webrtc/modules/audio_device/dummy/file_audio_device.cc b/webrtc/modules/audio_device/dummy/file_audio_device.cc
index 8b3fa5e..2717cda 100644
--- a/webrtc/modules/audio_device/dummy/file_audio_device.cc
+++ b/webrtc/modules/audio_device/dummy/file_audio_device.cc
@@ -35,6 +35,7 @@ FileAudioDevice::FileAudioDevice(const int32_t id,
     _recordingBufferSizeIn10MS(0),
     _recordingFramesIn10MS(0),
     _playoutFramesIn10MS(0),
+    _initialized(false),
     _playing(false),
     _recording(false),
     _lastCallPlayoutMillis(0),
@@ -135,12 +136,13 @@ int32_t FileAudioDevice::InitPlayout() {
       // Update webrtc audio buffer with the selected parameters
       _ptrAudioBuffer->SetPlayoutSampleRate(kPlayoutFixedSampleRate);
       _ptrAudioBuffer->SetPlayoutChannels(kPlayoutNumChannels);
+      _initialized = true;
   }
   return 0;
 }

 bool FileAudioDevice::PlayoutIsInitialized() const {
-  return true;
+  return _initialized;
 }

 int32_t FileAudioDevice::RecordingIsAvailable(bool& available) {
@@ -236,7 +238,7 @@ int32_t FileAudioDevice::StopPlayout() {
 }

 bool FileAudioDevice::Playing() const {
-  return true;
+  return _playing;
 }

 int32_t FileAudioDevice::StartRecording() {
diff --git a/webrtc/modules/audio_device/dummy/file_audio_device.h b/webrtc/modules/audio_device/dummy/file_audio_device.h
index a69b47e..3f3c841 100644
--- a/webrtc/modules/audio_device/dummy/file_audio_device.h
+++ b/webrtc/modules/audio_device/dummy/file_audio_device.h
@@ -185,6 +185,7 @@ class FileAudioDevice : public AudioDeviceGeneric {
   std::unique_ptr<rtc::PlatformThread> _ptrThreadRec;
   std::unique_ptr<rtc::PlatformThread> _ptrThreadPlay;

+  bool _initialized;
   bool _playing;
   bool _recording;
   uint64_t _lastCallPlayoutMillis;
```

in order not to have to ask this body to specify the same in the official standard; just patch the virtual device into the existing infrastructure.

-- 

**Use cases**

For some reason, users appear to feel more comfortable using standardized APIs rather than rolling their own. For those users, a canonical means to patch into the existing formal API, without that functionality being officially written, might provide the assurance they seem to want that the means used are appropriate and should "work". Indeed, some users appear not to be aware that currently Web Speech API itself does not provide any algorithm to synthesize text to speech; it is hard to say.

> [Support SpeechSynthesis *to* a MediaStreamTrack](https://github.com/WICG/speech-api/issues/69)
> 
> It would be very helpful to be able to get a stream of the output of SpeechSynthesis.
> 
> For explicit use cases, I would like to:
> 
> - position speech synthesis in a virtual world in WebXR (using Web Audio's PannerNode)
> - be able to feed speech synthesis output through a WebRTC connection
> - have speech synthesis output be able to be processed through Web Audio
> (This is similar/inverse/matching/related feature to #66.)
> Though they are aware that the output is squarely their media - user media - that they should be able to "get".
>
> https://github.com/WICG/speech-api/issues/69#issuecomment-539123395
>  I think it would be good to have one, relatively simple API to do TTS. I am additionally suggesting here in this issue that you should be able to get a MediaStream of that output (rather than have it piped to audio output).

and 

> [Use and parse SSML to change voices, pitch, rate](https://github.com/rumkin/duotone-reader/issues/3)
> 
> Hi, thanks for this. I wish it be more widely adopted solution, which everyone can run in their browser without a need to install something in their system. But I understand it's not possible in the moment, so I'll be searching a way to make communicating with WhatWG.
> 
> I'll reopen this to help others to read this issue. Please don't close it.


The latter case should be easily solved by implementing SSML parsing. However, that has not been done

- [Issue 795371: Implement SSML parsing at SpeechSynthesisUtterance](https://bugs.chromium.org/p/chromium/issues/detail?id=795371)
- [Implement SSML parsing at SpeechSynthesisUtterance](https://bugzilla.mozilla.org/show_bug.cgi?id=1425523)

even though the patch to do so exists: https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18


```
---
chrome/browser/speech/tts_linux.cc     |    1 +
third_party/speech-dispatcher/BUILD.gn |    1 +
2 files changed, 2 insertions(+)
 
--- a/chrome/browser/speech/tts_linux.cc
+++ b/chrome/browser/speech/tts_linux.cc
@@ -137,6 +137,7 @@ void TtsPlatformImplLinux::Initialize()
libspeechd_loader_.spd_set_notification_on(conn_, SPD_CANCEL);
libspeechd_loader_.spd_set_notification_on(conn_, SPD_PAUSE);
libspeechd_loader_.spd_set_notification_on(conn_, SPD_RESUME);
+  libspeechd_loader_.spd_set_data_mode(conn_, SPD_DATA_SSML);
}
 
TtsPlatformImplLinux::~TtsPlatformImplLinux() {
--- a/third_party/speech-dispatcher/BUILD.gn
+++ b/third_party/speech-dispatcher/BUILD.gn
@@ -19,6 +19,7 @@ generate_library_loader("speech-dispatch
"spd_pause",
"spd_resume",
"spd_set_notification_on",
+    "spd_set_data_mode",
"spd_set_voice_rate",
"spd_set_voice_pitch",
"spd_list_synthesis_voices",
```



and the maintainers of `speech-dispatcher` (`speechd`) are very helpful.

Tired of waiting for Web Speech API to become un-dead, this user wrote an SSML parser from scratch in JavaScript:

- [SpeechSynthesisSSMLParser](https://github.com/guest271314/SpeechSynthesisSSMLParser)
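A minimal sketch of one normalization step such a parser performs: converting the SSML `<break>` element's `time` attribute (e.g. `"5000ms"`, `"5s"`) to milliseconds, per the SSML time designation grammar. The function name is illustrative, not taken from that repository:

```javascript
// Convert an SSML time designation ("750ms", "3s", "0.5s") to
// milliseconds. Returns NaN for values outside the SSML grammar.
function ssmlTimeToMs(value) {
  const match = /^(\d+(?:\.\d+)?)(ms|s)$/.exec(value.trim());
  if (!match) return NaN;
  const n = parseFloat(match[1]);
  return match[2] === 's' ? n * 1000 : n;
}
```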

So, no, 

> https://github.com/w3c/mediacapture-main/issues/629#issuecomment-572624317
> Especially not to work around another web API failing to provide adequate access¹ to audio it generates, to solve a use case that seems reasonable in that spec's domain.

is not applicable anymore. Why would users have any confidence that the Web Speech API is un-dead and will eventually address the issue?

Besides, in order to get output as a `MediaStream`, this specification would need to be involved in some non-trivial way as a reference.

-- 

The purpose of this issue is to get clarity on precisely what is needed to 

1. Extend `getUserMedia()` to list a created virtual device for purposes of speech synthesis output;
2. If 1. is not going to happen (per https://github.com/w3c/mediacapture-main/issues/629; https://github.com/w3c/mediacapture-main/issues/650), then kindly write out clearly the canonical steps required to create OS-agnostic code to implement the device that `getUserMedia()` is currently specified to list and have access to, so that we can feed that device input from a file, or pipe it directly to the `MediaStreamTrack`, and users can implement the necessary code properly themselves.
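Under the first ask, selecting the proposed track from a `getUserMedia()` result could look like the following sketch; the `TTS` `label` value is the proposal in this issue, not an existing API, and the function name is hypothetical:

```javascript
// Hypothetical: pick out the tracks a TTS-aware implementation would
// label "TTS" from the MediaStream returned by
// getUserMedia({ audio: true }).
function selectTtsTracks(stream) {
  return stream.getAudioTracks().filter(track => track.label === 'TTS');
}
```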

The use cases exist. The technology exists. This issue is an attempt to bridge the gap between an active and well-defined specification and an ostensibly non-active and ill-defined specification, incapable of being "fixed" properly without rewriting the entire specification (which this user cannot participate in due to the fraudulent 1,000 year ban placed on this user from contributing to WICG/speech-api).

What are the canonical procedures to 1) extend (as defined in this specification) `MediaStreamTrack` to include a `TTS` `kind` and `label` with speech synthesis engine output as the **source** (as defined in this specification); and 2) programmatically create a virtual input device that `getUserMedia({audio: true})` will recognize, list, and have access to?



Please view or discuss this issue at https://github.com/w3c/mediacapture-main/issues/654 using your GitHub account
Received on Thursday, 9 January 2020 18:13:58 UTC

This archive was generated by hypermail 2.4.0 : Friday, 17 January 2020 19:18:51 UTC