Re: [mediacapture-output] Directing Web Speech API audio to a specific output device? (#102)

@josephrocca There is no simple way to get the direct output of a speech synthesis engine other than calling the engine directly and processing the raw audio output. Technically, a socket connection can be established to `speech-dispatcher`, which supports 'unix_socket' and 'inet_socket' communication methods:

No specification, including Media Capture and Streams, Audio Output Devices API, Web Audio API, or Web Speech API (see [MediaStream, ArrayBuffer, Blob audio result from speak() for recording?](https://lists.w3.org/Archives/Public/public-speech-api/2017Jun/0000.html) and https://github.com/WebAudio/web-audio-api-v2/issues/10#issuecomment-682259080), defines a means to access or capture speech synthesis engine output directly.

```
$ speech-dispatcher -h
Speech Dispatcher -- Common interface for Speech Synthesis (GNU GPL)

Usage: speech-dispatcher [-{d|s}] [-l {1|2|3|4|5}] [-c com_method] [-S socket_path] [-p port] [-t timeout] | [-v] | [-h]

Options: 
  -d, --run-daemon      Run as a daemon
  -s, --run-single      Run as single application
  -a, --spawn           Start only if autospawn is not disabled
  -l, --log-level       Set log level (between 1 and 5)
  -L, --log-dir         Set path to logging
  -c, --communication-method
                        Communication method to use ('unix_socket'
                        or 'inet_socket')
  -S, --socket-path     Socket path to use for 'unix_socket' method
                        (filesystem path or 'default')
  -p, --port            Specify a port number for 'inet_socket' method
  -t, --timeout         Set time in seconds for the server to wait before it
                        shuts down, if it has no clients connected
  -P, --pid-file        Set path to pid file
  -C, --config-dir      Set path to configuration
  -m, --module-dir      Set path to modules
  -v, --version         Report version of this program
  -D, --debug           Output debugging information into $TMPDIR/speechd-debug
                        if TMPDIR is exported, otherwise to /tmp/speechd-debug
  -h, --help            Print this info

Please report bugs to speechd-discuss@nongnu.org
```
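
For illustration, a minimal Node.js sketch of such a socket connection, speaking one phrase over SSIP (speech-dispatcher's wire protocol); the socket path, command sequence, and reply codes are assumptions based on a typical Linux install:

```
// Hypothetical sketch: speak one phrase over speech-dispatcher's SSIP socket.
// The socket path and command sequence are assumptions for a typical install.
const net = require('net');
const socketPath = `${process.env.XDG_RUNTIME_DIR}/speech-dispatcher/speechd.sock`;
const commands = [
  'SET SELF CLIENT_NAME user:example:main\r\n',
  'SPEAK\r\n',
  'Hello from speech-dispatcher\r\n.\r\n', // message body, terminated by "."
  'QUIT\r\n',
];
const socket = net.connect(socketPath, () => socket.write(commands.shift()));
socket.on('data', (chunk) => {
  process.stdout.write(chunk); // SSIP status replies, e.g. "208 OK ..."
  if (commands.length) socket.write(commands.shift());
  else socket.end();
});
```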


Aside from more elaborate solutions that involve growing `WebAssembly.Memory` (https://github.com/WebAudio/web-audio-api-v2/issues/97) or streaming a monitor device from Nightly to Chromium (https://gist.github.com/guest271314/04a539c00926e15905b86d05138c113c), one solution is to use a local server. There are then ways to get the resulting `MediaStreamTrack` from localhost to any origin. Note that capturing a monitor device captures all system audio output, not only the output of the `speech-dispatcher` speech synthesis module.
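
For reference, a hedged sketch of the monitor-device capture in a browser that exposes monitor devices as audio inputs (e.g. Firefox on Linux with PulseAudio); the `/monitor/` label test is an assumption about device naming:

```
// Hedged sketch: capture a PulseAudio monitor device with getUserMedia().
// Labels are only populated after permission is granted, and the
// "monitor" label pattern is an assumption about PulseAudio naming.
const devices = await navigator.mediaDevices.enumerateDevices();
const monitor = devices.find(
  (d) => d.kind === 'audioinput' && /monitor/i.test(d.label)
);
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { deviceId: { exact: monitor.deviceId } },
});
const [track] = stream.getAudioTracks(); // carries all system audio output
```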

You can use any language for the server. Here we use `php` with the `espeak-ng` speech synthesis engine.

speak.php
```
<?php
  if (isset($_POST["speak"])) {
    // Adjust to the origin of the page making the request
    header("Access-Control-Allow-Origin: http://localhost:8000");
    header("Content-Type: application/octet-stream");
    $input = urldecode($_POST["speak"]);
    // $options is passed through as-is; treat it as trusted local input
    $options = urldecode($_POST["options"]);
    // passthru() writes the raw WAV from espeak-ng straight to the response;
    // escapeshellarg() guards the spoken text against shell injection
    passthru("espeak-ng --stdout " . $options . " " . escapeshellarg($input));
    exit();
  }
```
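
The script can be served locally, for example with PHP's built-in development server run from the directory containing `speak.php`: `php -S localhost:8000` (the port is illustrative).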

Using `MediaStreamAudioSourceNode`

```
// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
    const output = new Float32Array(inputArray.length);
    for (let i = 0; i < output.length; i++) {
        const int = inputArray[i];
        // If the high bit is on, then it is a negative number, and actually counts backwards.
        const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
        output[i] = float;
    }
    return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to 
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.

- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
  // Skip the 44-byte WAV header; the rest is 16-bit mono PCM
  const uint16 = new Uint16Array(arrayBuffer.slice(44));
  const floats = int16ToFloat32(uint16);
  const ac = new AudioContext({sampleRate: 22050});
  const buffer = new AudioBuffer({
    numberOfChannels: 1,
    length: floats.length, // length is in sample frames, not bytes
    sampleRate: ac.sampleRate
  });
  console.log(floats);
  buffer.getChannelData(0).set(floats);
  const absn = new AudioBufferSourceNode(ac, {buffer});
  // MediaStreamAudioDestinationNode has no output to connect onward; to hear
  // the audio, its stream is fed back in via MediaStreamAudioSourceNode
  const msd = new MediaStreamAudioDestinationNode(ac);
  const {stream: mediaStream} = msd;
  const source = new MediaStreamAudioSourceNode(ac, {mediaStream});
  absn.connect(msd);
  absn.start();
  source.connect(ac.destination);
});
```
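
Given the resulting `mediaStream` (or its `MediaStreamTrack`) from the example above, output can then be directed to a specific audio output device, the subject of this issue; a minimal sketch, where picking the first `audiooutput` entry is purely illustrative:

```
// Route the captured stream to a chosen output device via setSinkId().
const audio = new Audio();
audio.srcObject = mediaStream;
// Real code would match a specific deviceId; the first entry is illustrative
const devices = await navigator.mediaDevices.enumerateDevices();
const { deviceId } = devices.find((d) => d.kind === 'audiooutput');
await audio.setSinkId(deviceId);
await audio.play();
```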

Using `AudioWorkletNode` with a single `Float32Array` passed at construction time. Alternatively, we could write the streamed `STDOUT` to a single `ArrayBuffer` or `SharedArrayBuffer` using `Response.body.getReader()` and read from that memory in `process()`; see the sketch after the following block.

```
// https://stackoverflow.com/a/35248852
function int16ToFloat32(inputArray) {
    const output = new Float32Array(inputArray.length);
    for (let i = 0; i < output.length; i++) {
        const int = inputArray[i];
        // If the high bit is on, then it is a negative number, and actually counts backwards.
        const float = (int >= 0x8000) ? -(0x10000 - int) / 0x8000 : int / 0x7FFF;
        output[i] = float;
    }
    return output;
}
var fd = new FormData();
fd.append('options', '-v Storm');
fd.append('speak', `Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to 
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesnt want you to be correct.

- Neil deGrasse Tyson, May 3, 2017 at 92nd Street Y`);
fetch('', {method:'post', body:fd})
.then(r => r.arrayBuffer())
.then(async arrayBuffer => {
  // Skip the 44-byte WAV header; the rest is 16-bit mono PCM
  const uint16 = new Uint16Array(arrayBuffer.slice(44));
  const floats = int16ToFloat32(uint16);
  const ac = new AudioContext({sampleRate: 22050});
  console.log(ac.state);
  // Dummy declaration so the class can be authored here in the main scope;
  // AudioWorkletProcessor only exists in AudioWorkletGlobalScope, where the
  // stringified class is ultimately registered
  class AudioWorkletProcessor {}
  class SpeechSynthesisStream extends AudioWorkletProcessor {
    constructor(options) {
      super(options);
      Object.assign(this, options.processorOptions);
      globalThis.console.log(this.floats);
      this.port.postMessage({start: this.start = !this.start});
    }
    endOfStream() {
      this.port.postMessage({
        ended: true,
        currentTime,
        currentFrame,
        readOffset: this.readOffset,
      });
    }
    process(inputs, outputs) {
      const [channel] = outputs.flat();
      if (this.readOffset >= this.floats.length) {
        console.log(this);
        this.endOfStream();
        // returning false ends the processor's lifetime
        return false;
      }
      // Fill one 128-sample render quantum, zero-padding past the end
      const data = Float32Array.from({length: 128}, _ => {
        const index = this.readOffset;
        if (index >= this.floats.length) return 0;
        return this.floats[this.readOffset++];
      });
      channel.set(data);
      return true;
    }
  }
  // register processor in AudioWorkletGlobalScope
  function registerProcessor(name, processorCtor) {
    return `${processorCtor};\nregisterProcessor('${name}', ${processorCtor.name});`;
  }
  const worklet = URL.createObjectURL(
    new Blob(
      [registerProcessor('speech-synthesis-stream', SpeechSynthesisStream)],
      { type: 'text/javascript' }
    )
  );
  ac.onstatechange = e => console.log(ac.state);
  await ac.audioWorklet.addModule(worklet);
  const aw = new AudioWorkletNode(ac, 'speech-synthesis-stream', {
    numberOfInputs: 1,
    numberOfOutputs: 1,
    channelCount: 1,
    processorOptions: {
      readOffset: 0,
      ended: false,
      start: false,
      floats
    },
  });
  aw.onprocessorerror = e => {
    console.error(e);
    console.trace();
  };
  const msd = new MediaStreamAudioDestinationNode(ac);
  const { stream } = msd;
  const [track] = stream.getAudioTracks();
  aw.connect(msd);
  aw.connect(ac.destination);
  // const recorder = new MediaRecorder(stream);
  // recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
  if (ac.state === 'running') {
    await ac.suspend();
  }
  aw.port.onmessage = async e => {
    console.log(e.data, ac.state);
    if (e.data.start && ac.state === 'suspended') {
      await ac.resume();
      // recorder.start();
    } else {
      // if (recorder.state === 'recording') recorder.stop();
      track.stop();
      aw.disconnect();
      msd.disconnect();
      await ac.close();
      console.log(track);
    }
  };
});
```
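
And a minimal sketch of the streaming variant mentioned above: instead of passing one complete `Float32Array` to the processor, chunks are posted over `aw.port` as `Response.body.getReader()` yields them. It assumes the processor's `port.onmessage` appends each posted array to an internal queue that `process()` drains, and simplifies chunk alignment:

```
// Hedged sketch: stream the response body to the worklet as it arrives.
// Assumes the processor's port.onmessage appends each posted Float32Array
// to an internal queue that process() drains.
const response = await fetch('', { method: 'post', body: fd });
const reader = response.body.getReader();
let headerBytes = 44; // WAV header bytes still to skip
for (;;) {
  const { value, done } = await reader.read();
  if (done) {
    aw.port.postMessage({ done: true });
    break;
  }
  const bytes = value.subarray(headerBytes);
  headerBytes = Math.max(0, headerBytes - value.length);
  if (!bytes.length) continue;
  // Simplification: assumes each chunk is 16-bit aligned; a robust version
  // would carry a dangling odd byte over to the next read
  const uint16 = new Uint16Array(bytes.buffer, bytes.byteOffset, bytes.length >> 1);
  const floats = int16ToFloat32(uint16);
  aw.port.postMessage(floats, [floats.buffer]);
}
```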

-- 
GitHub Notification of comment by guest271314
Please view or discuss this issue at https://github.com/w3c/mediacapture-output/issues/102#issuecomment-691608499 using your GitHub account

