Re: [mst-content-hint] Differentiate between speech for human and machine consumption (#39) from Samuel Dallstream via GitHub on 2020-04-15 (public-webrtc-logs@w3.org from April 2020)

From: Samuel Dallstream via GitHub <sysbot+gh@w3.org>
Date: Wed, 15 Apr 2020 19:20:23 +0000
To: public-webrtc-logs@w3.org
Message-ID: <issue_comment.created-614231806-1586978422-sysbot+gh@w3.org>

@guest271314 I agree with your point that there is a need to add a stream form of output for the Speech-API. I think there are some limitations with the current implementations that make this difficult (the OS speech generator is the thing speaking words to you instead of the Web Platform, so the browser currently has no access to anything like an audio stream). Fixing this would require interest from the web community (and there seems to be some), interest from implementers (not sure about this), and a solid technical path forward. Windows provides an audio stream in some of its newer APIs, which might be usable for this scenario... but I'm not sure about other platforms. Maybe it would be possible cross-platform with Chromium's chrome.ttsEngine API, which allows developers to create their own TTS engines.

However, I think that this is off topic for the issue that this thread is based on. The issue above points out a problem that developers are facing: they can currently mark a stream for "speech" in the sense that it is being used for communications, but they would also like to mark a stream as being used for speech recognition, so that consumers, or even the platform, can make appropriate adjustments for that use case.
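
To make the request concrete, here is a rough sketch of what marking a track for each use case could look like. `contentHint` and the `"speech"` value are the existing mechanism; the `"speech-recognition"` value is the one being asked for in this issue and is an assumption here, as are the helper name `markAudioTrack` and the `HintableTrack` type, which are purely illustrative.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
// A real MediaStreamTrack has both of these properties.
interface HintableTrack {
  kind: string;
  contentHint: string;
}

type AudioUseCase = "communication" | "recognition";

// Hypothetical helper: tags an audio track with the hint for its use case.
// "speech" is the existing hint for human-to-human communication.
// "speech-recognition" is the value discussed in this issue (assumed here).
function markAudioTrack(track: HintableTrack, useCase: AudioUseCase): string {
  if (track.kind !== "audio") {
    throw new Error("expected an audio track");
  }
  track.contentHint = useCase === "recognition" ? "speech-recognition" : "speech";
  // Return what the track now reports so the caller can check it.
  return track.contentHint;
}
```

In a page this would be called with a track from `stream.getAudioTracks()`. Checking the returned value is a reasonable precaution, since a platform that does not know a hint may not keep it.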

Appropriate adjustments for speech recognition include anything that will increase the precision and accuracy of speech recognition machines/services. There has been some effort to standardize which scenarios should be optimized with regard to speech recognition, and there are well-established standards for communications. If you look through the requirements of both, you will find differences between them that are at odds (one example is that communications combine adding pleasant background noise with noise suppression, which is at odds with the goal of signal preservation for most speech recognition engines).
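
As an illustration of the noise suppression example, an application can already make part of this adjustment by hand when it captures audio. The sketch below picks the standard `noiseSuppression` capture constraint per use case. The helper name `audioConstraintsFor` is hypothetical, and the assumption that a given recognition engine prefers the unprocessed signal should be checked against that engine's requirements.

```typescript
// Hypothetical helper: chooses a capture constraint per use case.
// Only noiseSuppression is set, since that is the conflict described above.
function audioConstraintsFor(
  useCase: "communication" | "recognition"
): { noiseSuppression: boolean } {
  if (useCase === "recognition") {
    // Assumption: the recognizer wants the signal preserved as captured.
    return { noiseSuppression: false };
  }
  // Human listeners generally benefit from suppression.
  return { noiseSuppression: true };
}
```

In a browser the result would be passed as `navigator.mediaDevices.getUserMedia({ audio: audioConstraintsFor("recognition") })`. A content hint would let the platform make this kind of choice itself instead of each application hard-coding it.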

Here are links to standards documents that illustrate some of the differences between the two use cases: [Communication](https://www.etsi.org/deliver/etsi_ts/126100_126199/126131/12.03.00_60/ts_126131v120300p.pdf), [Speech Recognition](https://drive.google.com/file/d/1y_i7NkXbCuRWznYRl9dacy3xDdH2e7-m/view?usp=sharing)

-- 
GitHub Notification of comment by sjdallst
Please view or discuss this issue at https://github.com/w3c/mst-content-hint/issues/39#issuecomment-614231806 using your GitHub account

Received on Wednesday, 15 April 2020 19:20:26 UTC