Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation from Adam Sobieski on 2018-09-15 (public-speech-api@w3.org from September 2018)

From: Adam Sobieski <adamsobieski@hotmail.com>
Date: Sat, 15 Sep 2018 06:35:43 +0000
To: "public-speech-api@w3.org" <public-speech-api@w3.org>
Message-ID: <CY4PR0101MB309564A2D9B2B655760DC48DC5180@CY4PR0101MB3095.prod.exchangelabs.com>

Introduction

We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.

Advancing the State of the Art
Speech Recognition

Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity is possible when round-tripping speech audio through speech recognition and synthesis components or services.
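
To illustrate the round-tripping point, here is a minimal TypeScript sketch. The `RecognizedSegment` shape and the `segmentsToSsml` and `segmentsToText` functions are hypothetical and not part of the Web Speech API; they assume a recognizer that can report emphasis and pause durations alongside the words. Serializing those annotations as SSML keeps prosody that a plain transcript would drop.

```typescript
// Hypothetical shape for one recognized span of speech (not in the Web Speech API).
interface RecognizedSegment {
  text: string;
  emphasized?: boolean;  // assumed: the recognizer detected vocal stress
  pauseAfterMs?: number; // assumed: silence measured after this span
}

// Escape characters that are special in XML text content.
function escapeXml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Serialize segments as an SSML document, keeping emphasis and pauses.
function segmentsToSsml(segments: RecognizedSegment[], lang: string): string {
  const body = segments
    .map((seg) => {
      const words = escapeXml(seg.text);
      const spoken = seg.emphasized ? `<emphasis>${words}</emphasis>` : words;
      const pause = seg.pauseAfterMs ? `<break time="${seg.pauseAfterMs}ms"/>` : "";
      return spoken + pause;
    })
    .join(" ");
  return (
    `<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" ` +
    `xml:lang="${lang}">${body}</speak>`
  );
}

// The same utterance as plain text loses both annotations.
function segmentsToText(segments: RecognizedSegment[]): string {
  return segments.map((seg) => seg.text).join(" ");
}
```

The current specification already allows the `text` attribute of `SpeechSynthesisUtterance` to hold a well-formed SSML document, so output like this could in principle be passed back to synthesis, although engine support for SSML varies.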

Speech Synthesis

Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech<https://github.com/w3c/speech-api/issues/36>.

Translation

Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.
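
The combinations described above can be sketched as a small type model. All names here (`Format`, `TranslationRequest`, `allFormatPairings`) are hypothetical and only illustrate how many source/target pairings an API covering these scenarios would need to describe.

```typescript
// The four representations named above.
type Format = "text" | "ssml" | "hypertext" | "audio";
const FORMATS: Format[] = ["text", "ssml", "hypertext", "audio"];

// Hypothetical request descriptor; field names are illustrative only.
interface TranslationRequest {
  sourceLang: string; // language tag such as "en-US"
  targetLang: string;
  sourceFormat: Format;
  targetFormat: Format;
}

// Enumerate every source/target format pairing for one language pair.
function allFormatPairings(sourceLang: string, targetLang: string): TranslationRequest[] {
  const out: TranslationRequest[] = [];
  for (const sourceFormat of FORMATS) {
    for (const targetFormat of FORMATS) {
      out.push({ sourceLang, targetLang, sourceFormat, targetFormat });
    }
  }
  return out;
}
```

Four formats on each side give sixteen pairings per language pair. Audio-to-text, for example, corresponds to translated subtitles, and audio-to-audio to a translated audio track.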

Desirable features include interoperability between WebRTC<https://www.w3.org/TR/webrtc/> and client-side, server-side and third-party translation, with translations available as subtitles or audio tracks.

Multimodal Dialogue Systems

Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.

Client-side Scenarios
Client-side Speech Recognition

These scenarios are considered in the current version of the Web Speech API.

Client-side Speech Synthesis

These scenarios are considered in the current version of the Web Speech API.

Client-side Translation

These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.

Server-side Scenarios
Server-side Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.
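
A minimal sketch of the client half of such a scenario follows. It is transport-agnostic so that it runs outside a browser; in a page, the `RecognitionTransport` could wrap a WebSocket or a WebRTC data channel. The message shapes (`start`, `end`, `returnResults`) and all names are assumptions made for illustration, not a proposed wire protocol.

```typescript
// Minimal transport abstraction (hypothetical) so the sketch needs no browser APIs.
interface RecognitionTransport {
  send(data: string | Uint8Array): void;
}

// Send a start message, the audio in fixed-size frames, then an end message.
// Returns the number of audio frames sent.
function streamForRecognition(
  audio: Uint8Array,
  frameBytes: number,
  lang: string,
  returnResults: boolean, // whether the server should send results back to the client
  transport: RecognitionTransport
): number {
  if (frameBytes <= 0) {
    throw new Error("frameBytes must be positive");
  }
  transport.send(JSON.stringify({ type: "start", lang, returnResults }));
  let frames = 0;
  for (let offset = 0; offset < audio.length; offset += frameBytes) {
    // subarray clamps the end index, so the last frame may be shorter.
    transport.send(audio.subarray(offset, offset + frameBytes));
    frames++;
  }
  // Signals end of audio so the server can finalize its results.
  transport.send(JSON.stringify({ type: "end" }));
  return frames;
}
```

The `returnResults` flag reflects the "optionally" in the paragraph above: a server might consume the recognition results itself, for instance for captioning or logging, without returning them.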

Server-side Speech Synthesis

These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.
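
One way a client might label the three input kinds is by media type. The sketch below builds a request descriptor only; the endpoint, the `SynthesisInput` and `SynthesisRequest` shapes, and the choice of `audio/ogg` as the requested audio format are assumptions for illustration. The media type `application/ssml+xml` is the one registered for SSML.

```typescript
// The three input kinds named above, as a tagged union (hypothetical shape).
type SynthesisInput =
  | { kind: "text"; content: string }
  | { kind: "ssml"; content: string }
  | { kind: "hypertext"; content: string };

// Media type used to label each input kind.
const MEDIA_TYPES = {
  text: "text/plain",
  ssml: "application/ssml+xml",
  hypertext: "text/html",
};

// Plain description of an HTTP request; nothing is sent by this sketch.
interface SynthesisRequest {
  url: string;
  method: "POST";
  headers: { [name: string]: string };
  body: string;
}

function buildSynthesisRequest(
  endpoint: string, // hypothetical server URL supplied by the caller
  input: SynthesisInput,
  lang: string
): SynthesisRequest {
  return {
    url: endpoint,
    method: "POST",
    headers: {
      "Content-Type": `${MEDIA_TYPES[input.kind]}; charset=utf-8`,
      "Content-Language": lang,
      Accept: "audio/ogg", // assumed audio format for the streamed response
    },
    body: input.content,
  };
}
```

In a browser, a descriptor like this could be handed to `fetch`, with the response body read incrementally as the server streams audio.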

Server-side Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio.

Third-party Scenarios
Third-party Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition and provides speech recognition results to the client or server.

Third-party Speech Synthesis

These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.

Third-party Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.
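
The nine scenarios above form a three-by-three grid of task and placement. The sketch below enumerates that grid and marks which cells this message describes as new; the type and function names are hypothetical.

```typescript
type Task = "recognition" | "synthesis" | "translation";
type Placement = "client" | "server" | "third-party";

interface Scenario {
  task: Task;
  placement: Placement;
  isNew: boolean; // true if described above as new to the Web Speech API
}

function listScenarios(): Scenario[] {
  const tasks: Task[] = ["recognition", "synthesis", "translation"];
  const placements: Placement[] = ["client", "server", "third-party"];
  const out: Scenario[] = [];
  for (const placement of placements) {
    for (const task of tasks) {
      // Per the sections above, only client-side recognition and synthesis
      // are described as considered in the current version of the API.
      const considered = placement === "client" && task !== "translation";
      out.push({ task, placement, isNew: !considered });
    }
  }
  return out;
}
```

Of the nine cells, seven are marked new, which gives a rough sense of how much surface a next version would add.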

Hyperlinks

Amazon Web Services<https://aws.amazon.com/>

  *   Speech to Text<https://aws.amazon.com/transcribe/>
  *   Text to Speech<https://aws.amazon.com/polly/>
  *   Translation<https://aws.amazon.com/translate/>

Google Cloud AI<https://cloud.google.com/products/ai/>

  *   Speech to Text<https://cloud.google.com/speech-to-text/>
  *   Text to Speech<https://cloud.google.com/text-to-speech/>
  *   Translation<https://cloud.google.com/translate/>

IBM Watson Products and Services<https://www.ibm.com/watson/products-services/>

  *   Speech to Text<https://www.ibm.com/watson/services/speech-to-text/>
  *   Text to Speech<https://www.ibm.com/watson/services/text-to-speech/>
  *   Translation<https://www.ibm.com/watson/services/language-translator/>

Microsoft Cognitive Services<https://azure.microsoft.com/en-us/services/cognitive-services/>

  *   Speech to Text<https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/>
  *   Text to Speech<https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/>
  *   Translation<https://azure.microsoft.com/en-us/services/cognitive-services/speech-translation/>

Real Time Translation in WebRTC<https://www.youtube.com/watch?v=EPBWR_GNY9U>



Best regards,
Adam Sobieski

P.S.: https://github.com/w3c/speech-api/issues/41
Received on Saturday, 15 September 2018 06:36:07 UTC