From: Weston Ruter <westonruter@gmail.com>
Date: Sun, 13 Dec 2009 09:07:07 -0800
I blogged yesterday about this topic (including a text-to-speech demo using
HTML5 Audio and Google Translate's TTS service); the more relevant part for
this thread: <http://weston.ruter.net/projects/google-tts/>

> I am really excited at the prospect of text-to-speech being made available
> on the Web! It's just too bad that fetching MP3s from a remote web service
> is currently the only standard way of doing so; modern operating systems
> all have TTS capabilities, so it's a shame that web apps can't utilize
> them via client-side scripting. I posted to the WHATWG mailing list about
> such a Text-To-Speech (TTS) Web API for JavaScript, and I was directed to
> a recent thread about a Web API for speech recognition and synthesis.
>
> Perhaps there is some momentum building here? Having TTS available in the
> browser would boost accessibility for the seeing-impaired and improve
> usability for people on-the-go. TTS is just another technology that has
> traditionally been relegated to desktop applications, but as the open Web
> advances as the preferred platform for application development, it is an
> essential service to make available (as with the Geolocation API, Device
> API, etc.). And besides, I want to build TTS applications, and my motto
> is: "If it can't be done on the open web, it's not worth doing at all"!
>
> http://weston.ruter.net/projects/google-tts/

Weston

On Fri, Dec 11, 2009 at 1:35 PM, Weston Ruter <westonruter at gmail.com> wrote:

> I was just alerted about this thread from my post "Text-To-Speech (TTS)
> Web API for JavaScript" at
> <http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-December/024453.html>.
> Amazing how shared ideas like these seem to arise independently at the
> same time.
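The demo linked above drives HTML5 Audio with MP3s fetched from Google Translate's TTS service. As a minimal sketch of that approach, here is how such a request URL might be built client-side; the endpoint path and the `tl`/`q` parameter names are assumptions based on how the service worked at the time, not something specified in this thread:

```javascript
// Build a URL for a remote TTS service and hand it to HTML5 Audio.
// Endpoint and parameter names are illustrative assumptions.
function buildTtsUrl(text, lang) {
  return "http://translate.google.com/translate_tts" +
         "?tl=" + encodeURIComponent(lang) +
         "&q=" + encodeURIComponent(text);
}

// In a browser, the resulting URL could then be played directly:
//   var audio = new Audio(buildTtsUrl("Hello, World!", "en"));
//   audio.play();
```

Because the text is percent-encoded with `encodeURIComponent`, punctuation and spaces survive the round trip to the service intact.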
> I have a use-case and an additional requirement: that the time indices be
> made available for when each word is spoken in the TTS-generated audio:
>
>> I've been working on a web app which reads text in a web page,
>> highlighting each word as it is read. For this to be possible, a
>> Text-To-Speech API is needed which is able to:
>> (1) generate the speech audio from some text, and
>> (2) include the time indices for when each of the words in the text is
>> spoken.
>
> I foresee that a TTS API should integrate closely with the HTML5 Audio
> API. For example, invoking a call to the API could return a "TTS" object
> which has an instance of Audio, whose interface could be used to navigate
> through the TTS output. For example:
>
>     var tts = new TextToSpeech("Hello, World!");
>     tts.audio.addEventListener("canplaythrough", function(e){
>         //tts.indices == [{startTime:0, endTime:500, text:"Hello"},
>         //                {startTime:500, endTime:1000, text:"World"}]
>     }, false);
>     tts.read(); // invokes tts.audio.play()
>
> What would be even cooler is if the parameter passed to the TextToSpeech
> constructor could be an Element or TextNode, and the indices would then
> include a DOM Range in addition to the "text" property. A flag could also
> be set which would result in each of these DOM Ranges being selected when
> it is read. For example:
>
>     var tts = new TextToSpeech(document.querySelector("article"));
>     tts.selectRangesOnRead = true;
>     tts.audio.addEventListener("canplaythrough", function(e){
>         /*
>         tts.indices == [
>             {startTime:0, endTime:500, text:"Hello", range:Range},
>             {startTime:500, endTime:1000, text:"World", range:Range}
>         ]
>         */
>     }, false);
>     tts.read();
>
> In addition to the events fired by the Audio API, more events could be
> fired when reading TTS, such as a "readrange" event whose event object
> would include the index (startTime, endTime, text, range) for the range
> currently being spoken.
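As a rough, non-normative sketch of the `tts.indices` data structure proposed in the quoted message: the `TextToSpeech` object itself is hypothetical, but the array of word timings can be modeled with a plain helper, assuming the engine reports a duration for each word:

```javascript
// Build an indices array like the one sketched above from a list of
// words and per-word durations in milliseconds. Hypothetical helper;
// a real TTS engine would report these timings itself.
function buildIndices(words, durations) {
  var indices = [];
  var t = 0;
  for (var i = 0; i < words.length; i++) {
    indices.push({ startTime: t, endTime: t + durations[i], text: words[i] });
    t += durations[i];
  }
  return indices;
}

// buildIndices(["Hello", "World"], [500, 500]) yields the structure from
// the "Hello, World!" example:
//   [{startTime:0, endTime:500, text:"Hello"},
//    {startTime:500, endTime:1000, text:"World"}]
```

Each entry's `endTime` becomes the next entry's `startTime`, which is what lets a "readrange" handler highlight exactly one word at any playback position.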
> Such functionality would make the ability to "read along" with the text
> trivial.
>
> What do you think?
> Weston
>
> On Thu, Dec 3, 2009 at 4:06 AM, Bjorn Bringert <bringert at google.com> wrote:
>
>> On Wed, Dec 2, 2009 at 10:20 PM, Jonas Sicking <jonas at sicking.cc> wrote:
>>> On Wed, Dec 2, 2009 at 11:17 AM, Bjorn Bringert <bringert at google.com> wrote:
>>>> I agree that being able to capture and upload audio to a server would
>>>> be useful for a lot of applications, and it could be used to do speech
>>>> recognition. However, for a web app developer who just wants to
>>>> develop an application that uses speech input and/or output, it
>>>> doesn't seem very convenient, since it requires server-side
>>>> infrastructure that is very costly to develop and run. A
>>>> speech-specific API in the browser gives browser implementors the
>>>> option to use on-device speech services provided by the OS, or
>>>> server-side speech synthesis/recognition.
>>>
>>> Again, it would help a lot if you could provide use cases and
>>> requirements. This helps both with designing an API and with evaluating
>>> whether the use cases are common enough that a dedicated API is the
>>> best solution.
>>>
>>> / Jonas
>>
>> I'm mostly thinking about speech web apps for mobile devices. I think
>> that's where speech makes the most sense as an input and output method,
>> because of the poor keyboards, small screens, and frequent hands/eyes-busy
>> situations (e.g. while driving). Accessibility is the other big reason
>> for using speech.
>>
>> Some ideas for use cases:
>>
>> - Search by speaking a query
>> - Speech-to-speech translation
>> - Voice dialing (could open a tel: URI to actually make the call)
>> - Dialog systems (e.g. the canonical pizza-ordering system)
>> - Lightweight JavaScript browser extensions (e.g. Greasemonkey /
>>   Chrome extensions) for using speech with any web site, e.g. for
>>   accessibility.
>> Requirements:
>>
>> - Web app developer side:
>>   - Allows both speech recognition and synthesis.
>>   - Easy-to-use API. Makes simple things easy and advanced things
>>     possible.
>>   - Doesn't require the web app developer to develop / run his own
>>     speech recognition / synthesis servers.
>>   - (Natural) language-neutral API.
>>   - Allows developer-defined, application-specific grammars / language
>>     models.
>>   - Allows multilingual applications.
>>   - Allows easy localization of speech apps.
>>
>> - Implementor side:
>>   - Easy enough to implement that it can get wide adoption in browsers.
>>   - Allows the implementor to use either client-side or server-side
>>     recognition and synthesis.
>>
>> --
>> Bjorn Bringert
>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>> Palace Road, London, SW1W 9TQ
>> Registered in England Number: 3977902

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.whatwg.org/pipermail/whatwg-whatwg.org/attachments/20091213/9e7bbbac/attachment-0001.htm>
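The "developer-defined grammars" requirement above can be as small as matching recognizer output against a fixed phrase list per slot. A minimal sketch, where the grammar format and matching strategy are illustrative assumptions rather than anything proposed in the thread:

```javascript
// Match a recognized utterance against a tiny developer-defined grammar:
// a map from slot names to accepted phrases. Returns the filled slots,
// or null if any slot goes unmatched. Illustrative only; a real speech
// API would likely accept a standard grammar format instead.
function matchGrammar(utterance, grammar) {
  var text = utterance.toLowerCase();
  var result = {};
  for (var slot in grammar) {
    var matched = null;
    for (var i = 0; i < grammar[slot].length; i++) {
      if (text.indexOf(grammar[slot][i]) !== -1) {
        matched = grammar[slot][i];
        break;
      }
    }
    if (matched === null) return null; // required slot missing
    result[slot] = matched;
  }
  return result;
}

// e.g. the canonical pizza-ordering dialog mentioned above:
//   matchGrammar("a large pepperoni please", {
//     size: ["small", "medium", "large"],
//     topping: ["cheese", "pepperoni", "mushroom"]
//   })
//   -> { size: "large", topping: "pepperoni" }
```

Substring matching keeps the sketch short; it also shows why real grammars (word-boundary aware, with alternatives and optional items) are listed as a requirement rather than left to each app.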
Received on Sunday, 13 December 2009 09:07:07 UTC