Copyright © 2010 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This is a proposal for adding support for speech synthesis to HTML.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document is an API proposal from Google Inc. to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives).
All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]
Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.
Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)
User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]
This section is non-normative.
The HTML Text to Speech API aims to provide web developers with programmatic access to speech synthesis and playback. The API itself is agnostic of the underlying speech synthesizer implementation and can support both server-based and embedded synthesizers.
The API consists of a new element, tts, with a corresponding DOM interface, HTMLTtsElement. Like the existing audio and video elements, the new tts element extends HTMLMediaElement. As with the audio element, the playback of synthesized speech can be controlled with a playback UI or by scripting. The text to synthesize can be specified in plain text or in SSML.
All of the above should be easily extended to work in multiple languages.
Some of these use cases require speech recognition as well as speech synthesis. See HTML Speech Input API for a proposed API for speech recognition.
The following code extracts illustrate how to use speech synthesis in various cases:
Speak a fixed string automatically when the page loads:

<tts autoplay value="hello world"></tts>
Speak text entered by the user, synthesized in Spanish:

<form>
  <input name="t" type="text">
  <input type="button" value="Speak"
         onclick="var tts = document.getElementById('say');
                  tts.value = this.form.t.value;
                  tts.play()">
</form>
<tts id="say" lang="es"></tts>
Highlight each line of a poem as it is spoken, using SSML marks, the playback controls UI, and the timeupdate event:

<style type="text/css">
  .current { background-color: yellow; }
</style>
<script type="text/javascript">
  var prevLine = null;
  function highlight(event) {
    var mark = event.target.lastMark;
    var line = document.getElementById(mark);
    line.className = "current";
    if (prevLine) {
      prevLine.className = "";
    }
    prevLine = line;
  }
</script>
<blockquote>
  <span id="l1">Beware the Jabberwock, my son!</span><br>
  <span id="l2">The jaws that bite, the claws that catch!</span><br>
  <span id="l3">Beware the Jubjub bird, and shun</span><br>
  <span id="l4">The frumious Bandersnatch!</span>
</blockquote>
<tts id="out" src="text.ssml" controls ontimeupdate="highlight(event)"></tts>
text.ssml:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s><mark name="l1" />Beware the Jabberwock, my son!</s>
  <s><mark name="l2" />The jaws that bite, the claws that catch!</s>
  <s><mark name="l3" />Beware the Jubjub bird, and shun
     <mark name="l4" />the frumious Bandersnatch!</s>
</speak>
This section is non-normative.
This specification is limited to adding a new HTML element for speech synthesis.
The scope of this specification does not include providing a new markup language of any kind.
The scope of this specification does not include interfacing with telephony systems of any kind.
tts HTML element

This API adds a new tts element that extends HTMLMediaElement.
interface HTMLTtsElement : HTMLMediaElement {
  attribute DOMString value;
  attribute DOMString lastMark;
};
The content of the HTMLTtsElement is the data to be given as input to the speech synthesizer.
The new value attribute sets the content of the HTMLTtsElement to the plain text value of the attribute.
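For example, a script might create the element, set its text, and start playback. A minimal sketch, assuming a user agent in which document.createElement("tts") produces an HTMLTtsElement:

var tts = document.createElement("tts");
tts.value = "Synthesized from plain text"; // sets the element content to this plain text
document.body.appendChild(tts);
tts.play(); // inherited from HTMLMediaElement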
The new lastMark attribute must, on getting, return the name of the last SSML mark element that was encountered during playback. If no mark has been encountered yet, the attribute must return null.
On setting, the lastMark attribute must seek playback to the position of the SSML mark element that the new value refers to. If the content does not include a mark element with the new value as its name, an exception must be raised.
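A short sketch of both directions, reusing the tts element with id "out" and the marks l1–l4 from the SSML example above:

var tts = document.getElementById("out");
var name = tts.lastMark; // name of the last mark passed, or null before any mark
tts.lastMark = "l3";     // seeks playback to the mark named "l3"
// tts.lastMark = "l9";  // no mark with this name in the content: an exception is raised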
This section describes how some existing attributes of HTMLMediaElement should be interpreted when used on HTMLTtsElement.
The src attribute contains the URI of a document whose contents should override the content of the HTMLTtsElement. Implementations should support at least UTF-8 encoded text/plain and application/ssml+xml.
If the src attribute is not set, or is set to a URI that does not reference a valid document that the user agent can use as input to the speech synthesizer, the value of the value attribute should be used instead. If value is not set either, the tts element has no content and playback should produce no audio.
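For instance (a sketch; greeting.ssml is a hypothetical URI):

<!-- Speaks the contents of greeting.ssml if it is a valid SSML document;
     otherwise the value attribute is synthesized instead. If neither is
     usable, playback produces no audio. -->
<tts src="greeting.ssml" value="Hello from the fallback text" autoplay></tts>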
The lang attribute, if present, sets the language in which speech should be synthesized. If this attribute is not set, the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document. If the value to be synthesized is SSML, any language attributes in the SSML document override any language attributes in the HTML document.
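For example, in the following sketch the tts element carries no lang attribute of its own, so the language falls back to that of the closest ancestor:

<div lang="fr">
  <!-- No lang attribute on the tts element: French is inherited from the div. -->
  <tts value="Bonjour tout le monde" autoplay></tts>
</div>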
All other HTMLMediaElement attributes work in the same way as for HTMLAudioElements, including autoplay, loop, etc.
The existing timeupdate event is dispatched to report progress through the synthesized speech. If the value is SSML, timeupdate events should be fired for each mark element that is encountered.
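A listener can combine timeupdate with lastMark to follow progress through SSML content (a sketch, reusing the element with id "out" from the earlier example):

var tts = document.getElementById("out");
tts.addEventListener("timeupdate", function(event) {
  // With SSML input, this fires at each mark element;
  // lastMark names the mark that was just passed.
  console.log("Reached mark: " + tts.lastMark);
}, false);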
All other HTMLMediaElement events work in the same way as for HTMLAudioElements, including play, ended, error, etc.
All HTMLMediaElement methods work in the same way as for HTMLAudioElements, including play() and pause().
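As with an audio element, playback can be driven entirely from script; a minimal sketch, reusing the element with id "say" from the form example:

var tts = document.getElementById("say");
tts.play();        // start or resume synthesis
setTimeout(function() {
  tts.pause();     // suspend playback; a later play() resumes it
}, 2000);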
A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "TTS" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.
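For example, a page can detect support before relying on the element:

if (document.implementation.hasFeature("TTS", "1.0")) {
  // Speech synthesis is supported; the tts element can be used.
} else {
  // Fall back, e.g. by displaying the text instead of speaking it.
}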
Since the tts element does not have any child elements, the element should not be displayed in UAs that don't support speech synthesis.
Satish Sampath, Dave Burke, Andrei Popescu, Jeremy Orlow