HTML Text to Speech (TTS) API Specification

Editor's Draft 28 October 2010

Latest Editor's Draft: http://dev.w3.org/...
Editors: Bjorn Bringert, Google Inc.

Abstract

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

The HTML Text to Speech API aims to provide web developers with programmatic access to speech synthesis and playback. The API itself is agnostic of the underlying speech synthesizer implementation and can support both server-based and embedded synthesizers.

The API consists of a new element, tts, with a corresponding DOM interface, HTMLTtsElement. Like the existing audio and video elements, the new tts element extends HTMLMediaElement. As with the audio element, the playback of synthesized speech can be controlled with a playback UI or by scripting. The text to synthesize can be specified as plain text or as SSML.

Use Cases

Some of these use cases require speech recognition as well as speech synthesis. See HTML Speech Input API for a proposed API for speech recognition.

Examples

The following code extracts illustrate how to use speech synthesis in various cases:

Hello World

    <tts autoplay value="hello world">

Behavior

"hello world" is spoken when the page has loaded.
In browsers that don't support TTS, the text "hello world" is displayed.

Speak Spanish text typed by the user

    <form>
      <input name="t" type="text">
      <input type="button" value="Speak" onclick="var tts = document.getElementById('say'); tts.value = this.form.t.value; tts.play()" >
    </form>
    <tts id="say" lang="es">

Behavior

The text typed in the input field is spoken in Spanish when the button is pressed.

Read out text, highlighting current word

    <style type="text/css">
      .current { background-color: yellow; }
    </style>
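    <script type="text/javascript">
      var prevLine = null;
      // Highlight the line whose id matches the last SSML mark reached.
      function highlight(event) {
        var mark = event.target.lastMark;
        var line = mark ? document.getElementById(mark) : null;
        // Ignore updates with no matching line or no change of line.
        if (!line || line === prevLine) { return; }
        if (prevLine) { prevLine.className = ""; }
        line.className = "current";
        prevLine = line;
      }
    </script>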
    <blockquote><span id="l1">Beware the Jabberwock, my son!</span><br>
       <span id="l2">The jaws that bite, the claws that catch!</span><br>
       <span id="l3">Beware the Jubjub bird, and shun</span><br>
       <span id="l4">The frumious Bandersnatch!</span></blockquote>

    <tts id="out" src="text.ssml" controls ontimeupdate="highlight(event)">

text.ssml:

     <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
       <s><mark name="l1" />Beware the Jabberwock, my son!</s>
       <s><mark name="l2" />The jaws that bite, the claws that catch!</s>
       <s><mark name="l3" />Beware the Jubjub bird, and shun <mark name="l4" />the frumious Bandersnatch!</s>
     </speak>

Behavior

The TTS element shows playback controls.
When play is pressed, the synthesized speech is played back.
When a new line starts to play back, that line is highlighted.

3 Scope

This specification is limited to adding a new HTML element for speech synthesis.

The scope of this specification does not include providing a new markup language of any kind.

The scope of this specification does not include interfacing with telephony systems of any kind.

4 API Description

4.1 The tts HTML element

The content of the HTMLTtsElement is the data to be given as input to the speech synthesizer.

The new value attribute sets the content of the HTMLTtsElement to the plain text value of the attribute.

The new lastMark attribute contains the name of the last SSML mark element that was encountered during playback.

4.1.1 Notes about existing attributes, events and methods

This section describes how some existing attributes of HTMLMediaElement should be interpreted when used on HTMLTtsElement.

The src attribute contains the URI of a document whose contents should override the content of the HTMLTtsElement. Implementations should support at least UTF-8 encoded text/plain and application/ssml+xml.

If the src attribute is not set, or is set to a URI that does not reference a valid document that the user agent can use as input to the speech synthesizer, the value of the value attribute should be used instead. If value is not set either, the TTS element has no content and playback should produce no audio.
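
The fallback order described above can be sketched as follows. This example is non-normative. The function name resolveTtsInput, the fetchedDocument parameter and its type and body fields are hypothetical and are not defined by this specification. The sketch also assumes that the user agent accepts only the two media types listed above, although an implementation may accept more.

```javascript
// Non-normative sketch of the src -> value -> no-content fallback.
// All names are hypothetical; only the fallback order comes from the text.
// Assumption: the user agent accepts only these two media types.
const SUPPORTED_TYPES = ["text/plain", "application/ssml+xml"];

function resolveTtsInput(fetchedDocument, valueAttr) {
  // fetchedDocument is { type, body } as obtained from src, or null when
  // src is unset or does not reference a usable document.
  if (fetchedDocument && SUPPORTED_TYPES.includes(fetchedDocument.type)) {
    return { type: fetchedDocument.type, body: fetchedDocument.body };
  }
  // Otherwise fall back to the plain text of the value attribute.
  if (typeof valueAttr === "string") {
    return { type: "text/plain", body: valueAttr };
  }
  // Neither src nor value is usable: no content, so no audio is produced.
  return null;
}
```

A null result corresponds to the case in which the TTS element has no content.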

The lang attribute, if present, sets the language in which speech should be synthesized. If this attribute is not set, the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document. If the value to be synthesized is SSML, any language attributes in the SSML document override any language attributes in the HTML document.
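
The lookup order described above can be illustrated with a short, non-normative sketch. Elements are modelled as plain objects with hypothetical lang and parent fields, and the function name resolveSynthesisLanguage is likewise hypothetical. As a simplification, the sketch treats the SSML input as declaring at most one language, whereas a real SSML document can set a language on individual elements.

```javascript
// Non-normative sketch of the language lookup order.
// Elements are plain objects { lang, parent }; all names are hypothetical.
function resolveSynthesisLanguage(element, documentLang, ssmlLang) {
  // A language declared in the SSML input overrides any HTML lang attribute.
  if (ssmlLang) {
    return ssmlLang;
  }
  // Otherwise use the element's own lang, then that of the closest ancestor.
  for (let node = element; node; node = node.parent) {
    if (node.lang) {
      return node.lang;
    }
  }
  // Finally fall back to the language of the document.
  return documentLang;
}
```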

All other HTMLMediaElement attributes work in the same way as for HTMLAudioElements, including autoplay, loop etc.

The existing timeupdate event is dispatched to report progress through the synthesized speech. If the value is SSML, timeupdate events should be fired for each mark element that is encountered.
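
A script that only needs to react when a new mark is reached can compare lastMark between events. The following non-normative sketch assumes that timeupdate may also fire between marks, as it does for audio elements. The names createMarkTracker and onMarkChange are hypothetical.

```javascript
// Non-normative sketch: call onMarkChange only when lastMark changes.
// createMarkTracker and onMarkChange are hypothetical names.
function createMarkTracker(onMarkChange) {
  let previousMark = null;
  return function handleTimeUpdate(event) {
    const mark = event.target.lastMark;
    // Skip events with no mark yet and events that repeat the same mark.
    if (!mark || mark === previousMark) {
      return;
    }
    onMarkChange(mark, previousMark);
    previousMark = mark;
  };
}
```

The returned function takes the event object, so it could be registered as a timeupdate listener on a tts element.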

All other HTMLMediaElement events work in the same way as for HTMLAudioElements, including play, ended, error etc.

All HTMLMediaElement methods work in the same way as for HTMLAudioElements, including play(), pause().
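
Because these methods behave as they do for audio, generic media code can drive a tts element unchanged. The following non-normative sketch toggles playback using only the existing paused attribute and the play() and pause() methods of HTMLMediaElement. The function name togglePlayback and its return values are hypothetical.

```javascript
// Non-normative sketch: toggle playback on any HTMLMediaElement-like object.
// togglePlayback and its return values are hypothetical.
function togglePlayback(media) {
  if (media.paused) {
    media.play();
    return "playing"; // playback was requested
  }
  media.pause();
  return "paused"; // a pause was requested
}
```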

5 Backwards Compatibility

A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "TTS" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.
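
The feature test described above can be wrapped in a small helper. This sketch is non-normative. The function name supportsTts is hypothetical, and doc stands for any Document-like object (a script in a page would pass document).

```javascript
// Non-normative sketch of the feature test; supportsTts is a hypothetical name.
// doc is a Document-like object; a script in a page would pass `document`.
function supportsTts(doc) {
  const impl = doc && doc.implementation;
  // Guard against objects without DOMImplementation.hasFeature.
  if (!impl || typeof impl.hasFeature !== "function") {
    return false;
  }
  // "TTS" and "1.0" are the feature name and version given in this section.
  return Boolean(impl.hasFeature("TTS", "1.0"));
}
```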

Since the tts element does not have any child elements, the element should not be displayed in UAs that don't support speech synthesis.

Table of Contents

1 Conformance requirements

2 Introduction

Use Cases

Examples

Hello World

Speak Spanish text typed by the user

Read out text, highlighting current word

3 Scope

4 API Description

4.1 The tts HTML element

4.1.1 Notes about existing attributes, events and methods

5 Backwards Compatibility

Acknowledgments

References