W3C

HTML Text to Speech (TTS) API Specification

Editor's Draft 28 October 2010

Latest Editor's Draft:
http://dev.w3.org/...
Editors:
Bjorn Bringert, Google Inc.

Abstract

This is a proposal for adding support for speech synthesis to HTML.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document is an API proposal from Google Inc. to the HTML Speech Incubator Group. If you wish to make comments regarding this document, please send them to public-xg-htmlspeech@w3.org (subscribe, archives).

All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

1 Conformance requirements

All diagrams, examples, and notes in this specification are non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [RFC2119]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification, as this specification uses that specification's terminology. [WEBIDL]

2 Introduction

This section is non-normative.

The HTML Text to Speech API aims to provide web developers with programmatic access to speech synthesis and playback. The API itself is agnostic of the underlying speech synthesizer implementation and can support both server-based and embedded synthesizers.

The API consists of a new element, tts, with a corresponding DOM interface HTMLTtsElement. Like the existing audio and video elements, the new tts element extends HTMLMediaElement. As with the audio element, the playback of synthesized speech can be controlled with a playback UI or by scripting. The text to synthesize can be specified as plain text or as SSML.

Use Cases

Speech translation
The app works as an interpreter between two users that speak different languages.
Speech-enabled webmail client, e.g. for in-car use
Reads out e-mails and gives confirmations for commands processed such as "e-mail sent to Bob".
Turn-by-turn navigation
Speaks driving instructions, e.g. "in 500 meters, left turn on Buckingham Palace Road".
Dialog systems
For example, flight booking or pizza ordering.

It should be easy to extend all of the above to work in multiple languages.

Some of these use cases require speech recognition as well as speech synthesis. See HTML Speech Input API for a proposed API for speech recognition.

Examples

The following code extracts illustrate how to use speech synthesis in various cases:

Hello World

    <tts autoplay value="hello world">
   
Behavior
  1. "hello world" is spoken when the page has loaded.
  2. In browsers that don't support TTS, the text "hello world" is displayed.

Speak Spanish text typed by the user

    <form>
      <input name="t" type="text">
      <input type="button" value="Speak" onclick="var tts = document.getElementById('say'); tts.value = this.form.t.value; tts.play()" >
    </form>
    <tts id="say" lang="es">
   
Behavior
  1. The text typed in the input field is spoken in Spanish when the button is pressed.

Read out text, highlighting current word

    <style type="text/css">
      .current { background-color: yellow; }
    </style>
    <script type="text/javascript">
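      var prevLine = null;
      // Called on each timeupdate event; highlights the line whose id
      // matches the last SSML mark reached during playback.
      function highlight(event) {
        var mark = event.target.lastMark;
        var line = mark ? document.getElementById(mark) : null;
        // Ignore events before the first mark or within the same line.
        if (!line || line === prevLine) { return; }
        if (prevLine) { prevLine.className = ""; }
        line.className = "current";
        prevLine = line;
      }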
    </script>
    <blockquote><span id="l1">Beware the Jabberwock, my son!</span><br>
       <span id="l2">The jaws that bite, the claws that catch!</span><br>
       <span id="l3">Beware the Jubjub bird, and shun</span><br>
       <span id="l4">The frumious Bandersnatch!</span></blockquote>

    <tts id="out" src="text.ssml" controls ontimeupdate="highlight(event)">
   

text.ssml:

     <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
       <s><mark name="l1" />Beware the Jabberwock, my son!</s>
       <s><mark name="l2" />The jaws that bite, the claws that catch!</s>
       <s><mark name="l3" />Beware the Jubjub bird, and shun <mark name="l4" />the frumious Bandersnatch!</s>
     </speak>
   
Behavior
  1. The TTS element shows playback controls.
  2. When play is pressed, the synthesized speech is played back.
  3. When a new line starts to play back, that line is highlighted.

3 Scope

This section is non-normative.

This specification is limited to adding a new HTML element for speech synthesis.

The scope of this specification does not include providing a new markup language of any kind.

The scope of this specification does not include interfacing with telephony systems of any kind.

4 API Description

4.1 The tts HTML element

This API adds a new tts element that extends HTMLMediaElement.

  interface HTMLTtsElement : HTMLMediaElement {

            attribute DOMString value;
   readonly attribute DOMString lastMark;

  };
  

The content of the HTMLTtsElement is the data to be given as input to the speech synthesizer.

The new value attribute sets the content of the HTMLTtsElement to the plain text value of the attribute.

The new lastMark attribute contains the name of the last SSML mark element that was encountered during playback.
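
This example is non-normative. As a minimal sketch of how a script might consume lastMark, the helper below wraps a callback so that it runs only when the mark changes between timeupdate events. The name createMarkWatcher is hypothetical and not part of this proposal, and the sketch assumes that lastMark is empty or null before the first mark is reached, which this proposal does not specify.

```javascript
// Hypothetical helper (not part of the proposal): wraps a callback so that
// it runs only when the lastMark of the event target differs from the
// previously seen mark.
function createMarkWatcher(onChange) {
  let previous = null;
  return function handleTimeUpdate(event) {
    const mark = event.target.lastMark;
    // Assumption: lastMark is empty or null before any mark is reached.
    // Skip such events, and events fired while still in the same mark.
    if (!mark || mark === previous) {
      return;
    }
    const old = previous;
    previous = mark;
    onChange(mark, old);
  };
}
```

In a page, the returned function could be registered with addEventListener("timeupdate", ...) on a tts element, since HTMLTtsElement inherits event handling from HTMLMediaElement.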

4.1.1 Notes about existing attributes, events and methods

This section describes how some existing attributes of HTMLMediaElement should be interpreted when used on HTMLTtsElement.

The src attribute contains the URI of a document whose contents should override the content of the HTMLTtsElement. Implementations should support at least UTF-8 encoded text/plain and application/ssml+xml.

If the src attribute is not set, or is set to a URI that does not reference a valid document that the user agent can use as input to the speech synthesizer, the value of the value attribute should be used instead. If value is not set either, the TTS element has no content and playback should produce no audio.
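
This example is non-normative. The fallback order described above could be sketched as a small function. The name resolveTtsInput and the shape of its srcDocument argument (an object with contentType and body, standing in for whatever the user agent retrieved from src) are assumptions made for illustration only.

```javascript
// Media types that implementations should support at least.
const SUPPORTED_TYPES = ["text/plain", "application/ssml+xml"];

// Hypothetical helper (not part of the proposal): picks the synthesizer
// input for a tts element. `srcDocument` is null when src is unset or the
// document could not be retrieved; otherwise it is { contentType, body }.
function resolveTtsInput(srcDocument, value) {
  if (srcDocument) {
    // Ignore media type parameters such as "; charset=utf-8".
    const type = String(srcDocument.contentType || "")
      .split(";")[0]
      .trim()
      .toLowerCase();
    if (SUPPORTED_TYPES.includes(type)) {
      return { source: "src", type: type, data: srcDocument.body };
    }
  }
  if (value !== undefined && value !== null) {
    // The value attribute always holds plain text.
    return { source: "value", type: "text/plain", data: String(value) };
  }
  // Neither src nor value is usable: no content, so playback is silent.
  return null;
}
```

A user agent may accept more media types than the two listed here; the sketch only models the minimum that this section asks for.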

The lang attribute, if present, sets the language in which speech should be synthesized. If this attribute is not set, the implementation must fall back to the language of the closest ancestor that has a lang attribute, and finally to the language of the document. If the value to be synthesized is SSML, any language attributes in the SSML document override any language attributes in the HTML document.
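
This example is non-normative. The language fallback chain could be sketched as follows. The name resolveTtsLanguage is hypothetical, elements are modeled as plain objects with an optional lang string and a parent link, and the SSML input is reduced to a single optional language, whereas a real SSML document may declare xml:lang on nested elements.

```javascript
// Hypothetical helper (not part of the proposal): resolves the synthesis
// language for a tts element. `element` is any object with an optional
// `lang` string and a `parent` link (null or undefined at the root).
function resolveTtsLanguage(element, documentLang, ssmlLang) {
  // A language declared in the SSML input overrides the HTML document.
  if (ssmlLang) {
    return ssmlLang;
  }
  // Otherwise use the element's own lang, then the closest ancestor's.
  for (let node = element; node; node = node.parent) {
    if (node.lang) {
      return node.lang;
    }
  }
  // Finally fall back to the language of the document.
  return documentLang;
}
```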

All other HTMLMediaElement attributes work in the same way as for HTMLAudioElements, including autoplay, loop etc.

The existing timeupdate event is dispatched to report progress through the synthesized speech. If the value is SSML, timeupdate events should be fired for each mark element that is encountered.
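
This example is non-normative. To illustrate the order in which marks would be reported, the sketch below lists the mark names of an SSML document in document order, which is the sequence of lastMark values a script could expect to observe across timeupdate events. The name markNames is hypothetical, and the regular expression is only a shortcut for the sketch; it is not a substitute for a real XML parser.

```javascript
// Hypothetical helper (not part of the proposal): lists SSML mark names in
// document order, i.e. the successive values expected for lastMark.
function markNames(ssml) {
  // Matches <mark name="..."> or <mark name='...'>, optionally self-closed.
  const pattern = /<mark\s+name\s*=\s*(?:"([^"]*)"|'([^']*)')\s*\/?>/g;
  const names = [];
  let match;
  while ((match = pattern.exec(ssml)) !== null) {
    // Exactly one of the two capture groups is set, depending on the quotes.
    names.push(match[1] !== undefined ? match[1] : match[2]);
  }
  return names;
}
```

For the text.ssml document in the Examples section, this would yield the names l1, l2, l3 and l4, matching the ids of the spans to be highlighted.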

All other HTMLMediaElement events work in the same way as for HTMLAudioElements, including play, ended, error etc.

All HTMLMediaElement methods work in the same way as for HTMLAudioElements, including play(), pause().

5 Backwards Compatibility

A DOM application can use the hasFeature(feature, version) method of the DOMImplementation interface with parameter values "TTS" and "1.0" (respectively) to determine whether or not this module is supported by the implementation.
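
This example is non-normative. A feature check along these lines could be wrapped in a small helper. The name supportsTts is hypothetical; the helper takes the document as a parameter so that it can also be exercised against a stub outside a browser.

```javascript
// Hypothetical helper (not part of the proposal): reports whether the given
// document claims support for the "TTS" feature, version "1.0".
function supportsTts(doc) {
  const impl = doc && doc.implementation;
  // Treat a missing DOMImplementation or hasFeature method as unsupported.
  if (!impl || typeof impl.hasFeature !== "function") {
    return false;
  }
  return impl.hasFeature("TTS", "1.0") === true;
}
```

In a page, a script would call supportsTts(document) and, if it returns false, fall back to presenting the text visually.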

Since the tts element does not have any child elements, the element should not be displayed in UAs that don't support speech synthesis.

Acknowledgments

Satish Sampath, Dave Burke, Andrei Popescu, Jeremy Orlow

References

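[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels, Scott Bradner. Internet Engineering Task Force, March 1997. See http://www.ietf.org/rfc/rfc2119.txt
[RFC3066]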
Tags for the Identification of Languages, Harald Tveit Alvestrand. Internet Engineering Task Force, January 2001. See http://www.ietf.org/rfc/rfc3066.txt
[WEBIDL]
Web IDL, Cameron McCormack, Editor. World Wide Web Consortium, 19 December 2008. See http://dev.w3.org/2006/webapi/WebIDL/
[SSML]
Speech Synthesis Markup Language (SSML) Version 1.1, Daniel C. Burnett, Zhi Wei Shuang, Editors, W3C Recommendation, 7 September 2010.
[HTML5]
HTML5: A vocabulary and associated APIs for HTML and XHTML, Ian Hickson, Editor, W3C Editor's Draft, 25 October 2010.