HTML Speech Incubator Group
TTS Speech API Work Items
 (Internal Draft)

Note 21 July 2011

Editors:
Contributors:
Authors of related drafts:

Abstract

This document addresses the following work items related to the TTS Speech API:

  1. The API hooks related to actually doing a synthesis transaction.
  2. The synthesis events that are raised and the associated handlers and data (same caveat about timing as with work item 6).
  3. The API hooks for controlling synthesis, if any (pause, resume, play, etc.).

Status of this Document

To be incorporated into the HTML Speech Incubator Group Final Report (Internal Draft).


1 Terminology

The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL in this specification are to be interpreted as described in [IETF RFC 2119].

2 Overview

This document presents input for the deliverables of the HTML Speech Incubator Group.

3 Deliverables

According to the charter, the group is to produce one deliverable: this document. The charter goes on to state that the document may include requirements, use cases, and change requests to other existing standard specifications.

The group has developed requirements, some with use cases, and has made progress towards one or more API proposals that are effectively change requests to other existing standard specifications. These subdeliverables follow.

3.1 Prioritized Requirements

The HTML Speech Incubator Group developed and prioritized requirements as described in the Requirements and use cases document. A summary of the results is presented below with requirements listed in priority order, and segmented into those with strong interest, those with moderate interest, and those with mild interest. Each requirement is linked to its description in the requirements document.

3.5 Solution Design Agreements and Alternatives

This section attempts to capture the major design decisions the group made. In cases where substantial disagreement existed, the relevant alternatives are presented rather than a decision. Note that text went into this section only if it represented either group consensus or an accurate description of a specific alternative, as appropriate.

3.5.1 General Design Decisions

  1. There are three aspects to the solution which must be addressed: communication with and control of speech services, a script-level API, and markup-level hooks and capabilities.
  2. The script API will be Javascript.
  3. The scripting API is the primary focus, with all key functionality available via scripting. Any HTML markup capabilities, if present, will be based completely on the scripting capabilities.
  4. Notifications from the user agent to the web application should be in the form of Javascript events/callbacks.
  5. For TTS, there must be at least these two logical functions:
    1. play
    2. pause
    There is agreement that it should be possible to stop playback, but there is not agreement on the need for an explicit stop function. (These functions are illustrated in the sketch after this list.)
  6. It must be possible for a web application to specify the speech engine.
  7. Speech service implementations must be referenceable by URI.
  8. For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain text. These are the only mandated formats.
  9. There must be no technical restriction that would prevent using only TTS or only ASR.
  10. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.
  11. There will be a mandatory set of capabilities with stated limitations on interoperability.
  12. There should be a default user interface.
  13. We expect to have the following six audio/speech events: onaudiostart/onaudioend, onsoundstart/onsoundend, onspeechstart/onspeechend. The onsound* events represent a "probably speech but not sure" condition, while the onspeech* events represent the recognizer being sure there is speech; the former are low latency. An end event can only occur after at least one start event of the same type has occurred. Only the user agent can generate onaudio* events; the energy detector can generate only onsound* events; and the speech service can generate only onspeech* events.
  14. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  15. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  16. TTS: Support for streaming audio is required; in particular, TTS playback may begin before the synthesizer has finished synthesizing.
  17. We will require support for HTTP for all communication between the user agent and any selected engine, including chunked HTTP for media streaming, and we will support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).
  18. The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
  19. TTS: If a TTS engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values.
  20. We cannot expect consistency in language support and performance/quality.
  21. We agree that there must be API-level consistency regardless of user agent and engine.
  22. We agree on having the same level of consistency across all four of the following categories:
    1. consistency between different UAs using their default engine
    2. consistency between different UAs using web app specified engine
    3. consistency between different UAs using different web-app-specified engines
    4. consistency between default engine and specified engines
    With the exception that #4 may have limitations due to privacy issues.
  23. From this point on we will use "service" rather than "engine" because a service may be a proxy for more than one engine.
  24. We will not support selection of service by characteristics.
  25. Add to the list of expected inconsistencies (a change from the existing wording on interoperability): recognition performance, including maximum size of parameters; microphone characteristics; semantics and exact values of sensitivity and confidence; time needed to perform ASR/TTS; latencies; endpoint sensitivity and latency; result contents; presence/absence of optional events; recorded waveform.
  26. If the user's device is emitting sounds other than those produced by the current HTML page, there is no particular requirement that the user agent detect, reduce, or eliminate them.
  27. If a web app specifies a speech service and it is not available, an error is thrown. No automatic fallback to another service or the default service takes place.
  28. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
  29. The API must provide a way to query the availability of a specific configuration of a service.
  30. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent).
  31. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent).
  32. It must be possible for user permission to be granted at the level of specific web apps and/or speech services.
  33. User agents, acting on behalf of the user, may deny the use of specific web apps and/or speech services.
  34. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
  35. We disagree about whether there needs to be direct API support for a single ASR request and single TTS request that are tied together.
  36. It must be possible to individually control ASR and TTS.
  37. It must be possible for the web app author to get timely information about recognition event timing and about TTS playback timing. It must be possible for the web app author to determine, for any specific UA local time, what the previous TTS mark was and the offset from that mark.
  38. It must be possible for the web app to stop/pause/silence audio output directly at the client/user agent.
  39. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
  40. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
  41. Every message from UA to speech service should send the UA-local timestamp.
  42. The API must have the ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
  43. EMMA already permits app-specific result info, so there is no need to provide other ways for service-specific information to be returned in the result.
  44. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
  45. The protocol must send the UA's current timestamp to the speech service when it sends its first audio data.
  46. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
  47. HTTPS must also be supported.
  48. A web app loaded over a secure communication channel should be treated just as any other secured site (e.g., with respect to using a non-secured channel for speech data).
  49. Default speech service implementations are encouraged not to use unsecured network communication when started by a web app loaded over a secure communication channel.
  50. In Javascript, it will be possible to set parameters as dot properties and also via a getParameters method. The browser should also allow service-specific parameters to be set this way.
  51. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.
  52. There is no explicit need for a JSON format of EMMA, but we might use it if it existed.
  53. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.
  54. Protocol design should not prevent implementability of low-latency event delivery.
  55. The protocol should allow the client to begin TTS playback before all of the audio has been received.
  56. We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs.
  57. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.
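
The Javascript sketch below illustrates how several of the decisions above (notably 5, 39, 42, and 50) might fit together. It is illustrative only: the constructor name TTS, the properties uri and text, the onmark handler, and the parameter name "x-vendor-pitch" are hypothetical placeholders, not names the group has agreed on.

  // Illustrative sketch only; all names are hypothetical placeholders.
  var tts = new TTS();                        // hypothetical constructor
  tts.uri = "https://tts.example.org/synth";  // web-app-specified service, referenced by URI (decisions 6, 7)
  tts.text = "Hello world";                   // UTF-8 plain text is mandatory to support (decision 8)
  tts["x-vendor-pitch"] = 1.2;                // service-specific parameter with an "x-" prefix (decision 42)

  // When the audio for an SSML mark begins to play, an event fires carrying
  // the mark name and the UA timestamp of playback (decision 39).
  tts.onmark = function (event) {
    console.log("mark " + event.name + " played at " + event.timestamp);
  };

  // The two agreed logical functions (decision 5); an explicit stop function
  // is still an open question.
  tts.play();
  // ... later ...
  tts.pause();

The bracket notation for the service-specific parameter is only a consequence of the hyphenated "x-" name; standard parameters are shown as dot properties per decision 50, but the group has not specified the actual setting mechanism.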

3.5.2 Speech Service Communication and Control Design Decisions

This is where design decisions regarding control of and communication with remote speech services, including media negotiation and control, will be recorded.

3.5.3 Script API Design Decisions

This is where design decisions regarding the script API capabilities and realization will be recorded.

  • It must be possible to define at least the following handlers (names TBD; a sketch follows this list):
    • onspeechstart (not yet clear precisely what start of speech means)
    • onspeechend (not yet clear precisely what end of speech means)
    • onerror (one or more handlers for errors)
    • a handler for when the recognition result is available
    Note: significant work is needed to get interoperability here.
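
A minimal Javascript sketch of how these handlers might be registered follows. The object name SpeechReco, the variable reco, and the handler name onresult are hypothetical placeholders, and all handler names are explicitly TBD.

  // Illustrative sketch only; names are hypothetical placeholders, not agreed API.
  var reco = new SpeechReco();

  reco.onspeechstart = function (e) { /* start of speech detected (precise meaning TBD) */ };
  reco.onspeechend   = function (e) { /* end of speech detected (precise meaning TBD) */ };
  reco.onerror       = function (e) { /* an error occurred; e describes it */ };
  reco.onresult      = function (e) { /* the recognition result is available in e */ };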

3.5.4 Markup API Design Decisions

This is where design decisions regarding the markup changes and/or enhancements will be recorded.

3.6 Proposed Solutions

The following sections cover proposed solutions that this Incubator Group recommends. The proposed solutions represent the consensus of the group, except where it is clearly indicated that an impasse was reached.

3.6.1 Protocol Proposal

TBD

3.6.2 Web Application API Proposal

Based on the TTS section of the submission from Microsoft: a speech and tts proposal.

A References

IETF RFC 2119
RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. Internet Engineering Task Force, 1997. (See http://www.ietf.org/rfc/rfc2119.txt.)

B Glossary

The following glossary provides brief definitions of terms that may not be familiar to readers new to the technology domain of speech processing.

ASR
  Automatic Speech Recognition: the conversion of spoken audio into text or another structured representation.
barge-in
  The ability of the user to interrupt audio output, such as a TTS prompt, by beginning to speak before the output has finished.
endpointer/endpointing
  A component, or the process it performs, that detects where speech begins and ends within an audio stream.
SLM
  Statistical Language Model: a grammar that assigns probabilities to word sequences, typically trained from data and used to guide recognition.
TTS
  Text-to-Speech: the synthesis of spoken audio from plain text or marked-up text such as SSML.