W3C

HTML Speech XG: Protocol-related requirements and design decisions

Editor: Marc Schröder, DFKI

Status: Work in progress / Input to discussion in protocol subgroup

Date: 8 June 2011


Purpose of this document

This document aims to summarize requirements and design decisions relevant for the specification of a protocol supporting the communication between a User Agent (UA) and a Speech Service (SS). The summary is based on a subset of the requirements (FPR) and design decisions (DD) listed in the draft final report [HTMLSPEECH].

To allow the group to verify that its members share a view of what has been agreed, and to expose any obvious omissions that should be pinned down, this document attempts to group the items by aspects of the protocol's envisaged use.

Contents

  1. Relevant aspects of the interaction between UA and SS
  2. Generic protocol-related requirements
  3. UA->SS: Generic capability requests
  4. Recognition
  5. Synthesis
  References

1. Relevant aspects of the interaction between UA and SS

In order to structure the collection of requirements and design decisions, this document groups them according to the following aspects of the interaction between UA and SS.

  • UA->SS: Generic capability requests
  • Recognition
    • UA->SS: Initiating an ASR request
    • UA->SS: Sending audio and related data for recognition
    • UA->SS: Sending control commands
    • SS->UA: Sending recognition results
    • SS->UA: Sending relevant events
  • Synthesis
    • UA->SS: Initiating a TTS request and sending data for synthesis
    • UA->SS: Sending control commands
    • SS->UA: Sending synthesis audio
    • SS->UA: Sending relevant events

This is an ad-hoc structure which may or may not capture other group members' understanding of the mechanism. One reason for proposing it is to verify whether there is consensus about these aspects.

Requirements or design decisions are listed under more than one heading if they seem relevant to several aspects.

2. Generic protocol-related requirements

  • FPR55. Web application must be able to encrypt communications to remote speech service.
  • FPR31. User agents and speech services may agree to use alternate protocols for communication.
  • DD8. Speech service implementations must be referenceable by URI.
  • DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.
  • DD35. We will require support for HTTP for all communication between the user agent and any selected engine, including chunked HTTP for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).
  • DD38. The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
  • DD39. If an ASR engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values (see the sketch after this list).
  • DD56. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
  • DD62. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
  • DD64. API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
  • DD69. HTTPS must also be supported.
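
As a non-normative illustration of DD35, DD38, DD39, DD64 and DD69, the following JavaScript sketch shows one way a request with parameters in both the URI and a multipart POST body could be assembled. The service URI, the parameter names, and the use of FormData/XMLHttpRequest are assumptions made for the sketch, not agreed API or protocol details.

    // Sketch only: parameters in the URI act as defaults (DD39); parameters in
    // the multipart POST body (DD38) override them on the service side.
    function sendRecognitionParameters(serviceUri, uriDefaults, bodyParams) {
      var query = [];
      for (var name in uriDefaults) {
        query.push(encodeURIComponent(name) + "=" + encodeURIComponent(uriDefaults[name]));
      }
      var url = serviceUri + (query.length ? "?" + query.join("&") : "");

      // Serialized as multipart/form-data (the "formdata" subtype is one of the TBD options in DD38).
      var form = new FormData();
      for (var p in bodyParams) {
        form.append(p, bodyParams[p]);
      }

      var xhr = new XMLHttpRequest();
      xhr.open("POST", url);      // HTTP or HTTPS (DD35, DD69)
      xhr.send(form);
      return xhr;
    }

    // Hypothetical usage: "lang" is a URI default, overridden in the body;
    // the "x-" prefix marks a service-specific parameter (DD64).
    sendRecognitionParameters("https://asr.example.com/recognize",
                              { lang: "en-US" },
                              { lang: "de-DE", "x-example-confidence-threshold": 0.5 });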

3. UA->SS: Generic capability requests

  • FPR39. Web application must be able to be notified when the selected language is not available.
  • FPR11. If the web apps specify speech services, it should be possible to specify parameters.
  • DD49. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
  • DD50. The API must provide a way to query the availability of a specific configuration of a service.
  • DD51. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent).
  • DD52. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent); see the sketch after this list.
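
The following JavaScript sketch illustrates how the capability queries and the "no comment" answer of DD49-DD52 and FPR39 might surface to a web application. The object and method names (speechService, queryCapabilities) are placeholders invented for this sketch; the draft report does not define them.

    // Sketch only: query a service's capabilities before using it (DD49, DD50, DD51).
    function checkService(speechService) {
      speechService.queryCapabilities({ languages: true, voices: true },
        function (caps) {
          if (caps === "no comment") {
            // DD52: user consent was refused, so the UA withholds private details.
            return;
          }
          if (caps.languages.indexOf("sv-SE") === -1) {
            // FPR39: the web application learns that the selected language is not available.
            console.log("Swedish recognition is not available from this service.");
          }
        });
    }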

4. Recognition

4.1 UA->SS: Initiating an ASR request

4.2 UA->SS: Sending audio and related data for recognition

4.3 UA->SS: Sending control commands

  • FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.
  • DD46. For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result (see the sketch after this list).
  • DD63. Every message from UA to speech service should send the UA-local timestamp.
  • DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD.
  • DD76. It must be possible to do one or more re-recognitions with any request for which you have indicated, before first use, that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.
  • DD78. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.
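
The following JavaScript sketch shows hypothetical control messages a UA might send during continuous recognition. The message types and field names are invented for the sketch; only the obligations they illustrate (FPR59, DD46, DD63, DD76) come from the report.

    // Sketch only: every UA->SS message carries a UA-local timestamp (DD63).
    function makeControlMessage(type, params) {
      return {
        type: type,             // e.g. "set-params", "abort", "re-recognize" (invented names)
        timestamp: Date.now(),  // UA-local timestamp (DD63)
        params: params || {}
      };
    }

    // DD46: grammars and parameters may change between chunks of a continuous session.
    var updateGrammar = makeControlMessage("set-params",
        { grammar: "https://example.com/pizza-toppings.grxml" });

    // FPR59: the web application can abort capture and recognition while capture is happening.
    var abortRecognition = makeControlMessage("abort");

    // DD76: a request flagged as re-recognizable before first use may later be
    // re-recognized with different parameters, or even a different service.
    var rerecognize = makeControlMessage("re-recognize",
        { service: "https://other-asr.example.com/recognize", lang: "en-GB" });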

4.4 SS->UA: Sending recognition results

4.5 SS->UA: Sending relevant events

5. Synthesis

5.1 UA->SS: Initiating a TTS request and sending data for synthesis

5.2 UA->SS: Sending control commands

5.3 SS->UA: Sending synthesis audio

  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.
  • DD82. Protocol should allow the client to begin TTS playback before receipt of all of the audio (see the sketch after this list).
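
The following JavaScript sketch illustrates DD82 under the assumption that synthesis audio arrives in chunks (for example over chunked HTTP, DD35). The chunkSource and audioSink objects and their methods are placeholders for this sketch, not part of any agreed API.

    // Sketch only: start playback as soon as the first chunk is buffered (DD82),
    // rather than waiting for the complete TTS response.
    function playStreamedTts(chunkSource, audioSink) {
      var started = false;
      chunkSource.onchunk = function (pcmChunk) {  // e.g. linear PCM, one of the codec candidates in DD80
        audioSink.enqueue(pcmChunk);
        if (!started) {
          audioSink.start();
          started = true;
        }
      };
      chunkSource.onend = function () {
        audioSink.endOfStream();
      };
    }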

5.4 SS->UA: Sending relevant events

  • FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back.
  • FPR29. Speech synthesis implementations should be allowed to fire implementation specific events.
  • DD61. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played (see the sketch after this list).
  • DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
  • DD68. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
  • DD81. Protocol design should not prevent implementability of low-latency event delivery.
  • DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.
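
The following JavaScript sketch shows how the mark and vendor-specific events of DD61, DD66, DD68 and DD84 might look to a web application. The tts object, the event type names and the property names (name, timestamp) are placeholders for this sketch; DD61 only requires that the mark name and a UA timestamp be present in some form.

    // Sketch only: DD61 - a mark event carrying the mark name and the UA timestamp
    // at which the corresponding audio began to play.
    tts.addEventListener("mark", function (event) {
      console.log("Reached mark '" + event.name + "' at UA time " + event.timestamp);
    });

    // DD66/DD68: vendor-specific events use DOM 3 extension events with a vendor prefix.
    tts.addEventListener("x-example-emphasis", function (event) {
      // DD84: the timing information refers to the occurrence itself, not to when
      // the event was generated or delivered.
      console.log("Vendor-specific event for audio offset " + event.timestamp);
    });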

References

[HTMLSPEECH]
Bodell et al., eds.: HTML Speech Incubator Group Final Report (Internal Draft). http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110607.html