W3C

HTML Speech XG: Protocol-related requirements and design decisions

Editor: Marc Schröder, DFKI

Including comments from Robert Brown (Microsoft) and Michael Johnston (AT&T)

Status: Work in progress / Ongoing discussion in protocol subgroup

Date: 14 June 2011

Previous version: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0011/protocol-reqs.html


Purpose of this document

This document aims to summarize requirements and design decisions relevant for the specification of a protocol supporting the communication between a User Agent (UA) and a Speech Service (SS). The summary is based on a subset of the requirements (FPR) and design decisions (DD) listed in the draft final report [HTMLSPEECH].

To allow verification that the group members share a view on what has been agreed, and to expose obvious omissions that should be pinned down, this document attempts to group the items by aspects of the protocol's envisaged use.

Comments are typeset like this paragraph, and prefixed with the initials of the commentator:

  • RB: Robert Brown
  • MJ: Michael Johnston
  • MS: Marc Schröder

Contents

  • 1. Relevant aspects of the interaction between UA and SS
  • 2. Generic protocol-related requirements
  • 3. UA->SS: Generic capability requests
  • 4. Recognition
  • 5. Synthesis
  • References

1. Relevant aspects of the interaction between UA and SS

In order to structure the collection of requirements and design decisions, this document groups them according to the following aspects of the interaction between UA and SS.

  • UA->SS: Generic capability requests
  • Recognition
    • UA->SS: Initiating an ASR request
    • UA->SS: Sending audio and related data for recognition
    • UA->SS: Sending control commands
    • SS->UA: Sending recognition results
    • SS->UA: Sending relevant events
  • Synthesis
    • UA->SS: Initiating a TTS request and sending data for synthesis
    • UA->SS: Sending control commands
    • SS->UA: Sending synthesis audio
    • SS->UA: Sending relevant events

This is an ad-hoc structure which may or may not capture other group members' understanding of the mechanism. One reason for proposing it is to verify whether there is consensus about these aspects.

Requirements or design decisions are listed under more than one heading if they seem to be relevant for several aspects.

2. Generic protocol-related requirements

  • FPR55. Web application must be able to encrypt communications to remote speech service. MS: Redundant, covered by DD69.
  • FPR31. User agents and speech services may agree to use alternate protocols for communication. MS: See discussion of DD35.
  • DD8. Speech service implementations must be referenceable by URI.
  • DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this. RB: The corollary to this is that a service MAY implement only TTS or only SR, or both.

    MJ: One we need to rework is DD35, as we are moving away from simple HTTP to support the full set of use cases. Suggested DD35 rewrite:

  • DD35. We will require support for WebSockets [was: http] for all communication between the user agent and any selected engine, including audio [was: chunked http] for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with). MJ: The question remains whether we also need to support negotiation of other protocols ...

    RB: [DD38 and 39] imply HTTP, and pre-date the discussion of continuous speech. Also, DD39 talks about engines rather than services, and doesn't mention TTS. They should be rewritten. [Suggested rewrites inline:]

  • DD38. The scripting API communicates its parameter settings by sending them as typed content in the protocol [was: in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.]
  • DD39. If an ASR or TTS service [was: an ASR engine] allows parameters to be specified in the URI in addition to being transported as content [was: in the POST body], when a parameter is specified in both places the one in the content [was: body] takes precedence. This has the effect of making parameters set in the URI be treated as default values.
  • DD56. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
  • DD62. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
  • DD64. API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
  • DD69. HTTPS must also be supported. RB: Should be written as "It MUST be possible to use an encrypted protocol." (A connection sketch after this list illustrates DD35, DD62, DD64 and DD69.)
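
To make these decisions more concrete, here is a minimal UA-side sketch of opening a connection to a speech service, assuming a WebSocket transport as in the suggested DD35 rewrite. The service URI, the "x-vendor-..." parameter names, and the idea of sending further parameters as a JSON message over the open connection are illustrative assumptions only; no wire format has been agreed.

    // Illustrative only: the URI, parameter names and message framing are invented.
    // DD8: the service is addressed by URI; FPR55/DD69: use an encrypted transport (wss:);
    // DD62/DD64: service-specific parameters carry an "x-" prefix.
    var serviceUri = "wss://asr.example.com/speech?lang=en-US&x-vendor-model=dictation";
    var socket = new WebSocket(serviceUri);

    socket.onopen = function () {
      // Hypothetical: remaining parameters are sent as typed content over the
      // open connection rather than in an HTTP POST body (cf. the DD38 rewrite).
      socket.send(JSON.stringify({
        "content-type": "application/json",
        parameters: { maxnbest: 3, "x-vendor-noise-model": "car" }
      }));
    };

    socket.onerror = function (e) {
      // The web application should be able to detect that the (encrypted)
      // connection could not be established.
      console.log("Could not reach speech service", e);
    };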

3. UA->SS: Generic capability requests

  • FPR39. Web application must be able to be notified when the selected language is not available.
  • FPR11. If the web apps specify speech services, it should be possible to specify parameters.
  • DD49. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
  • DD50. The API must provide a way to query the availability of a specific configuration of a service.
  • DD51. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent). (A capability-query sketch after this list illustrates DD49-DD52 and FPR39.)
  • DD52. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent).
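
Purely as an illustration of how DD49-DD52 and FPR39 might look on the wire, the following sketch queries a service about a specific configuration and handles a "no comment" answer. The message names, the JSON framing and the encoding of "no comment" are assumptions invented for this example, not agreed syntax.

    // Hypothetical capability request; message names, JSON framing and the
    // "no comment" encoding are invented for illustration.
    var socket = new WebSocket("wss://speech.example.com/service");

    socket.onopen = function () {
      // DD50/DD51: ask whether a specific configuration of the service is supported.
      socket.send(JSON.stringify({
        request: "capabilities",
        configuration: { resource: "asr", lang: "fr-FR" }
      }));
    };

    socket.onmessage = function (msg) {
      var reply = JSON.parse(msg.data);
      if (reply.answer === "no comment") {
        // DD52: without informed user consent, private information such as the
        // list of installed languages or voices is not revealed.
        return;
      }
      if (!reply.available) {
        // FPR39: the web application can be notified that the selected
        // language is not available.
        console.log("fr-FR is not available at this service");
      }
    };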

4. Recognition

4.1 UA->SS: Initiating an ASR request

4.2 UA->SS: Sending audio and related data for recognition

  • FPR25. Implementations should be allowed to start processing captured audio before the capture completes. MS: Redundant, see DD33.
  • FPR26. The API to do recognition should not introduce unneeded latency.
  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Seems redundant, now that we have DD31, DD32, DD80, and DD83.
  • FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). MS: Redundant with DD75.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
  • DD63. Every message from UA to speech service should send the UA-local timestamp.
  • DD67. The protocol must send its current timestamp to the speech service when it sends its first audio data. RB: Should say "The UA must...". I'm also nervous that it's premature to have made this decision. I'd prefer that we say "The protocol and UA must communicate sufficient timing information for the UA to determine the precise local timestamp for each service-generated event." (An audio-streaming sketch after this list illustrates DD33, DD63 and DD67.)
  • DD75. There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions. RB: Unless we can specify exactly what "strict" means (and I don't think we can), I'd prefer wording like: "There will be an API method for sending text input rather than audio, resulting in a match or nomatch event as if the text had actually been spoken. The precise algorithm for performing the match is at the discretion of the ASR service, and may optionally be modified by service-specific parameters".
  • DD77. In the protocol, the client (UA) must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times.
  • DD83. We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs.
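
As a sketch of how streaming audio and the timing requirements above could combine, the snippet below sends a first message carrying the UA-local timestamp and then streams encoded audio chunks while capture is still in progress. The message framing, the field names and the sendChunk() helper are assumptions for illustration; how audio is captured and encoded is outside the scope of this document.

    // Illustrative only: framing, field names and codec choice are assumptions.
    // Assume some capture pipeline calls sendChunk() with ArrayBuffers of already
    // encoded audio (e.g. Speex, cf. DD80); capture itself is out of scope here.
    var socket = new WebSocket("wss://asr.example.com/speech?lang=en-US");
    socket.binaryType = "arraybuffer";

    socket.onopen = function () {
      // DD63/DD67: messages carry the UA-local timestamp; in particular the first
      // audio-related message lets the service relate its clock to the UA's.
      socket.send(JSON.stringify({
        message: "start-audio",
        timestamp: Date.now(),         // UA-local clock
        "content-type": "audio/speex"  // one codec among those negotiated (DD31/DD32)
      }));
    };

    // DD33: audio is streamed while the user is still speaking, so the service
    // can begin recognition before capture completes.
    function sendChunk(encodedAudioBuffer) {
      if (socket.readyState === WebSocket.OPEN) {
        socket.send(encodedAudioBuffer);
      }
    }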

4.3 UA->SS: Sending control commands

  • FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.
  • DD46. For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result.
  • DD63. Every message from UA to speech service should send the UA-local timestamp.
  • DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD. RB: We need a clearer definition of the "feedback mechanism", since it will need to be represented in the protocol.
  • DD76. It must be possible to do one or more re-recognitions of any request for which you have indicated, before first use, that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service. (The control-command sketch after this list includes a re-recognition message.)
  • DD78. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.
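
The control commands discussed above might translate into UA->SS messages along the following lines. The function names, the message names ("abort", "set-params", "re-recognize") and the JSON framing are invented purely to make the discussion concrete; they are not proposed syntax.

    // Hypothetical control messages; `socket` is assumed to be an open WebSocket
    // to the ASR service (see the audio-streaming sketch in section 4.2).
    function sendControl(socket, body) {
      body.timestamp = Date.now(); // DD63: every UA->SS message carries the UA-local timestamp
      socket.send(JSON.stringify(body));
    }

    // FPR59: abort capture and recognition while capture is still happening.
    function abortRecognition(socket) {
      sendControl(socket, { message: "abort" });
    }

    // DD46: for continuous recognition, change grammars and parameters mid-stream;
    // the change applies to subsequent chunks/frames/results.
    function updateGrammars(socket, grammarUris) {
      sendControl(socket, { message: "set-params", grammars: grammarUris });
    }

    // DD76/DD77: request re-recognition of audio that the client (or the server)
    // has kept, possibly with changed parameters or even a different service.
    function reRecognize(socket, requestId, changedParams) {
      sendControl(socket, { message: "re-recognize", request: requestId, params: changedParams });
    }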

4.4 SS->UA: Sending recognition results

4.5 SS->UA: Sending relevant events

5. Synthesis

5.1 UA->SS: Initiating a TTS request and sending data for synthesis

5.2 UA->SS: Sending control commands

5.3 SS->UA: Sending synthesis audio

  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Redundant with DD31, DD32, DD80.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking. MS: and that TTS can begin playback before receipt of all of the audio.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times.
  • DD82. Protocol should support the client to begin TTS playback before receipt of all of the audio. MS: Suggest merging into DD33, see above. (A playback sketch follows this list.)
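
To illustrate DD33/DD82 from the UA side, the sketch below requests synthesis and hands each audio chunk to playback as soon as it arrives, instead of waiting for the complete rendering. The message framing and the voice parameter are invented, and playChunk() is a placeholder for whatever playback mechanism the UA uses; none of this is agreed protocol syntax.

    // Illustrative only: framing is invented; playChunk() is a placeholder, not a real API.
    var socket = new WebSocket("wss://tts.example.com/speech?x-vendor-voice=anna");
    socket.binaryType = "arraybuffer";

    socket.onopen = function () {
      socket.send(JSON.stringify({
        message: "speak",
        timestamp: Date.now(),                  // DD63: UA-local timestamp
        "content-type": "application/ssml+xml",
        body: "<speak>Your pizza is on its way.</speak>"
      }));
    };

    socket.onmessage = function (msg) {
      if (msg.data instanceof ArrayBuffer) {
        // DD33/DD82: playback may begin before all of the audio has been received.
        playChunk(msg.data);
      }
    };

    function playChunk(encodedAudioBuffer) {
      // Placeholder: hand the chunk to the UA's audio output pipeline.
    }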

5.4 SS->UA: Sending relevant events

  • FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back. MS: Redundant with DD61.
  • FPR29. Speech synthesis implementations should be allowed to fire implementation specific events. MS: Redundant with DD66.
  • DD61. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
  • DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
  • DD68. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
  • DD81. Protocol design should not prevent implementability of low-latency event delivery.
  • DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent. RB: This is written from the ASR point of view. TTS has a slightly different requirement. TTS timing should be expressed as an offset from the beginning of the render stream, since the UA can play any portion of the rendered audio at any time. (The sketch after this list shows one way such an offset could be converted into a UA-local timestamp.)
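
Finally, a sketch of how the timing requirements in DD61 and DD84 (with RB's TTS refinement) might fit together on the UA side: the service reports a mark as an offset into the rendered stream, and the UA converts that offset into a UA-local timestamp relative to when playback actually started. The event framing, the offset semantics and the "x-speech-mark" event name are assumptions for illustration only.

    // Illustrative only: the event framing, offset semantics and DOM event name
    // are invented. Assume `socket` is an open connection to the TTS service and
    // that playback of the rendered audio starts now.
    var playbackStart = Date.now(); // UA-local time at which playback began

    socket.onmessage = function (msg) {
      if (typeof msg.data !== "string") { return; } // binary frames carry audio
      var event = JSON.parse(msg.data);
      if (event.message === "mark") {
        // DD84 (TTS variant, per RB): the service reports an offset in ms from the
        // beginning of the render stream; the UA maps it onto its own clock.
        var uaTimestamp = playbackStart + event.offset;
        // DD61/DD66: fire a (vendor-prefixed) Javascript event carrying the mark
        // name and the UA timestamp for when it is played.
        document.dispatchEvent(new CustomEvent("x-speech-mark", {
          detail: { name: event.name, timestamp: uaTimestamp }
        }));
      }
    };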

References

[HTMLSPEECH]
Bodell et al., eds.: HTML Speech Incubator Group Final Report (Internal Draft). http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110607.html