W3C

HTML Speech XG: Protocol-related requirements and design decisions

Editor: Marc Schröder, DFKI

Including comments from Robert Brown (Microsoft) and Michael Johnston (AT&T)

Status: Work in progress / Ongoing discussion in protocol subgroup

Date: 14 June 2011

Previous version: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0011/protocol-reqs.html


Purpose of this document

This document aims to summarize requirements and design decisions relevant for the specification of a protocol supporting the communication between a User Agent (UA) and a Speech Service (SS). The summary is based on a subset of the requirements (FPR) and design decisions (DD) listed in the draft final report [HTMLSPEECH].

To allow verification that the group members share a view on what has been agreed, and to expose obvious omissions that should be pinned down, this document attempts to group the items by aspects of the protocol's envisaged use.

Comments are typeset like this paragraph, and prefixed with the initials of the commentator:

  • RB: Robert Brown
  • MJ: Michael Johnston
  • MS: Marc Schröder

Contents

  • 1. Relevant aspects of the interaction between UA and SS
  • 2. Generic protocol-related requirements
  • 3. UA->SS: Generic capability requests
  • 4. Recognition
  • 5. Synthesis
  • References

1. Relevant aspects of the interaction between UA and SS

In order to structure the collection of requirements and design decisions, this document groups them according to the following aspects of the interaction between UA and SS.

  • UA->SS: Generic capability requests
  • Recognition
    • UA->SS: Initiating an ASR request
    • UA->SS: Sending audio and related data for recognition
    • UA->SS: Sending control commands
    • SS->UA: Sending recognition results
    • SS->UA: Sending relevant events
  • Synthesis
    • UA->SS: Initiating a TTS request and sending data for synthesis
    • UA->SS: Sending control commands
    • SS->UA: Sending synthesis audio
    • SS->UA: Sending relevant events

This is an ad-hoc structure which may or may not capture other group members' understanding of the mechanism. One reason for proposing it is to verify whether there is consensus about these aspects.

Requirements or design decisions are listed under more than one heading if they seem to be relevant for several aspects.

2. Generic protocol-related requirements

  • FPR55. Web application must be able to encrypt communications to remote speech service. MS: Redundant, covered by DD69.
  • FPR31. User agents and speech services may agree to use alternate protocols for communication. MS: See discussion of DD35.
  • DD8. Speech service implementations must be referenceable by URI.
  • DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this. RB: The corollary to this is that a service MAY implement only TTS or only SR, or both.

    MJ: One we need to rework is DD35, as we are moving away from simple HTTP to support the full set of use cases. Suggested DD35 rewrite:

  • DD35. We will require support for WebSockets [was: http] for all communication between the user agent and any selected engine, including audio [was: chunked http] for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with). MJ: The question remains whether we also need to support negotiation of other protocols ...

    RB: [DD38 and 39] imply HTTP, and pre-date the discussion of continuous speech. Also, DD39 talks about engines rather than services, and doesn't mention TTS. They should be rewritten. [Suggested rewrites inline:]

  • DD38. The scripting API communicates its parameter settings by sending them as typed content in the protocol [was: in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.]
  • DD39. If an ASR or TTS service [was: an ASR engine] allows parameters to be specified in the URI in addition to being transported as content [was: in the POST body], when a parameter is specified in both places the one in the content [was: body] takes precedence. This has the effect of making parameters set in the URI be treated as default values.
  • DD56. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
  • DD62. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
  • DD64. API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
  • DD69. HTTPS must also be supported. RB: Should be written as "It MUST be possible to use an encrypted protocol." (A connection sketch after this list illustrates DD35, DD62, DD64 and DD69.)
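
To make these decisions more concrete, here is a minimal UA-side sketch of opening a connection to a speech service, assuming a WebSocket transport as in the suggested DD35 rewrite. The service URI, the "x-vendor-..." parameter names, and the idea of sending further parameters as a JSON message over the open connection are illustrative assumptions only; no wire format has been agreed.

    // Illustrative only: the URI, parameter names and message framing are invented.
    // DD8: the service is addressed by URI; FPR55/DD69: use an encrypted transport (wss:);
    // DD62/DD64: service-specific parameters carry an "x-" prefix.
    var serviceUri = "wss://asr.example.com/speech?lang=en-US&x-vendor-model=dictation";
    var socket = new WebSocket(serviceUri);

    socket.onopen = function () {
      // Hypothetical: remaining parameters are sent as typed content over the
      // open connection rather than in an HTTP POST body (cf. the DD38 rewrite).
      socket.send(JSON.stringify({
        "content-type": "application/json",
        parameters: { maxnbest: 3, "x-vendor-noise-model": "car" }
      }));
    };

    socket.onerror = function (e) {
      // The web application should be able to detect that the (encrypted)
      // connection could not be established.
      console.log("Could not reach speech service", e);
    };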

3. UA->SS: Generic capability requests

  • FPR39. Web application must be able to be notified when the selected language is not available.
  • FPR11. If the web apps specify speech services, it should be possible to specify parameters.
  • DD49. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
  • DD50. The API must provide a way to query the availability of a specific configuration of a service.
  • DD51. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent). (A capability-query sketch after this list illustrates DD49-DD52 and FPR39.)
  • DD52. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent).
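
Purely as an illustration of how DD49-DD52 and FPR39 might look on the wire, the following sketch queries a service about a specific configuration and handles a "no comment" answer. The message names, the JSON framing and the encoding of "no comment" are assumptions invented for this example, not agreed syntax.

    // Hypothetical capability request; message names, JSON framing and the
    // "no comment" encoding are invented for illustration.
    var socket = new WebSocket("wss://speech.example.com/service");

    socket.onopen = function () {
      // DD50/DD51: ask whether a specific configuration of the service is supported.
      socket.send(JSON.stringify({
        request: "capabilities",
        configuration: { resource: "asr", lang: "fr-FR" }
      }));
    };

    socket.onmessage = function (msg) {
      var reply = JSON.parse(msg.data);
      if (reply.answer === "no comment") {
        // DD52: without informed user consent, private information such as the
        // list of installed languages or voices is not revealed.
        return;
      }
      if (!reply.available) {
        // FPR39: the web application can be notified that the selected
        // language is not available.
        console.log("fr-FR is not available at this service");
      }
    };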

4. Recognition

4.1 UA->SS: Initiating an ASR request

4.2 UA->SS: Sending audio and related data for recognition

  • FPR25. Implementations should be allowed to start processing captured audio before the capture completes. MS: Redundant, see DD33.
  • FPR26. The API to do recognition should not introduce unneeded latency.
  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Seems redundant, now that we have DD31, DD32, DD80, and DD83.
  • FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). MS: Redundant with DD75.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
  • DD63. Every message from UA to speech service should send the UA-local timestamp.
  • DD67. The protocol must send its current timestamp to the speech service when it sends its first audio data. RB: Should say "The UA must...". I'm also nervous that it's premature to have made this decision. I'd prefer that we say "The protocol and UA must communicate sufficient timing information for the UA to determine the precise local timestamp for each service-generated event." (An audio-streaming sketch after this list illustrates DD33, DD63 and DD67.)
  • DD75. There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions. RB: Unless we can specify exactly what "strict" means (and I don't think we can), I'd prefer wording like: "There will be an API method for sending text input rather than audio, resulting in a match or nomatch event as if the text had actually been spoken. The precise algorithm for performing the match is at the discretion of the ASR service, and may optionally be modified by service-specific parameters".
  • DD77. In the protocol, the client (UA) must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times.
  • DD83. We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs.
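
As a sketch of how streaming audio and the timing requirements above could combine, the snippet below sends a first message carrying the UA-local timestamp and then streams encoded audio chunks while capture is still in progress. The message framing, the field names and the sendChunk() helper are assumptions for illustration; how audio is captured and encoded is outside the scope of this document.

    // Illustrative only: framing, field names and codec choice are assumptions.
    // Assume some capture pipeline calls sendChunk() with ArrayBuffers of already
    // encoded audio (e.g. Speex, cf. DD80); capture itself is out of scope here.
    var socket = new WebSocket("wss://asr.example.com/speech?lang=en-US");
    socket.binaryType = "arraybuffer";

    socket.onopen = function () {
      // DD63/DD67: messages carry the UA-local timestamp; in particular the first
      // audio-related message lets the service relate its clock to the UA's.
      socket.send(JSON.stringify({
        message: "start-audio",
        timestamp: Date.now(),         // UA-local clock
        "content-type": "audio/speex"  // one codec among those negotiated (DD31/DD32)
      }));
    };

    // DD33: audio is streamed while the user is still speaking, so the service
    // can begin recognition before capture completes.
    function sendChunk(encodedAudioBuffer) {
      if (socket.readyState === WebSocket.OPEN) {
        socket.send(encodedAudioBuffer);
      }
    }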

4.3 UA->SS: Sending control commands

  • FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.
  • DD46. For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result.
  • DD63. Every message from UA to speech service should send the UA-local timestamp.
  • DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD. RB: We need a clearer definition of the "feedback mechanism", since it will need to be represented in the protocol.
  • DD76. It must be possible to do one or more re-recognitions of any request for which you have indicated, before first use, that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service. (The control-command sketch after this list includes a re-recognition message.)
  • DD78. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.
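
The control commands discussed above might translate into UA->SS messages along the following lines. The function names, the message names ("abort", "set-params", "re-recognize") and the JSON framing are invented purely to make the discussion concrete; they are not proposed syntax.

    // Hypothetical control messages; `socket` is assumed to be an open WebSocket
    // to the ASR service (see the audio-streaming sketch in section 4.2).
    function sendControl(socket, body) {
      body.timestamp = Date.now(); // DD63: every UA->SS message carries the UA-local timestamp
      socket.send(JSON.stringify(body));
    }

    // FPR59: abort capture and recognition while capture is still happening.
    function abortRecognition(socket) {
      sendControl(socket, { message: "abort" });
    }

    // DD46: for continuous recognition, change grammars and parameters mid-stream;
    // the change applies to subsequent chunks/frames/results.
    function updateGrammars(socket, grammarUris) {
      sendControl(socket, { message: "set-params", grammars: grammarUris });
    }

    // DD76/DD77: request re-recognition of audio that the client (or the server)
    // has kept, possibly with changed parameters or even a different service.
    function reRecognize(socket, requestId, changedParams) {
      sendControl(socket, { message: "re-recognize", request: requestId, params: changedParams });
    }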

4.4 SS->UA: Sending recognition results

4.5 SS->UA: Sending relevant events

5. Synthesis

5.1 UA->SS: Initiating a TTS request and sending data for synthesis

5.2 UA->SS: Sending control commands

5.3 SS->UA: Sending synthesis audio

  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Redundant with DD31, DD32, DD80.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking. MS: and that TTS can begin playback before receipt of all of the audio.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times.
  • DD82. Protocol should support the client to begin TTS playback before receipt of all of the audio. MS: Suggest merging into DD33, see above. (A playback sketch follows this list.)
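
To illustrate DD33/DD82 from the UA side, the sketch below requests synthesis and hands each audio chunk to playback as soon as it arrives, instead of waiting for the complete rendering. The message framing and the voice parameter are invented, and playChunk() is a placeholder for whatever playback mechanism the UA uses; none of this is agreed protocol syntax.

    // Illustrative only: framing is invented; playChunk() is a placeholder, not a real API.
    var socket = new WebSocket("wss://tts.example.com/speech?x-vendor-voice=anna");
    socket.binaryType = "arraybuffer";

    socket.onopen = function () {
      socket.send(JSON.stringify({
        message: "speak",
        timestamp: Date.now(),                  // DD63: UA-local timestamp
        "content-type": "application/ssml+xml",
        body: "<speak>Your pizza is on its way.</speak>"
      }));
    };

    socket.onmessage = function (msg) {
      if (msg.data instanceof ArrayBuffer) {
        // DD33/DD82: playback may begin before all of the audio has been received.
        playChunk(msg.data);
      }
    };

    function playChunk(encodedAudioBuffer) {
      // Placeholder: hand the chunk to the UA's audio output pipeline.
    }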

5.4 SS->UA: Sending relevant events

  • FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back. MS: Redundant with DD61.
  • FPR29. Speech synthesis implementations should be allowed to fire implementation specific events. MS: Redundant with DD66.
  • DD61. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
  • DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
  • DD68. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
  • DD81. Protocol design should not prevent implementability of low-latency event delivery.
  • DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent. RB: This is written from the ASR point of view. TTS has a slightly different requirement. TTS timing should be expressed as an offset from the beginning of the render stream, since the UA can play any portion of the rendered audio at any time. (The sketch after this list shows one way such an offset could be converted into a UA-local timestamp.)
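
Finally, a sketch of how the timing requirements in DD61 and DD84 (with RB's TTS refinement) might fit together on the UA side: the service reports a mark as an offset into the rendered stream, and the UA converts that offset into a UA-local timestamp relative to when playback actually started. The event framing, the offset semantics and the "x-speech-mark" event name are assumptions for illustration only.

    // Illustrative only: the event framing, offset semantics and DOM event name
    // are invented. Assume `socket` is an open connection to the TTS service and
    // that playback of the rendered audio starts now.
    var playbackStart = Date.now(); // UA-local time at which playback began

    socket.onmessage = function (msg) {
      if (typeof msg.data !== "string") { return; } // binary frames carry audio
      var event = JSON.parse(msg.data);
      if (event.message === "mark") {
        // DD84 (TTS variant, per RB): the service reports an offset in ms from the
        // beginning of the render stream; the UA maps it onto its own clock.
        var uaTimestamp = playbackStart + event.offset;
        // DD61/DD66: fire a (vendor-prefixed) Javascript event carrying the mark
        // name and the UA timestamp for when it is played.
        document.dispatchEvent(new CustomEvent("x-speech-mark", {
          detail: { name: event.name, timestamp: uaTimestamp }
        }));
      }
    };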

References

[HTMLSPEECH]
Bodell et al., eds.: HTML Speech Incubator Group Final Report (Internal Draft). http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110607.html