Protocol requirements draft

Purpose of this document

This document aims to summarize requirements and design decisions relevant for the specification of a protocol supporting the communication between a User Agent (UA) and a Speech Service (SS). The summary is based on a subset of the requirements (FPR) and design decisions (DD) listed in the draft final report [HTMLSPEECH].

In order to allow for a verification that the group members share a view on what has been agreed and whether there are obvious omissions that should be pinned down, this document attempts to group the items by aspects of the protocol's envisaged use.

1. Relevant aspects of the interaction between UA and SS
2. Generic protocol-related requirements
3. UA->SS: Generic capability requests
4. Recognition
5. Synthesis
References

1. Relevant aspects of the interaction between UA and SS

In order to structure the collection of requirements and design decisions, this document groups them according to the following aspects of the interaction between UA and SS.

UA->SS: Generic capability requests
Recognition
- UA->SS: Initiating an ASR request
- UA->SS: Sending audio and related data for recognition
- UA->SS: Sending control commands
- SS->UA: Sending recognition results
- SS->UA: Sending relevant events
Synthesis
- UA->SS: Initiating a TTS request and sending data for synthesis
- UA->SS: Sending control commands
- SS->UA: Sending synthesis audio
- SS->UA: Sending relevant events

This is an ad-hoc structure which may or may not capture other group members' understanding of the mechanism. One reason of proposing it is to verify whether there is consensus about these aspects.

Requirements or design decisions are listed under more than one heading if they seem to be relevant for several aspects.

2. Generic protocol-related requirements

FPR55. Web application must be able to encrypt communications to remote speech service.
FPR31. User agents and speech services may agree to use alternate protocols for communication.
DD8. Speech service implementations must be referenceable by URI.
DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.
DD35. We will require support for http for all communication between the user agent and any selected engine, including chunked http for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).
DD38. The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
DD39. If an ASR engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values.
DD56. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS).
DD62. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
DD64. API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
DD69. HTTPS must also be supported.

3. UA->SS: Generic capability requests

FPR39. Web application must be able to be notified when the selected language is not available.
FPR11. If the web apps specify speech services, it should be possible to specify parameters.
DD49. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well.
DD50. The API must provide a way to query the availability of a specific configuration of a service.
DD51. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent).
DD52. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent).

4. Recognition

4.1 UA->SS: Initiating an ASR request

FPR38. Web application must be able to specify language of recognition.
FPR45. Applications should be able to specify the grammars (or lack thereof) separately for each recognition.
FPR34. Web application must be able to specify domain specific custom grammars.
FPR48. Web application author must be able to specify a domain specific statistical language model.
FPR2. Implementations must support the XML format of SRGS and must support SISR.
FPR44. Recognition without specifying a grammar should be possible.
FPR58. Web application and speech services must have a means of binding session information to communications.
FPR57. Web applications must be able to request recognition based on previously sent audio.
DD9. It must be possible to reference ASR grammars by URI.
DD10. It must be possible to select the ASR language using language tags.
DD11. It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined.
DD12. The XML format of SRGS 1.0 is mandatory to support, and it is the only mandated grammar format. Note in particular that this means we do not have any requirement for SLM support or SRGS ABNF support.
DD14. SISR 1.0 support is mandatory, and it is the only mandated semantic interpretation format.
DD20. For grammar URIs, the "HTTP" and "data" protocol schemes must be supported.
DD21. A standard set of common-task grammars must be supported. The details of what those are is TBD.
DD36. Maxresults should be an ASR parameter representing the maximum number of results to return.
DD37. The user agent will use the URI for the ASR engine exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters.
DD55. The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar.
DD63. Every message from UA to speech service should send the UA-local timestamp.
DD72. In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD).
DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.

4.2 UA->SS: Sending audio and related data for recognition

FPR25. Implementations should be allowed to start processing captured audio before the capture completes.
FPR26. The API to do recognition should not introduce unneeded latency.
FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements.
FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent).
DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
DD63. Every message from UA to speech service should send the UA-local timestamp.
DD67. The protocol must send its current timestamp to the speech service when it sends its first audio data.
DD75. There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions.
DD77. In the protocol, the client must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.
DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.
DD83. We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs.

4.3 UA->SS: Sending control commands

FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.
DD46. For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result
DD63. Every message from UA to speech service should send the UA-local timestamp.
DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD.
DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service.
DD78. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it.

4.4 SS->UA: Sending recognition results

FPR4. It should be possible for the web application to get the recognition results in a standard format such as EMMA.
FPR35. Web application must be notified when speech recognition errors or non-matches occur.
FPR5. It should be easy for the web appls to get access to the most common pieces of recognition results such as utterance, confidence, and nbests.
FPR27. Speech recognition implementations should be allowed to add implementation specific information to speech recognition results.
DD18. For reco results, both the DOM representation of EMMA and the XML text representation must be provided.
DD19. For reco results, a simple Javascript representation of a list of results must be provided, with each result containing the recognized utterance, confidence score, and semantic interpretation. Note that this may need to be adjusted based on any decision regarding support for continuous recognition.
DD34. It must be possible for the recognizer to return a final result before the user is done speaking.
DD77. In the protocol, the client must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.
DD79. No explicit need for JSON format of EMMA, but we might use it if it existed.

4.5 SS->UA: Sending relevant events

FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking).
FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition.
FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition.
FPR28. Speech recognition implementations should be allowed to fire implementation specific events.
DD30. We expect to have the following six audio/speech events: onaudiostart/onaudioend, onsoundstart/onsoundend, onspeechstart/onspeechend. The onsound* events represent a "probably speech but not sure" condition, while the onspeech* events represent the recognizer being sure there's speech. The former are low latency. An end event can only occur after at least one start event of the same type has occurred. Only the user agent can generate onaudio* events, the energy detector can only generate onsound* events, and the speech service can only generate onspeech* events.
DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
DD81. Protocol design should not prevent implementability of low-latency event delivery.
DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.

5. Synthesis

5.1 UA->SS: Initiating a TTS request and sending data for synthesis

FPR3. Implementation must support SSML.
FPR46. Web apps should be able to specify which voice is used for TTS.
DD13. For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain text. These are the only mandated formats.
DD37. The user agent will use the URI for the ASR engine exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters.
DD63. Every message from UA to speech service should send the UA-local timestamp.

5.2 UA->SS: Sending control commands

5.3 SS->UA: Sending synthesis audio

FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements.
DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each.
DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement.
DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking.
DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.
DD82. Protocol should support the client to begin TTS playback before receipt of all of the audio.

5.4 SS->UA: Sending relevant events

FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back.
FPR29. Speech synthesis implementations should be allowed to fire implementation specific events.
DD61. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
DD68. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
DD81. Protocol design should not prevent implementability of low-latency event delivery.
DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.

References

[HTMLSPEECH]: Bodell et al., eds.: HTML Speech Incubator Group Final Report (Internal Draft). http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110607.html

HTML Speech XG: Protocol-related requirements and design decisions