W3C

HTML Speech XG: Protocol-related requirements and design decisions

Editor: Marc Schröder, DFKI

Including comments from Robert Brown, Microsoft and Michael Johnston, AT&T

Status: Work in progress / Ongoing discussion in protocol subgroup

Review and comparison to draft protocol document

Date: 13 July 2011

Previous version: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0011/protocol-reqs.html


Purpose of this document

This document aims to summarize requirements and design decisions relevant for the specification of a protocol supporting the communication between a User Agent (UA) and a Speech Service (SS). The summary is based on a subset of the requirements (FPR) and design decisions (DD) listed in the draft final report [HTMLSPEECH].

To help verify that group members share a common view of what has been agreed, and to expose obvious omissions that should be pinned down, this document groups the items by aspects of the protocol's envisaged use.

Comments are typeset like this paragraph, and prefixed with the initials of the commentator:

  • RB: Robert Brown
  • MJ: Michael Johnston
  • MS: Marc Schröder

Contents

  1. Relevant aspects of the interaction between UA and SS
  2. Generic protocol-related requirements
  3. UA->SS: Generic capability requests
  4. Recognition
  5. Synthesis
  References

1. Relevant aspects of the interaction between UA and SS

In order to structure the collection of requirements and design decisions, this document groups them according to the following aspects of the interaction between UA and SS.

  • UA->SS: Generic capability requests
  • Recognition
    • UA->SS: Initiating an ASR request
    • UA->SS: Sending audio and related data for recognition
    • UA->SS: Sending control commands
    • SS->UA: Sending recognition results
    • SS->UA: Sending relevant events
  • Synthesis
    • UA->SS: Initiating a TTS request and sending data for synthesis
    • UA->SS: Sending control commands
    • SS->UA: Sending synthesis audio
    • SS->UA: Sending relevant events

This is an ad-hoc structure which may or may not capture other group members' understanding of the mechanism. One reason for proposing it is to verify whether there is consensus about these aspects.

Requirements or design decisions are listed under more than one heading if they seem to be relevant for several aspects.

2. Generic protocol-related requirements

  • FPR55. Web application must be able to encrypt communications to remote speech service. MS: Redundant, covered by DD69
  • FPR31. User agents and speech services may agree to use alternate protocols for communication. MS: See discussion of DD35.
  • DD8. Speech service implementations must be referenceable by URI. PE: Covered in 3.1.
  • DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this. RB: The corollary to this is that a service MAY implement only TTS or only SR, or both. PE: Covered. No restrictions on this imposed or implied.

    MJ: One we need to rework is DD35. As we are moving away from simple HTTP to support the full set of use cases, a suggested DD35 rewrite follows:

  • DD35 (suggested rewrite). We will require support for WebSockets for all communication between the user agent and any selected engine, and support negotiation of other protocols (such as whatever RTCWeb/WebRTC comes up with). [The original DD35 required http, including chunked http audio for media streaming, and listed WebSockets among the protocols to negotiate.] MJ: Question remains if we also need to support negotiation of other protocols ...

    RB: [DD38 and 39] imply HTTP, and pre-date the discussion of continuous speech. Also, DD39 talks about engines rather than services, and doesn't mention TTS. They should be rewritten. [Suggested rewrites inline:]

  • DD38 (suggested rewrite). The scripting API communicates its parameter settings as typed content in the protocol. [The original wording had them sent in the body of a POST request as Media Type "multipart", with the accepted subtype(s) (e.g., mixed, formdata) TBD.] MJ: Need to clarify, can't be in the POST body, but parameters can either be in the service URI or show up in headers of various methods e.g. SET-PARAMS, LISTEN, ...
  • DD39 (suggested rewrite). If an ASR engine or TTS service allows parameters to be specified in the URI in addition to being transported as content, when a parameter is specified in both places the one in the content takes precedence. This has the effect of making parameters set in the URI be treated as default values. [The original wording referred to the POST body rather than protocol content.] MJ: Covered, this is stated in 3.1.
  • DD56. The API will support multiple simultaneous requests to speech services (same or different, ASR and TTS). MJ: This is more for the API than for the protocol. PE: Per 2.2, streams support different Request-IDs, so there is nothing in the protocol blocking this.
  • DD62. It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard. MJ: Conflict here; see 3.1: parameters can be in the URI or in headers but not in the body. PE: Also, not many details on SET-PARAMS are provided.
  • DD64. API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects. PE: We have a custom vendor resource under 3.2.1 and vendor-listen-mode under 5.3. Presumably other custom params can be set via SET-PARAMS (see the sketch after this list)? MJ: Are there any issues with pushing 'arbitrary Javascript objects' over the protocol?
  • DD69. HTTPS must also be supported. RB: Should be written as "It MUST be possible to use an encrypted protocol." PE: Not discussed in the protocol doc.
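
To make DD62/DD64/DD69 concrete, the following Javascript sketch shows how a client might open an encrypted WebSocket connection and set a vendor-specific parameter via SET-PARAMS. Only the method name, the "x-" prefix convention and the use of WebSockets come from the items above; the wss:// endpoint, the request identifier, the header syntax and the parameter names are illustrative assumptions, not the protocol draft's actual syntax.

    // Sketch only: the exact message framing is defined by the draft protocol
    // document, not here. Endpoint, request id and parameter names are made up.
    var socket = new WebSocket("wss://speech.example.com/service"); // DD69: encrypted transport

    socket.onopen = function () {
      socket.send(
        "SET-PARAMS 0001\r\n" +            // assumed framing: method plus request id
        "Confidence-Threshold: 0.5\r\n" +  // illustrative standard parameter
        "x-acme-noise-model: car\r\n"      // DD64: vendor-specific, "x-"-prefixed
      );
    };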

3. UA->SS: Generic capability requests

  • FPR39. Web application must be able to be notified when the selected language is not available. PE: Covered in 4.1.
  • FPR11. If the web apps specify speech services, it should be possible to specify parameters. PE: Covered in 4.1
  • DD49. The API should provide a way to determine if a service is available before trying to use the service; this applies to the default service as well. PE: A service can be specified per 3.2.1 and parameters queried per 4.1, but there is nothing on checking availability as such. MJ: See DD50 below.
  • DD50. The API must provide a way to query the availability of a specific configuration of a service. MJ: We have material relevant to this in 4.1; we have Supported-Languages and Supported-Media as ways to check for configuration, and I guess this can also be used to check whether recognition vs. synthesis is available (see the sketch after this list). Do we also need headers for checking the availability of a particular grammar? Or a particular class of grammar, e.g. SRGS or some vendor-specific format, or to check if a built-in grammar is present? Are there other configuration aspects we need to be able to check for? PE: Might also be good to be able to specify packages as sets of parameters.
  • DD51. The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to answer with "no comment" (or equivalent). PE: Not really covered by GET_PARAMS in 4.1.
  • DD52. Informed user consent is required for all use of private information. This includes list of languages for ASR and voices for TTS. When such information is requested by the web app or speech service and permission is refused, the API must return "no comment" (or equivalent). PE: Not covered.
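
As a rough illustration of the capability checks discussed under DD49/DD50, the sketch below issues a GET-PARAMS request and looks for the Supported-Languages and Supported-Media headers mentioned in MJ's comment. Everything beyond those header names and the method name (framing, request id, response layout) is an assumption for illustration only.

    // Sketch, not normative: query what the service supports (DD49/DD50).
    var socket = new WebSocket("wss://speech.example.com/service");

    socket.onopen = function () {
      socket.send("GET-PARAMS 0002\r\n");   // assumed framing
    };

    socket.onmessage = function (event) {
      if (typeof event.data !== "string") return;               // ignore binary frames here
      var languages = /Supported-Languages:\s*(.*)/i.exec(event.data);
      var media = /Supported-Media:\s*(.*)/i.exec(event.data);
      if (languages) console.log("Languages:", languages[1]);   // e.g. "en-US, de-DE"
      if (media) console.log("Media types:", media[1]);         // e.g. "audio/x-wav"
    };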

4. Recognition

4.1 UA->SS: Initiating an ASR request

  • FPR38. Web application must be able to specify language of recognition. MJ: Covered; we have the reco header 'Speech-Language' in the protocol. Presumably language could also be specified in an SRGS grammar.
  • FPR45. Applications should be able to specify the grammars (or lack thereof) separately for each recognition. MJ: Covered; beyond this, we can even change grammars during the course of streaming audio.
  • FPR34. Web application must be able to specify domain specific custom grammars. MJ: Covered; in the case of a remote speech resource, specific SRGS grammars or SLMs can be specified. In the case of a default recognizer, only SRGS.
  • FPR48. Web application author must be able to specify a domain specific statistical language model. MS: Is this in conflict with DD12? Do we allow for SLMs or not? MJ: Covered, assuming they are using a remote speech resource.
  • FPR2. Implementations must support the XML format of SRGS and must support SISR. MS: Redundant, replaced by DD12 and DD14.
  • FPR44. Recognition without specifying a grammar should be possible. MS: Redundant, see DD11.
  • FPR58. Web application and speech services must have a means of binding session information to communications. MJ: Need to clarify.
  • FPR57. Web applications must be able to request recognition based on previously sent audio. MJ: * Not clear that we have this yet in the protocol. Need a way to refer to audio stored on the server and request re-recognition. We do have the Save-Waveform header, need to work through example.
  • DD9. It must be possible to reference ASR grammars by URI. MJ: Covered, grammar-activate headers take URI values
  • DD10. It must be possible to select the ASR language using language tags. MJ: Language attributes could be used in the SRGS grammar. Does the protocol need to say anything about this? * What is the relationship between the language set using Speech-Language vs. the language set in grammar tags? Speech-Language is a reco header, rather than defined per grammar. Will we support multiple active grammars in different languages? In that case, would Speech-Language not be specified, with the tags appearing in the grammars instead? What about the case where there are multiple SLMs, each for a different language? See also http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Nov/0106.html
  • DD11. It must be possible to leave the ASR grammar unspecified. Behavior in this case is not yet defined. MJ: Speech-Language headers and in-grammar tags (xml:lang on grammar) are both optional. What would the behavior be? Of course this has an impact on the dictionary to be used in compiling the LM.
  • DD12. The XML format of SRGS 1.0 is mandatory to support, and it is the only mandated grammar format. Note in particular that this means we do not have any requirement for SLM support or SRGS ABNF support. MJ: * We need to revise this: SRGS is the only format mandated to be supported by both default and specified speech services, but specified speech services will often support SLMs and could support ABNF.
  • DD14. SISR 1.0 support is mandatory, and it is the only mandated semantic interpretation format. MJ: Covered
  • DD20. For grammar URIs, the "HTTP" and "data" protocol schemes must be supported. MJ: We just specify URI, do we need to say anything specific about HTTP and data?
  • DD21. A standard set of common-task grammars must be supported. The details of what those are is TBD. MJ: This is in conflict with the protocol doc section 5.4, which specifies that speech services MAY support pre-defined builtin grammars. Should this instead be a requirement on default speech services rather than on remote services accessed through the protocol?
  • DD36. Maxresults should be an ASR parameter representing the maximum number of results to return. MJ: We have the N-Best-List-Length header in 5.3; is this sufficient? Presumably if you set it to 10 and there are only 2 results you get 2, and if there are 12 results you get 10. Generally the number available is influenced by other parameters (beam width, limits on the number of active arcs, etc.). Do we need those, or are they too vendor-specific?
  • DD37. The user agent will use the URI for the ASR engine or TTS service exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters. MJ: This seems quite URI-specific; we are also able to specify parameters in the protocol using SET-PARAMS.
  • DD55. The API will support multiple simultaneous grammars, any combination of allowed grammar formats. It will also support a weight on each grammar. MJ: Covered in 5.3; we have the ability to specify in the grammar-activate header a weight to associate with each URI, and multiple URIs can be specified (see the sketch after this list). Assume that working out what type of compiled language model or grammar it is comes from examination of the URI or what it points to. In some cases, grammars will already be compiled; in others, they will be built on request. DEFINE-GRAMMAR can be used for compile requests. Presumably it is also used for loading grammars which have already been compiled. Is this required?
  • DD63. Every message from UA to speech service should send the UA-local timestamp. MS: Seems overly specific at this point, see also Robert's comment on DD67 below. MJ: Agree, do we really need to send the local timestamp with every single message?
  • DD72. In Javascript, speech reco requests should have an attribute for a sequence of grammars, each of which can have properties, including weight (and possibly language, but that is TBD). MJ: This relates to DD55 above and DD10. As of now we can specify a rulename and weight for each grammar. Do we also need to specify language or any other per-grammar parameters?
  • DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service. MJ: This relates to FPR57 above. Do we need a header in the recognition protocol that can be used to say "keep this around so we can ask you to recognize it again"? The case of re-reco with the same service is the only one that impacts the protocol. How long does the audio need to be kept around: for the duration of the WebSocket connection, or for the duration of the web session? When you make a new WebSocket connection and refer to the ID of previously stored audio, there is no guarantee that you end up on the same recognition server, so this then requires some common storage for holding audio for possible re-recognition. We need to make sure it can only be re-recognized by request from the same client. In the case of switching services, the audio will need to be stored locally by the UA and made available for future recognition requests.
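
The following sketch pulls together several of the items above (FPR38, DD10, DD36, DD55): a LISTEN request that sets the recognition language, activates two weighted grammars, and limits the n-best list. The method and header names are taken from the comments above; the message framing, the weight syntax and all values are illustrative assumptions.

    // Sketch only: initiating a recognition request (section 4.1).
    var socket = new WebSocket("wss://speech.example.com/service");

    socket.onopen = function () {
      socket.send(
        "LISTEN 0003\r\n" +
        "Speech-Language: en-US\r\n" +                 // FPR38 / DD10
        // DD55: multiple simultaneous grammars, each with a weight.
        // The ";weight=" syntax is an assumption.
        "Grammar-Activate: <https://app.example.org/pizza.grxml>;weight=0.8, " +
            "<https://app.example.org/drinks.grxml>;weight=0.2\r\n" +
        "N-Best-List-Length: 5\r\n"                    // DD36
      );
      // Audio would then follow as binary WebSocket frames (see 4.2).
    };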

4.2 UA->SS: Sending audio and related data for recognition

  • FPR25. Implementations should be allowed to start processing captured audio before the capture completes. MS: Redundant, see DD33.
  • FPR26. The API to do recognition should not introduce unneeded latency.
  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Seems redundant, now that we have DD31, DD32, DD80, and DD83.
  • FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). MS: Redundant with DD75.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each. MJ: * We describe media transmission in 3.3 but do not yet indicate mandatory-to-support codecs.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement. MJ: Right now it is fairly open; it doesn't really limit the media that are sent.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking. MJ: Covered; the protocol is designed for streaming audio over WebSocket.
  • DD63. Every message from UA to speech service should send the UA-local timestamp. MJ: Do we need local time on all messages?
  • DD67. The protocol must send its current timestamp to the speech service when it sends its first audio data. RB: Should say "The UA must...". I'm also nervous that it's premature to have made this decision. I'd prefer that we say "The protocol and UA must communicate sufficient timing information for the UA to determine the precise local timestamp for each service-generated event."
  • DD75. There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions. RB: Unless we can specify exactly what "strict" means (and I don't think we can), I'd prefer wording like: "There will be an API method for sending text input rather than audio, resulting in a match or nomatch event as if the text had actually been spoken. The precise algorithm for performing the match is at the discretion of the ASR service, and may optionally be modified by service-specific parameters". MJ: Covered; INTERPRET with the Interpret-Text header in 5.1, 5.3 (see the sketch after this list). Agree with RB not to do strict vs. fuzzy: strict is actually easier, at least with respect to SRGS (i.e., the string must be included in the SRGS grammar), but how to define fuzzy is less clear.
  • DD77. In the protocol, the UA (client) must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent. MJ: Not covered; do we need some kind of recognition event to indicate that the audio has been stored, or can this be specified in END-OF-INPUT or RECOGNITION-COMPLETE?
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times. MJ: If we can't agree on anything else, should we settle on plain old linear PCM?
  • DD83. We will not require support for video codecs. However, protocol design must not prohibit transmission of codecs that have the same interface requirements as audio codecs. MJ: Covered, Media stream description as written should not prevent video.
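
For the text-only case (FPR56/DD75), a request along the following lines would exercise the INTERPRET method with the Interpret-Text header that MJ mentions above. Only those two names come from the discussion; the framing, the grammar URI and the result message are assumptions.

    // Sketch: NL interpretation from text input, no audio sent (FPR56 / DD75).
    var socket = new WebSocket("wss://speech.example.com/service");

    socket.onopen = function () {
      socket.send(
        "INTERPRET 0004\r\n" +
        "Interpret-Text: a large pepperoni pizza\r\n" +               // text instead of audio
        "Grammar-Activate: <https://app.example.org/pizza.grxml>\r\n"
        // How the text is matched against the grammar is left to the service,
        // per RB's suggested wording for DD75.
      );
    };
    // The service would presumably answer with the same kind of result message
    // as for spoken input (e.g. RECOGNITION-COMPLETE), carrying the SISR result.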

4.3 UA->SS: Sending control commands

  • FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process. MJ: Covered; the protocol does not prevent this. The UA would send 0x03 End-of-stream, then STOP (see the sketch after this list).
  • DD46. For continuous recognition, we must support the ability to change grammars and parameters for each chunk/frame/result. MJ: Covered; this can be achieved by doing a START-MEDIA-STREAM, sending the media, doing an end-of-stream, making whatever parameter changes are needed, then starting to send media again. We need to clarify the relationship between START-MEDIA-STREAM and LISTEN: if everything is over one connection, how can we start sending media and then send a LISTEN? Do we do an end-of-stream, then a LISTEN, then start the stream again?
  • DD63. Every message from UA to speech service should send the UA-local timestamp. MJ: Do we need local time on all messages?
  • DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD. RB: We need a clearer definition of the "feedback mechanism", since it will need to be represented in the protocol. MJ: We have an INTERMEDIATE-RESULT message, but we don't have any kind of scheme for indicating the relationships between intermediate results and the final result.
  • DD76. It must be possible to do one or more re-recognitions with any request that you have indicated before first use that it can be re-recognized later. This will be indicated in the API by setting a parameter to indicate re-recognition. Any parameter can be changed, including the speech service. MJ: This relates to FPR57 above. Do we need a header in the recognition protocol that can be used to say "keep this around so we can ask you to recognize it again"? The case of re-reco with the same service is the only one that impacts the protocol. How long does the audio need to be kept around: for the duration of the WebSocket connection, or for the duration of the web session? When you make a new WebSocket connection and refer to the ID of previously stored audio, there is no guarantee that you end up on the same recognition server, so this then requires some common storage for holding audio for possible re-recognition. We need to make sure it can only be re-recognized by request from the same client. In the case of switching services, the audio will need to be stored locally by the UA and made available for future recognition requests.
  • DD78. Once there is a way (defined by another group) to get access to some blob of stored audio, we will support re-recognition of it. MJ: Does this impact the protocol? We don't say anything about where the media stream comes from; the UA could be reading it from said blob.
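
FPR59 above can be illustrated as follows: the UA stops sending audio, transmits the binary end-of-stream marker, and then issues a STOP request. The single-byte 0x03 marker and the STOP method come from MJ's comment; the request id and framing are assumptions.

    // Sketch: aborting capture and recognition mid-stream (FPR59).
    // Assumes `socket` is the open WebSocket over which binary audio frames
    // were being streamed, and that the capture code has already stopped sending.
    function abortRecognition(socket) {
      socket.send(new Uint8Array([0x03]));   // binary end-of-stream marker
      socket.send("STOP 0005\r\n");          // assumed framing for the STOP request
    }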

4.4 SS->UA: Sending recognition results

4.5 SS->UA: Sending relevant events

  • FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking).
  • FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition. MS: Superseded by the more concrete DD30.
  • FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition. MS: Superseded by the more concrete DD30.
  • FPR28. Speech recognition implementations should be allowed to fire implementation specific events. MS: Redundant with DD66.
  • DD30. We expect to have the following six audio/speech events: onaudiostart/onaudioend, onsoundstart/onsoundend, onspeechstart/onspeechend. The onsound* events represent a "probably speech but not sure" condition, while the onspeech* events represent the recognizer being sure there's speech. The former are low latency. An end event can only occur after at least one start event of the same type has occurred. Only the user agent can generate onaudio* events, the energy detector can only generate onsound* events, and the speech service can only generate onspeech* events. MJ: The only one for the protocol to deal with is onspeech, since that comes from the remote service. The relevant recognition events are START-OF-INPUT and END-OF-INPUT. Should we rename to more closely match the JS API events, so it is clear we are talking about speech detection, not the start of the stream?
  • DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events. MJ: * Does the protocol support DOM3 extension events fired from the server?
  • DD81. Protocol design should not prevent implementability of low-latency event delivery. MJ: Covered
  • DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent. RB: For ASR, we may need to clarify that this requirement also applies to re-reco. Even if the UA re-sends the audio stream, the base timestamp should be the same as the original transmission. MJ: The speech service would have to use the local UA timestamp for the beginning of the stream and, when it sends END-OF-INPUT, calculate the time with respect to that UA timestamp. We need to work through all the timestamp issues in detail.
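
To make the timing requirement in DD84 (and the related DD67 discussion) concrete, the sketch below shows one way a UA could convert service-side timing into a UA-local timestamp: it remembers the local time at which it started streaming audio and adds the offset reported in an END-OF-INPUT event. The Audio-Offset header is purely hypothetical; only the event name and the requirement itself come from the text above.

    // Sketch: converting service timing info into a UA-local timestamp (DD84).
    var socket = new WebSocket("wss://speech.example.com/service");
    var streamStartLocal = null;   // UA-local time when the first audio frame was sent

    function onFirstAudioFrameSent() {       // call when the first binary frame goes out
      streamStartLocal = Date.now();         // cf. DD67: timing reference for the stream
    }

    socket.onmessage = function (event) {
      if (typeof event.data !== "string") return;
      if (event.data.indexOf("END-OF-INPUT") === 0) {
        // Hypothetical header: milliseconds from the start of the audio stream
        // to the actual end of speech (not to when the event was sent).
        var match = /Audio-Offset:\s*(\d+)/.exec(event.data);
        if (match && streamStartLocal !== null) {
          var endOfSpeechLocal = streamStartLocal + Number(match[1]);
          console.log("End of speech at UA-local time", new Date(endOfSpeechLocal));
        }
      }
    };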

5. Synthesis

5.1 UA->SS: Initiating a TTS request and sending data for synthesis

  • FPR3. Implementation must support SSML. MS: Redundant with DD13. PE: Covered in 6.4.
  • FPR46. Web apps should be able to specify which voice is used for TTS. PE: Covered in 6.3.
  • DD13. For TTS, SSML 1.1 is mandatory to support, as is UTF-8 plain text. These are the only mandated formats. PE: SSML is stated under 2.0 Synthesizer point 1; UTF-8 support is never explicitly stated. (See the sketch after this list for an SSML request.)
  • DD37. The user agent will use the URI for the ASR engine or TTS service exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters. PE: Presumably covered, though I guess nothing in the protocol prevents it.
  • DD63. Every message from UA to speech service should send the UA-local timestamp. PE: We have relative timestamps, but not UA-local.
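
As a minimal illustration of initiating synthesis (FPR3/DD13/FPR46), the sketch below sends a SPEAK request whose body is an SSML 1.1 document containing a <mark>. The SPEAK method and the mark mechanism appear elsewhere in this document (see 5.4); the Content-Type header, the Voice-Name header, the blank-line separation of headers and body, and the request id are assumptions.

    // Sketch: requesting synthesis of an SSML document (DD13).
    var socket = new WebSocket("wss://speech.example.com/service");

    var ssml =
      '<?xml version="1.0"?>' +
      '<speak version="1.1" xml:lang="en-US"' +
      '       xmlns="http://www.w3.org/2001/10/synthesis">' +
      '  Your total is <mark name="total"/> twelve euros.' +
      '</speak>';

    socket.onopen = function () {
      socket.send(
        "SPEAK 0006\r\n" +
        "Content-Type: application/ssml+xml\r\n" +   // assumed header; SSML per DD13
        "Voice-Name: some-voice\r\n" +               // FPR46; hypothetical header name
        "\r\n" + ssml                                // assumed: body follows a blank line
      );
    };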

5.2 UA->SS: Sending control commands

5.3 SS->UA: Sending synthesis audio

  • FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. MS: Redundant with DD31, DD32, DD80. PE: We haven't specified any default codec.
  • DD31. There are 3 classes of codecs: audio to the web-app specified ASR engine, recognition from existing audio (e.g., local file), and audio from the TTS engine. We need to specify a mandatory-to-support codec for each. PE: No mandatory codecs specified.
  • DD32. It must be possible to specify and use other codecs in addition to those that are mandatory-to-implement. PE: Covered in 6.3.
  • DD33. Support for streaming audio is required -- in particular, that ASR may begin processing before the user has finished speaking. MS: Suggest adding "and that TTS can begin playback before receipt of all of the audio" (cf. DD82 below). PE: Covered in 6.3/6.4.
  • DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM. RB: Opus has also been mentioned a few times.
  • DD82. Protocol should support the client to begin TTS playback before receipt of all of the audio. MS: Suggest to merge into DD33, see above. PE: Covered in 6.3/6.4.
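
The protocol implication of DD33/DD82 is simply that the UA feeds binary audio frames to its audio output as they arrive instead of waiting for the whole response. A rough sketch, assuming the service streams 16-bit linear PCM at 16 kHz (one of the candidate codecs from DD80) in binary WebSocket frames, and using the Web Audio API purely for illustration:

    // Sketch: start TTS playback as audio frames arrive (DD33 / DD82).
    var socket = new WebSocket("wss://speech.example.com/service");
    socket.binaryType = "arraybuffer";

    var audioCtx = new AudioContext();
    var playCursor = 0;   // time at which the next chunk should start playing

    socket.onmessage = function (event) {
      if (typeof event.data === "string") return;    // text frames handled elsewhere
      var pcm = new Int16Array(event.data);          // assumed: 16-bit linear PCM, 16 kHz
      var buffer = audioCtx.createBuffer(1, pcm.length, 16000);
      var channel = buffer.getChannelData(0);
      for (var i = 0; i < pcm.length; i++) channel[i] = pcm[i] / 32768;
      var source = audioCtx.createBufferSource();
      source.buffer = buffer;
      source.connect(audioCtx.destination);
      playCursor = Math.max(playCursor, audioCtx.currentTime);
      source.start(playCursor);        // play this chunk as soon as its turn comes
      playCursor += buffer.duration;   // queue the next chunk right after it
    };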

5.4 SS->UA: Sending relevant events

  • FPR53. The web app should be notified when the audio corresponding to a TTS <mark> element is played back. MS: Redundant with DD61. PE: Covered in 6.2 by "SPEECH-MARKER"
  • FPR29. Speech synthesis implementations should be allowed to fire implementation specific events. MS: Redundant with DD66. PE: Covered in 6.2 by "SPEECH-MARKER"
  • DD61. When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played. MJ: ** Looks like our current mechanism for marks is to respond to SPEAK with IN-PROGRESS, send binary audio, stop the media stream from the TTS with 0x03, then send a SPEECH-MARKER event with IN-PROGRESS on it, then continue sending audio (see the sketch after this list). Looks like this is covered.
  • DD66. The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events. PE: More of an API req?
  • DD68. It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD. MJ: ** Not clear this is covered; would the vendor-specific event be contained within SPEECH-MARKER? PE: ...or within INTERIM-EVENT?
  • DD81. Protocol design should not prevent implementability of low-latency event delivery. PE: Covered in 6.2.
  • DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent. RB: This is written from the ASR point of view. TTS has a slightly different requirement. TTS timing should be expressed as an offset from the beginning of the render stream, since the UA can play any portion of the rendered audio at any time. PE: Covered in 6.2 by "COMPLETE" Speech-Marker:timestamp=NNNN, or is that a relative timestamp? Don't see UA-local time passed anywhere.
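
MJ's description of the mark mechanism under DD61 can be illustrated from the UA side: binary audio frames and SPEECH-MARKER text messages arrive interleaved on the same connection, and each SPEECH-MARKER is surfaced to the web application as a Javascript event carrying the mark name and timing (DD61, DD84). The SPEECH-MARKER message name and the Speech-Marker timestamp header appear in the comments above; the "name=" field, the header layout and the use of a DOM CustomEvent are assumptions.

    // Sketch: turning SPEECH-MARKER protocol events into Javascript mark events (DD61).
    var socket = new WebSocket("wss://speech.example.com/service");
    socket.binaryType = "arraybuffer";

    socket.onmessage = function (event) {
      if (typeof event.data !== "string") return;    // binary frame: TTS audio (see 5.3)
      if (event.data.indexOf("SPEECH-MARKER") === 0) {
        // Hypothetical header layout carrying the mark name and its time offset.
        var name = /name=(\S+)/.exec(event.data);
        var offset = /timestamp=(\d+)/.exec(event.data);
        if (name) {
          // Surface the mark to the web application as a DOM event (cf. DD61).
          document.dispatchEvent(new CustomEvent("mark", {
            detail: { name: name[1], offset: offset ? Number(offset[1]) : null }
          }));
        }
      }
    };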

References

[HTMLSPEECH]
Bodell et al., eds.: HTML Speech Incubator Group Final Report (Internal Draft). http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110607.html