Re: Collection of req's and design decisions relevant for protocol discussion from JOHNSTON, MICHAEL J (MICHAEL J) on 2011-06-09 (public-xg-htmlspeech@w3.org from June 2011)

From: JOHNSTON, MICHAEL J (MICHAEL J) <johnston@research.att.com>
Date: Thu, 9 Jun 2011 12:29:58 -0400
To: Robert Brown <Robert.Brown@microsoft.com>
CC: Marc Schroeder <marc.schroeder@dfki.de>, "Milan Young (Nuance)" <Milan.Young@nuance.com>, "Satish Sampath (Google)" <satish@google.com>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "EHLEN, PATRICK (ATTSI)" <pehlen@attinteractive.com>, HTML Speech XG <public-xg-htmlspeech@w3.org>, "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, Michael Bodell <mbodell@microsoft.com>
Message-ID: <B076F07A-16E6-48B7-9800-9DE7590C4032@research.att.com>

MJ: One we need to rework is DD35:

DD35. We will require support for http for all communication between the user agent and any selected engine, including chunked http for media streaming, and support negotiation of other protocols (such as WebSockets or whatever RTCWeb/WebRTC comes up with).

MJ: Since we are moving away from simple HTTP to support the full set of use cases, here is a suggested DD35 rewrite:

DD35. We will require support for WebSockets for communication between the user agent and any selected engine, including chunked audio for media streaming.
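
To make "chunked audio" concrete, here is a rough sketch of how a UA might split captured audio into fixed-size pieces before handing each one to a WebSocket as a binary message. The function name chunk_audio and the 3200-byte default (100 ms of 16 kHz, 16-bit mono linear PCM) are assumptions for illustration only; the group has not agreed on a chunk size or framing, and the actual send call is left out.

```python
from typing import Iterator


def chunk_audio(pcm: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield consecutive slices of `pcm`, each at most `chunk_size` bytes.

    3200 bytes is 100 ms of 16 kHz, 16-bit mono PCM (an assumed default).
    The final chunk may be shorter than `chunk_size`.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(pcm), chunk_size):
        yield pcm[start:start + chunk_size]
```

Each yielded chunk would be sent as one binary message on the connection; concatenating the chunks in order gives back the original audio.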

MJ: The question remains whether we also need to support negotiation of other protocols ...

On Jun 8, 2011, at 8:01 PM, Robert Brown wrote:

Thanks again Marc. This is a very thorough compilation. Exactly what I was hoping for.
(And yes, the way it's structured makes good sense.)

In reading through it, I think there are a number of requirements or design agreements that will need to be clarified or adjusted. Some are obvious (in some cases "engine" becomes "service", "ASR" becomes "ASR + TTS" and "protocol" becomes "UA").

But others probably need rewrites that we'll need to get group consensus on. I've supplied suggested rewrites for discussion.

---
DD16. There must be no technical restriction that would prevent implementing only TTS or only ASR. There is *mostly* agreement on this.

RB: The corollary to this is that a service MAY implement only TTS, only ASR, or both.

---
DD38. The scripting API communicates its parameter settings by sending them in the body of a POST request as Media Type "multipart". The subtype(s) accepted (e.g., mixed, formdata) are TBD.
DD39. If an ASR engine allows parameters to be specified in the URI in addition to in the POST body, when a parameter is specified in both places the one in the body takes precedence. This has the effect of making parameters set in the URI be treated as default values.

RB: These imply HTTP, and pre-date the discussion of continuous speech. Also, DD39 talks about engines rather than services, and doesn't mention TTS. They should be rewritten.

RB: Suggested DD38 rewrite: The scripting API communicates its parameter settings by sending them as typed content in the protocol.

RB: Suggested DD39 rewrite: If an ASR or TTS service allows parameters to be specified in the URI in addition to being transported as content, when a parameter is specified in both places the one in the content takes precedence. This has the effect of making parameters set in the URI be treated as default values.
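
The precedence rule in the DD39 rewrite could be sketched as below, from the service's point of view. This is only an illustration: the function name effective_params, the example URI, and the parameter names are made up, and it assumes parameters arrive as simple string key/value pairs.

```python
from typing import Dict, Mapping
from urllib.parse import parse_qsl, urlsplit


def effective_params(service_uri: str, content_params: Mapping[str, str]) -> Dict[str, str]:
    """Merge URI query parameters with parameters sent as content.

    URI parameters act as defaults; a parameter that also appears in the
    content takes the content's value.
    """
    uri_defaults = dict(parse_qsl(urlsplit(service_uri).query, keep_blank_values=True))
    merged = dict(uri_defaults)
    merged.update(content_params)  # content wins on conflict
    return merged
```

For example, with a URI of "wss://example.org/asr?lang=en-US&maxresults=3" and content parameters {"lang": "de-DE"}, the service would use lang=de-DE and maxresults=3.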

---
DD69. HTTPS must also be supported.

RB: Should be written as "It MUST be possible to use an encrypted protocol."

---
DD67. The protocol must send its current timestamp to the speech service when it sends its first audio data.

RB: Should say "The UA must...". I'm also nervous that it's premature to have made this decision. I'd prefer that we say "The protocol and UA must communicate sufficient timing information for the UA to determine the precise local timestamp for each service-generated event."

---
DD75. There will be an API method for sending text input rather than audio. There must also be a parameter to indicate how text matching should be done, including at least "strict" and "fuzzy". Other possible ways could be defined as vendor-specific additions.

RB: Unless we can specify exactly what "strict" means (and I don't think we can), I'd prefer wording like: "There will be an API method for sending text input rather than audio, resulting in a match or nomatch event as if the text had actually been spoken. The precise algorithm for performing the match is at the discretion of the ASR service, and may optionally be modified by service-specific parameters".

---
DD77. In the protocol, the client must store the audio for re-recognition. It may be possible for the server to indicate that it also has stored the audio so it doesn't have to be resent.

RB: Should say "In the UA,..."

---
DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in addition to plain old mu-law/a-law/linear PCM.

RB: Opus has also been mentioned a few times.

---
DD74. Bjorn's email on continuous recognition represents our decisions regarding continuous recognition, except that there needs to be a feedback mechanism which could result in the service sending replaces. We may refer to "intermediate" as "partial", but naming changes such as this are TBD.

RB: We need a clearer definition of the "feedback mechanism", since it will need to be represented in the protocol.

---
DD37. The user agent will use the URI for the ASR engine exactly as specified by the web application, including all parameters, and will not modify it to add, remove, or change parameters.

RB: Should be "ASR or TTS service".

---
DD84. Every event from speech service to the user agent must include timing information that the UA can convert into a UA-local timestamp. This timing info must be for the occurrence represented by the event, not the event time itself. For example, an end-of-speech event would contain timing for the actual end of speech, not the time when the speech service realizes end of speech occurred or when the event is sent.

RB: This is written from the ASR point of view. TTS has a slightly different requirement. TTS timing should be expressed as an offset from the beginning of the render stream, since the UA can play any portion of the rendered audio at any time.

RB: Also, for ASR, we may need to clarify that this requirement also applies to re-reco. Even if the UA re-sends the audio stream, the base timestamp should be the same as the original transmission.
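
One possible reading of the ASR case, as a sketch: assume the service reports each event as a millisecond offset from the start of the audio stream, and the UA remembers its local clock value for the first audio it sent. The names RecoSession, base_timestamp_ms, and to_local_timestamp_ms are hypothetical, and the offset-based representation is an assumption rather than anything the group has decided.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoSession:
    # UA-local clock (ms) at the moment the first audio of the original
    # transmission was captured. Kept unchanged across re-recognition.
    base_timestamp_ms: int


def to_local_timestamp_ms(session: RecoSession, event_offset_ms: int) -> int:
    """Convert a service-reported offset into a UA-local timestamp.

    `event_offset_ms` is assumed to describe when the occurrence happened
    in the audio (e.g. the actual end of speech), not when the event was sent.
    """
    if event_offset_ms < 0:
        raise ValueError("event offset must not be negative")
    return session.base_timestamp_ms + event_offset_ms


# Original recognition and a later re-reco reuse the same session object,
# so the same offset maps to the same local timestamp both times.
session = RecoSession(base_timestamp_ms=1_000_000)
first_pass = to_local_timestamp_ms(session, 2_500)
re_reco = to_local_timestamp_ms(session, 2_500)
```

For TTS, under the same assumption, the UA would apply the offset to wherever it started (or resumed) playback of the render stream rather than to a fixed base timestamp.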

Received on Thursday, 9 June 2011 16:30:41 UTC