Purpose of this document
This document aims to summarize requirements and design decisions relevant
for the specification of a protocol supporting the communication between a User
Agent (UA) and a Speech Service (SS). The summary is based on a subset of the
requirements (FPR) and design decisions (DD) listed in the draft final report
[HTMLSPEECH].
To help verify that the group members share a common view of what has been
agreed, and to identify obvious omissions that should be pinned down, this
document groups the items by aspects of the protocol's envisaged use.
- RB: Robert Brown
- MJ: Michael Johnston
- MS: Marc Schröder
1. Relevant aspects of the interaction between UA and SS
In order to structure the collection of requirements and design decisions,
this document groups them according to the following aspects of the interaction
between UA and SS.
- UA->SS: Generic capability requests
- Recognition
- UA->SS: Initiating an ASR request
- UA->SS: Sending audio and related data for recognition
- UA->SS: Sending control commands
- SS->UA: Sending recognition results
- SS->UA: Sending relevant events
- Synthesis
- UA->SS: Initiating a TTS request and sending data for
synthesis
- UA->SS: Sending control commands
- SS->UA: Sending synthesis audio
- SS->UA: Sending relevant events
This is an ad-hoc structure which may or may not capture other group
members' understanding of the mechanism. One reason for proposing it is to
verify whether there is consensus about these aspects.
Requirements or design decisions are listed under more than one heading if
they seem to be relevant to several aspects.
2. Generic protocol-related requirements
- FPR55. Web application must be able to encrypt communications to remote
speech service.
- FPR31. User agents and speech services may agree to use alternate
protocols for communication.
- DD8. Speech service implementations must be referenceable by URI.
- DD16. There must be no technical restriction that would prevent
implementing only TTS or only ASR. There is *mostly* agreement on this.
- DD35. We will require support for HTTP for all communication between the
user agent and any selected engine, including chunked HTTP for media
streaming, and support negotiation of other protocols (such as WebSockets or
whatever RTCWeb/WebRTC comes up with).
- DD38. The scripting API communicates its parameter settings by sending
them in the body of a POST request as Media Type "multipart". The subtype(s)
accepted (e.g., mixed, form-data) are TBD. (A hedged sketch of such a
request follows this list.)
- DD39. If an ASR engine allows parameters to be specified in the URI in
addition to the POST body, when a parameter is specified in both places the
one in the body takes precedence. This has the effect of making parameters
set in the URI be treated as default values.
- DD56. The API will support multiple simultaneous requests to speech
services (same or different, ASR and TTS).
- DD62. It must be possible to specify service-specific parameters in both
the URI and the message body. It must be clear in the API that these
parameters are service-specific, i.e., not standard.
- DD64. API must have ability to set service-specific parameters using
names that clearly identify that they are service-specific, e.g., using an
"x-" prefix. Parameter values can be arbitrary Javascript objects.
- DD69. HTTPS must also be supported.
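To make DD38 and DD39 concrete, here is a minimal sketch, assuming a
browser-style fetch/FormData environment, of how a UA might send parameter
settings in a multipart POST body while leaving any parameters already
present in the service URI untouched as defaults. The endpoint path and
parameter names below are illustrative assumptions, not agreed protocol
elements.

```typescript
// Minimal sketch of DD38/DD39: parameters travel in a multipart POST body;
// parameters already present in the service URI act only as defaults.
// The endpoint path and parameter names are illustrative assumptions.

async function sendRecognitionParameters(
  serviceUri: string,                  // used exactly as given by the web app (DD37)
  bodyParams: Record<string, string>   // parameter settings from the scripting API
): Promise<Response> {
  const form = new FormData();         // serialized as a "multipart" body (exact subtype TBD, DD38)
  for (const [name, value] of Object.entries(bodyParams)) {
    form.append(name, value);          // a body value overrides a same-named URI parameter (DD39)
  }
  return fetch(serviceUri, { method: "POST", body: form });
}

// Example: "maxresults" in the body takes precedence over "maxresults=3" in
// the URI, so the URI value behaves as a default. "x-vendor-noise-model"
// shows a service-specific parameter name per DD62/DD64 (hypothetical).
sendRecognitionParameters(
  "https://asr.example.com/recognize?maxresults=3",
  { language: "en-US", maxresults: "5", "x-vendor-noise-model": "car" }
);
```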
3. UA->SS: Generic capability requests
- FPR39. Web application must be able to be notified when the selected
language is not available.
- FPR11. If the web apps specify speech services, it should be possible to
specify parameters.
- DD49. The API should provide a way to determine if a service is available
before trying to use the service; this applies to the default service as
well.
- DD50. The API must provide a way to query the availability of a specific
configuration of a service.
- DD51. The API must provide a way to ask the user agent for the
capabilities of a service. In the case of private information that the user
agent may have when the default service is selected, the user agent may
choose to answer with "no comment" (or equivalent). (A hypothetical sketch
of such capability queries follows this list.)
- DD52. Informed user consent is required for all use of private
information. This includes list of languages for ASR and voices for TTS.
When such information is requested by the web app or speech service and
permission is refused, the API must return "no comment" (or
equivalent).
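As an illustration of DD49-DD52, the following sketch, assuming entirely
hypothetical interface and method names, shows one possible shape for
availability and capability queries in which the user agent may answer
"no comment" instead of disclosing private information.

```typescript
// Hypothetical capability-query shapes for DD49-DD52; none of these names are agreed API.

type CapabilityAnswer<T> =
  | { kind: "value"; value: T }
  | { kind: "no-comment" };              // DD51/DD52: the UA may decline to disclose

interface SpeechServiceCapabilities {
  available(): Promise<boolean>;                      // DD49/DD50: is this configuration usable?
  languages(): Promise<CapabilityAnswer<string[]>>;   // DD52: private info requires informed consent
  voices(): Promise<CapabilityAnswer<string[]>>;
}

// FPR39: the web app can learn that its selected language is not available.
async function checkLanguage(
  caps: SpeechServiceCapabilities,
  wanted: string
): Promise<"available" | "unavailable" | "unknown"> {
  const answer = await caps.languages();
  if (answer.kind === "no-comment") {
    return "unknown";                    // UA withheld the list; the app must simply try it
  }
  return answer.value.includes(wanted) ? "available" : "unavailable";
}
```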
4. Recognition
4.1 UA->SS: Initiating an ASR request
- FPR38. Web application must be able to specify language of recognition.
- FPR45. Applications should be able to specify the grammars (or lack
thereof) separately for each recognition.
- FPR34. Web application must be able to specify domain specific custom
grammars.
- FPR48. Web application author must be able to specify a domain specific
statistical language model.
- FPR2. Implementations must support the XML format of SRGS and must support
SISR.
- FPR44. Recognition without specifying a grammar should be possible.
- FPR58. Web application and speech services must have a means of binding
session information to communications.
- FPR57. Web applications must be able to request recognition based on
previously sent audio.
- DD9. It must be possible to reference ASR grammars by URI.
- DD10. It must be possible to select the ASR language using language
tags.
- DD11. It must be possible to leave the ASR grammar unspecified. Behavior
in this case is not yet defined.
- DD12. The XML format of SRGS 1.0 is mandatory to support, and it is the
only mandated grammar format. Note in particular that this means we do not
have any requirement for SLM support or SRGS ABNF support.
- DD14. SISR 1.0 support is mandatory, and it is the only mandated semantic
interpretation format.
- DD20. For grammar URIs, the "HTTP" and "data" protocol schemes must be
supported.
- DD21. A standard set of common-task grammars must be supported. The
details of what those are are TBD.
- DD36. Maxresults should be an ASR parameter representing the maximum
number of results to return.
- DD37. The user agent will use the URI for the ASR
engine exactly as specified by the web
application, including all parameters, and will not modify it to add,
remove, or change parameters.
- DD55. The API will support multiple simultaneous grammars, any
combination of allowed grammar formats. It will also support a weight on
each grammar.
- DD63. Every message from UA to speech service should send the UA-local
timestamp.
- DD72. In Javascript, speech reco requests should have an attribute for a
sequence of grammars, each of which can have properties, including weight
(and possibly language, but that is TBD).
- DD76. It must be possible to do one or more re-recognitions with any
request for which you have indicated, before first use, that it can be
re-recognized later. This will be indicated in the API by setting a
parameter to indicate re-recognition. Any parameter can be changed,
including the speech service. (A hypothetical request sketch follows this
list.)
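As a way of reading several of these decisions together, the sketch below
shows one hypothetical request object combining grammars referenced by URI
with per-grammar weights (DD9, DD55, DD72), a language tag (DD10), a
maxresults parameter (DD36), and a flag marking the request as
re-recognizable (DD76). All property names are assumptions for illustration
only.

```typescript
// Hypothetical ASR request shape reflecting DD9, DD10, DD36, DD55, DD72, and DD76.
// Property names are illustrative assumptions, not agreed API.

interface GrammarRef {
  src: string;        // DD9/DD20: grammar referenced by an "http:" or "data:" URI
  weight?: number;    // DD55/DD72: optional per-grammar weight
  language?: string;  // possibly a per-grammar language (TBD per DD72)
}

interface RecognitionRequest {
  serviceUri: string;        // DD8: the speech service is referenceable by URI
  language: string;          // DD10: language selected via a language tag
  grammars: GrammarRef[];    // DD11: may be empty; behavior in that case is not yet defined
  maxresults?: number;       // DD36
  reRecognizable?: boolean;  // DD76: must be set before first use to allow later re-recognition
}

const request: RecognitionRequest = {
  serviceUri: "https://asr.example.com/recognize",
  language: "en-US",
  grammars: [
    // SRGS 1.0 XML grammars (DD12), with SISR 1.0 semantics (DD14)
    { src: "https://example.com/grammars/pizza.grxml", weight: 0.8 },
    { src: "data:application/srgs+xml;base64,PGdyYW1tYXIvPg==", weight: 0.2 },
  ],
  maxresults: 5,
  reRecognizable: true,
};
```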
4.2 UA->SS: Sending audio and related data for
recognition
- FPR25. Implementations should be allowed to start processing captured
audio before the capture completes.
- FPR26. The API to do recognition should not introduce unneeded latency.
- FPR33. There should be at least one mandatory-to-support codec that isn't
encumbered with IP issues and has sufficient fidelity & low bandwidth
requirements.
- FPR56. Web applications must be able to request NL interpretation based
only on text input (no audio sent).
- DD31. There are 3 classes of codecs: audio to the web-app specified ASR
engine, recognition from existing audio (e.g., local file), and audio from
the TTS engine. We need to specify a mandatory-to-support codec for
each.
- DD32. It must be possible to specify and use other codecs in addition to
those that are mandatory-to-implement.
- DD33. Support for streaming audio is required -- in particular, that ASR
may begin processing before the user has finished speaking.
- DD63. Every message from UA to speech service should send the UA-local
timestamp.
- DD67. The protocol must send its current timestamp to the speech service
when it sends its first audio data. (See the streaming sketch after this
list.)
- DD75. There will be an API method for sending text input rather than
audio. There must also be a parameter to indicate how text matching should
be done, including at least "strict" and "fuzzy". Other possible ways could
be defined as vendor-specific additions.
- DD77. In the protocol, the client must store the audio for
re-recognition. It may be possible for the server to indicate that it also
has stored the audio so it doesn't have to be resent.
- DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in
addition to plain old mu-law/a-law/linear PCM.
- DD83. We will not require support for video codecs. However, protocol
design must not prohibit transmission of codecs that have the same
interface requirements as audio codecs.
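As one way of reading DD33, DD63, and DD67 together, here is a minimal
sketch, assuming a hypothetical message shape and a generic send function
(for example a WebSocket or chunked-HTTP writer per DD35), that streams
captured audio in chunks and marks the first chunk with the UA's current
timestamp.

```typescript
// Illustrative streaming of captured audio (FPR25/DD33) with UA-local timestamps
// (DD63/DD67). The message shape is an assumption, not an agreed wire format.

interface AudioMessage {
  timestamp: number;     // DD63: every message carries the UA-local timestamp
  firstChunk: boolean;   // DD67: the first audio message announces the UA's current time
  codec: string;         // DD31/DD32: negotiated codec, e.g. "audio/flac" (mandatory codec still TBD)
  data: ArrayBuffer;
}

async function streamAudio(
  send: (msg: AudioMessage) => Promise<void>,   // e.g. over WebSockets or chunked HTTP (DD35)
  chunks: AsyncIterable<ArrayBuffer>,
  codec: string
): Promise<void> {
  let first = true;
  for await (const data of chunks) {
    // Sending chunk by chunk lets the service begin recognition before capture completes.
    await send({ timestamp: Date.now(), firstChunk: first, codec, data });
    first = false;
  }
}
```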
4.3 UA->SS: Sending control commands
- FPR59. While capture is happening, there must be a way for the web
application to abort the capture and recognition process.
- DD46. For continuous recognition, we must support the ability to change
grammars and parameters for each chunk/frame/result. (A hypothetical sketch
of such control messages follows this list.)
- DD63. Every message from UA to speech service should send the UA-local
timestamp.
- DD74. Bjorn's email on continuous recognition represents our decisions
regarding continuous recognition, except that there needs to be a feedback
mechanism which could result in the service sending replaces. We may refer
to "intermediate" as "partial", but naming changes such as this are TBD.
- DD76. It must be possible to do one or more re-recognitions with any
request for which you have indicated, before first use, that it can be
re-recognized later. This will be indicated in the API by setting a
parameter to indicate re-recognition. Any parameter can be changed,
including the speech service.
- DD78. Once there is a way (defined by another group) to get access to
some blob of stored audio, we will support re-recognition of it.
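The following sketch illustrates FPR59 and DD46 with two hypothetical
client-side control messages: one that aborts capture and recognition
mid-stream, and one that changes grammars or parameters between chunks
during continuous recognition. Message and field names are assumptions.

```typescript
// Hypothetical control messages for FPR59 (abort) and DD46 (per-chunk parameter
// changes during continuous recognition). Names are illustrative only.

type ControlMessage =
  | { type: "abort"; timestamp: number }           // FPR59: abort capture and recognition
  | {
      type: "set-params";                          // DD46: change grammars/parameters mid-stream
      timestamp: number;
      grammars?: { src: string; weight?: number }[];
      params?: Record<string, string>;
    };

function abortRecognition(send: (msg: ControlMessage) => void): void {
  send({ type: "abort", timestamp: Date.now() });  // DD63: include the UA-local timestamp
}

function switchGrammar(send: (msg: ControlMessage) => void, grammarUri: string): void {
  send({
    type: "set-params",
    timestamp: Date.now(),
    grammars: [{ src: grammarUri, weight: 1.0 }],  // takes effect for subsequent chunks/frames/results
  });
}
```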
4.4 SS->UA: Sending recognition results
4.5 SS->UA: Sending relevant events
5. Synthesis
5.1 UA->SS: Initiating a TTS request and sending data for
synthesis
5.2 UA->SS: Sending control commands
5.3 SS->UA: Sending synthesis audio
- FPR33. There should be at least one mandatory-to-support codec that isn't
encumbered with IP issues and has sufficient fidelity & low bandwidth
requirements.
- DD31. There are 3 classes of codecs: audio to the web-app specified ASR
engine, recognition from existing audio (e.g., local file), and audio from
the TTS engine. We need to specify a mandatory-to-support codec for
each.
- DD32. It must be possible to specify and use other codecs in addition to
those that are mandatory-to-implement.
- DD33. Support for streaming audio is required -- in particular, that ASR
may begin processing before the user has finished speaking.
- DD80. Candidate codecs to consider are Speex, FLAC, and Ogg Vorbis, in
addition to plain old mu-law/a-law/linear PCM.
- DD82. The protocol should allow the client to begin TTS playback before
receipt of all of the audio. (A hedged playback sketch follows this list.)
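To illustrate DD82, the sketch below starts playback as soon as the first
audio chunk arrives instead of waiting for the complete synthesis result.
The AudioSink interface is a stand-in assumption for whatever audio path the
UA actually uses, not a real API.

```typescript
// Illustration of DD82: begin TTS playback before all of the audio has arrived.
// AudioSink is a stand-in assumption for the UA's real audio path.

interface AudioSink {
  start(): void;                      // begin playing whatever has been buffered so far
  enqueue(chunk: ArrayBuffer): void;  // append a further chunk of synthesized audio
  end(): void;                        // signal that no more audio will arrive
}

async function playStreamingTts(
  chunks: AsyncIterable<ArrayBuffer>, // audio chunks as delivered by the service (codecs per DD31/DD32)
  sink: AudioSink
): Promise<void> {
  let started = false;
  for await (const chunk of chunks) {
    sink.enqueue(chunk);
    if (!started) {
      sink.start();                   // DD82: playback starts on the first chunk, not on completion
      started = true;
    }
  }
  sink.end();
}
```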
5.4 SS->UA: Sending relevant events
- FPR53. The web app should be notified when the audio corresponding to a
TTS <mark> element is played back.
- FPR29. Speech synthesis implementations should be allowed to fire
implementation specific events.
- DD61. When audio corresponding to a TTS mark location begins to play, a
Javascript event must be fired, and the event must contain the name of the
mark and the UA timestamp for when it was played. (A hypothetical event
sketch follows this list.)
- DD66. The API must support DOM 3 extension events as defined (which
basically require vendor prefixes). See
http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions.
It must allow the speech service to fire these events.
- DD68. It must be possible for the speech service to instruct the UA to
fire a vendor-specific event when a specific offset to audio playback start
is reached by the UA. What to do if audio is canceled, paused, etc. is
TBD.
- DD81. Protocol design should not prevent implementability of low-latency
event delivery.
- DD84. Every event from speech service to the user agent must include
timing information that the UA can convert into a UA-local timestamp. This
timing info must be for the occurrence represented by the event, not the
event time itself. For example, an end-of-speech event would contain timing
for the actual end of speech, not the time when the speech service realizes
end of speech occurred or when the event is sent.
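The sketch below gives one hypothetical shape for the events in this
section: a mark event carrying the mark name and the UA timestamp at which
it began to play (DD61, FPR53), and a vendor-prefixed extension event
(FPR29, DD66/DD68). Per DD84, the timestamp refers to the occurrence the
event describes, not to when the event was generated or delivered. All names
are illustrative assumptions.

```typescript
// Hypothetical event shapes for DD61, DD66/DD68, and DD84; names are illustrative only.

interface TtsMarkEvent {
  type: "mark";
  name: string;        // DD61: name of the SSML <mark> element
  uaTimestamp: number; // DD61: UA-local time at which the marked audio began to play
}

interface VendorEvent {
  type: string;        // DD66: vendor-prefixed event name, e.g. "x-acme-barge-in" (hypothetical)
  offsetMs?: number;   // DD68: offset from audio playback start at which the UA fires it
  uaTimestamp: number; // DD84: time of the occurrence itself, converted to UA-local time
}

// FPR53/FPR29: the web app is notified of mark playback and vendor-specific events.
function onSpeechServiceEvent(ev: TtsMarkEvent | VendorEvent): void {
  if (ev.type === "mark") {
    const mark = ev as TtsMarkEvent;
    console.log(`mark "${mark.name}" played at ${mark.uaTimestamp}`);
  } else {
    console.log(`vendor event ${ev.type} at ${ev.uaTimestamp}`);
  }
}
```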