Both the signaling and media transmission aspects of the html-speech/1.0 protocol inherit a number of security features from the underlying WebSockets protocol [WS-PROTOCOL]:
Clients may authenticate servers using standard TLS, simply by using the WSS: URI scheme rather than the WS: scheme in the service URI. This is standard WebSockets functionality, in much the same way that HTTP selects TLS through the HTTPS: scheme.
Similarly, all traffic (media and signaling) is encrypted by TLS when the WSS: URI scheme is used.
User authentication, when required by a server, will commonly be done using the standard [HTTP] challenge-response mechanism during the initial WebSocket handshake. A server may also choose to use TLS client authentication; although this will probably be uncommon, WebSockets stacks should support it.
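The scheme rule above can be checked before a connection is opened. The following is a minimal sketch, not part of the protocol: the function name `uses_tls` and the example host are hypothetical, and it only inspects the URI scheme rather than performing any TLS negotiation.

```python
from urllib.parse import urlsplit


def uses_tls(service_uri: str) -> bool:
    """Return True if the service URI selects TLS (wss:), False for plain ws:.

    Hypothetical helper: it inspects only the scheme of the URI and raises
    ValueError for anything that is not a WebSocket URI.
    """
    scheme = urlsplit(service_uri).scheme.lower()
    if scheme not in ("ws", "wss"):
        raise ValueError(f"not a WebSocket URI: {service_uri!r}")
    return scheme == "wss"


if __name__ == "__main__":
    # speech.example.org is a placeholder host, not a real service.
    for uri in ("wss://speech.example.org/reco", "ws://speech.example.org/reco"):
        label = "TLS (server authenticated, traffic encrypted)" if uses_tls(uri) else "no TLS"
        print(uri, "->", label)
```

A client that requires server authentication and encrypted media could refuse to connect whenever this check returns False.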
HTML speech network scenarios also have security boundaries outside of signaling and media:
Clients may ask servers to access resources (SRGS documents, SSML documents, audio files, etc.) from a third location. This may happen either because the application refers to them by URI, or because an already loaded resource contains a URI reference to a separate resource. The server will need permission to access these resources.
In some cases, this may be accomplished out-of-band by the developer, administrator, or provisioning system setting up appropriate permissions and identities beforehand.
In some cases, this may be accomplished by the use of short-lived URIs that are only valid for a limited period. Again, the use of this technique is implemented out-of-band from the html-speech/1.0 protocol.
In some cases, the client may have a cookie containing a secret that is used to authorize access to the resource. In this case, the cookie may be passed to the speech server using cookie headers in a request. The server would then use this cookie when accessing resources required by that request.
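The cookie-based option can be sketched from the server's point of view. This is an illustration under assumptions: the text above does not give the exact header name, so a header called `Cookie` is assumed, and `resource_fetch_headers` is a hypothetical helper rather than anything defined by html-speech/1.0.

```python
from typing import Dict


def resource_fetch_headers(request_headers: Dict[str, str]) -> Dict[str, str]:
    """Build the headers a server might send when fetching a resource
    on behalf of a single client request.

    Assumption: the client's secret arrives in a header named "Cookie"
    (matched case-insensitively). Only that header is forwarded.
    """
    for name, value in request_headers.items():
        if name.lower() == "cookie":
            return {"Cookie": value}
    # No cookie supplied: fetch the resource without credentials.
    return {}


if __name__ == "__main__":
    # Hypothetical request headers; the cookie value is made up.
    request = {"Content-Type": "application/srgs+xml", "Cookie": "session=abc123"}
    print(resource_fetch_headers(request))
```

Computing the headers per request, rather than caching them on the connection, matches the statement that the cookie is used for resources required by that request.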
When there are no input media streams and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the listen request will fail with a 4xx status. When the recognizer is in the listening state and all input streams have ended, it automatically transitions to the idle state and issues a RECOGNITION-COMPLETE event with Completion-Cause set to 480 ("no-input-stream").
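The two rules above can be modelled as a small state machine. This is a sketch under stated assumptions: the class `Recognizer` and its method names are hypothetical, a rejected listen request is reported as `False` because the text gives only the 4xx class and no specific code, and recognition driven by Input-Waveform-URI is not modelled beyond acceptance of the request.

```python
from typing import List, Optional, Set, Tuple


class Recognizer:
    """Toy model of the idle/listening transitions described above."""

    def __init__(self) -> None:
        self.state = "idle"
        self.open_streams: Set[str] = set()
        self.events: List[Tuple[str, dict]] = []

    def add_stream(self, stream_id: str) -> None:
        """Register an input media stream that is currently open."""
        self.open_streams.add(stream_id)

    def listen(self, input_waveform_uri: Optional[str] = None) -> bool:
        """Try to enter the listening state.

        Returns False (standing in for a 4xx failure) when there is no
        input stream and no Input-Waveform-URI was given.
        """
        if not self.open_streams and input_waveform_uri is None:
            return False
        self.state = "listening"
        return True

    def end_stream(self, stream_id: str) -> None:
        """Mark a stream as ended; go idle once the last one has ended."""
        if stream_id not in self.open_streams:
            return
        self.open_streams.remove(stream_id)
        if self.state == "listening" and not self.open_streams:
            self.state = "idle"
            self.events.append(
                ("RECOGNITION-COMPLETE",
                 {"Completion-Cause": (480, "no-input-stream")})
            )


if __name__ == "__main__":
    reco = Recognizer()
    print(reco.listen())   # rejected: no streams, no waveform URI
    reco.add_stream("audio-1")
    print(reco.listen())   # accepted
    reco.end_stream("audio-1")
    print(reco.state, reco.events)
```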