3.4 Security

Both the signaling and media transmission aspects of the html-speech/1.0 protocol inherit a number of security features from the underlying WebSockets protocol [WS-PROTOCOL]:

Server Authentication: Clients may authenticate servers using standard TLS, simply by using the wss: URI scheme rather than the ws: scheme in the service URI. This is standard WebSockets functionality, in much the same way that HTTP selects TLS through the https: scheme.
Encryption: Similarly, all traffic (media and signaling) is encrypted by TLS when the wss: URI scheme is used.
User Authentication: User authentication, when required by a server, will commonly be done using the standard [HTTP] challenge-response mechanism during the initial WebSocket handshake. A server may also choose to use TLS client authentication; although this will probably be uncommon, WebSockets stacks should support it.
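
The following is a minimal client-side sketch of the points above, not part of the protocol definition. It checks that a service URI uses the wss: scheme and builds an Authorization header for the handshake. It assumes HTTP Basic as the challenge-response scheme (a server may use another), and the function names and the example host are hypothetical. Refusing to send credentials over ws: is a policy choice of this sketch rather than a protocol requirement.

```python
import base64
from urllib.parse import urlsplit


def is_tls_protected(service_uri: str) -> bool:
    """Return True when the service URI uses the wss: scheme (WebSocket over TLS)."""
    return urlsplit(service_uri).scheme.lower() == "wss"


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header (assumed scheme, for illustration)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def handshake_headers(service_uri: str, username: str, password: str) -> dict:
    """Return credentials for the WebSocket handshake, but only for wss: URIs."""
    if not is_tls_protected(service_uri):
        # Policy of this sketch: never expose credentials on an unencrypted connection.
        raise ValueError("refusing to send credentials over a ws: connection")
    return basic_auth_header(username, password)


# Hypothetical service URI; the headers would be passed to whatever
# WebSockets stack the client uses when it opens the connection.
headers = handshake_headers("wss://speech.example.com/recognize", "user", "pass")
```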

HTML speech network scenarios also have security boundaries outside of signaling and media:

Transitive Access to Resources

Clients may ask servers to access resources (SRGS documents, SSML documents, audio files, etc.) hosted at a third location. This may be the result either of the application referring to them by URI, or of an already loaded resource containing a URI reference to a separate resource. The server will need permission to access these resources.

In some cases, this may be accomplished out-of-band by the developer, administrator, or provisioning system, setting up appropriate permissions and identities beforehand.

In some cases, this may be accomplished by the use of short-lived URIs that are only valid for a limited period. Again, this technique is implemented out-of-band from the html-speech/1.0 protocol.
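
Because short-lived URIs are handled out-of-band, the protocol does not say how they are produced. One possible approach, sketched here purely as an illustration, is for the resource host to append an expiry time and an HMAC signature to the URI and to reject requests once the expiry has passed. The query parameter names (`expires`, `sig`), the shared secret, and the assumption that the base URI carries no query string of its own are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit


def _signature(base_uri: str, expires_at: int, secret: bytes) -> str:
    """HMAC-SHA256 over the URI and its expiry time, hex encoded."""
    payload = f"{base_uri}|{expires_at}".encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def sign_uri(base_uri: str, secret: bytes, expires_at: int) -> str:
    """Return base_uri with expiry and signature parameters appended."""
    query = urlencode({"expires": expires_at, "sig": _signature(base_uri, expires_at, secret)})
    return f"{base_uri}?{query}"


def is_valid(signed_uri: str, secret: bytes, now: int) -> bool:
    """Accept the URI only if the signature matches and it has not expired."""
    parts = urlsplit(signed_uri)
    query = parse_qs(parts.query)
    try:
        expires_at = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    base_uri = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = _signature(base_uri, expires_at, secret)
    # Constant-time comparison avoids leaking how much of the signature matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")) and now < expires_at


# Times are plain integers (seconds) so the example is deterministic.
signed = sign_uri("https://example.com/grammars/pizza.grxml", b"shared-secret", expires_at=1000)
```

A speech server given `signed` could fetch the grammar without holding any standing credentials, and the link stops working once `now` reaches the expiry time.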

In some cases, the client may have a cookie containing a secret that is used to authorize access to the resource. In this case, the cookie may be passed to the speech server using cookie headers in a request. The server would then use this cookie when accessing resources required by that request.
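
As a rough server-side sketch of the cookie case, the function below attaches the cookie value received with a speech request to the outgoing fetch for a resource named by that request. It uses Python's standard `urllib.request`; the function name and the example URI are hypothetical, and how the cookie is extracted from the html-speech/1.0 request is left out.

```python
import urllib.request
from typing import Optional


def build_resource_request(resource_uri: str, cookie_header: Optional[str]) -> urllib.request.Request:
    """Prepare a fetch for a resource, forwarding the client's cookie if one was supplied.

    The cookie is attached to this single request object only, so it is not
    reused for resources belonging to other speech requests.
    """
    request = urllib.request.Request(resource_uri)
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    return request


# Hypothetical grammar URI and cookie value taken from the client's request.
req = build_resource_request("https://example.com/grammars/pizza.grxml", "session=abc123")
# The server would then call urllib.request.urlopen(req) to fetch the resource.
```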

Access to Retained Media

Through the use of certain headers, the client may request the server to retain a recording of the input media, and make this recording available at a URL. The server that holds the recording MAY secure this recording by using standard HTTP security mechanisms. It MAY authenticate the client using standard HTTP challenge/response. It MAY use TLS to encrypt the recording when transmitting it back to the client, and MAY also use TLS to authenticate the client. The server that holds a recording may also discard it after a reasonable period, as determined by the server.

Addendum to 5.1's LISTEN request

When there are no input media streams, and the Input-Waveform-URI header has not been specified, the recognizer cannot enter the listening state, and the LISTEN request will fail (4xx). When the recognizer is in the listening state and all input streams have ended, it automatically transitions to the idle state and issues a RECOGNITION-COMPLETE event with Completion-Cause set to 480 ("no-input-stream").
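
The two rules in this addendum can be read as a small state machine. The sketch below is one way to model them and is not a normative implementation: the class and method names are hypothetical, the 4xx failure is represented by a Python exception, and the case of a recognizer that is listening only on an Input-Waveform-URI is not modeled beyond allowing the LISTEN to succeed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"


class ListenError(Exception):
    """Stands in for a 4xx response to a LISTEN request."""


@dataclass
class Recognizer:
    open_streams: set = field(default_factory=set)
    state: State = State.IDLE
    events: list = field(default_factory=list)

    def listen(self, input_waveform_uri: Optional[str] = None) -> None:
        # Rule 1: with no input streams and no Input-Waveform-URI, LISTEN fails.
        if not self.open_streams and input_waveform_uri is None:
            raise ListenError("4xx: no input media stream and no Input-Waveform-URI")
        self.state = State.LISTENING

    def end_stream(self, stream_id: str) -> None:
        self.open_streams.discard(stream_id)
        # Rule 2: once every input stream has ended while listening,
        # return to idle and report completion cause 480.
        if self.state is State.LISTENING and not self.open_streams:
            self.state = State.IDLE
            self.events.append(
                ("RECOGNITION-COMPLETE", {"Completion-Cause": (480, "no-input-stream")})
            )


recognizer = Recognizer(open_streams={"mic-1", "mic-2"})
recognizer.listen()
recognizer.end_stream("mic-1")   # one stream still open, so still listening
state_after_first = recognizer.state
recognizer.end_stream("mic-2")   # last stream ended, so back to idle with an event
```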

Addendum to References

TLS: RFC 5246: The Transport Layer Security (TLS) Protocol Version 1.2 http://tools.ietf.org/html/rfc5246