Thoughts on RTP in the HTML Speech protocol from Robert Brown on 2011-06-14 (public-xg-htmlspeech@w3.org from June 2011)

From: Robert Brown <Robert.Brown@microsoft.com>
Date: Tue, 14 Jun 2011 01:23:49 +0000
To: "Milan Young (Nuance)" <Milan.Young@nuance.com>, "Satish Sampath (Google)" <satish@google.com>, "Glen Shires (gshires@google.com)" <gshires@google.com>, "Patrick Ehlen (AT&T)" <pehlen@attinteractive.com>, "Dan Burnett (Voxeo)" <dburnett@voxeo.com>, "Michael Johnston (AT&T)" <johnston@research.att.com>, "Marc Schroeder (DFKI)" <marc.schroeder@dfki.de>
CC: HTML Speech XG <public-xg-htmlspeech@w3.org>, Michael Bodell <mbodell@microsoft.com>
Message-ID: <113BCF28740AF44989BE7D3F84AE18DD1B135206@TK5EX14MBXC118.redmond.corp.microsoft.>

Protocol folks,

In last week's call it was suggested that we should also consider RTP (http://www.ietf.org/rfc/rfc3550.txt), given discussions happening in other groups. It's a good suggestion, and if we choose something other than RTP, we'll no doubt need to defend this position.

Here are some thoughts on the implications of using RTP instead of the simpler approach (http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html).

One potential objection to using RTP for HTML Speech is that our requirements are quite different from RTP's typical use case. RTP is typically used for telephone-like applications: strictly metered, low-latency, real-time transmission of audio between multiple humans engaged in a live conversation. UDP is used because packet loss is an acceptable trade-off for the low latency needed to maintain the illusion of being in the same room. It's often accompanied by RTCP, which helps monitor QoS and provides a participant roster for multi-party calls. HTML Speech, on the other hand, just needs to send audio over a WebSocket between one human and one machine as fast as it can. In some cases, this will be faster than real time (e.g. rendered audio from TTS). In others it'll be much slower than real time (e.g. on slow networks), but that's perfectly okay for many apps. In all cases, packet loss is an undesirable trade-off, because it affects SR accuracy and TTS fidelity.

In theory, these differences don't matter. RTP is allowed to be transported by a protocol other than UDP, and to be used in applications other than telephony. Indeed, the RFC states "RTP may be used with other suitable underlying network or transport protocols", and "While RTP is primarily designed to satisfy the needs of multi-participant multimedia conferences, it is not limited to that particular application". So we *could* just send each RTP message in the body of a WebSockets binary message, and we wouldn't be violating the *spirit* of the protocol.
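
To make the "RTP in a WebSockets binary message" idea concrete, here is a rough Python sketch of building one RTP packet with only the 12-byte fixed header from RFC 3550. The function name, the payload type 0, the SSRC value and the 160-byte payload are all made up for illustration. The resulting bytes are what would go into the body of a single binary message.

```python
import struct

RTP_VERSION = 2  # the version field is always 2 for RFC 3550 RTP


def pack_rtp(payload_type, seq, timestamp, ssrc, payload, marker=False):
    """Build an RTP packet: 12-byte fixed header, no CSRCs, padding or extension.

    Hypothetical helper for illustration; not part of any proposal.
    """
    byte0 = RTP_VERSION << 6  # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit payload type
    header = struct.pack(
        "!BBHII",                 # network byte order: 1+1+2+4+4 = 12 bytes
        byte0,
        byte1,
        seq & 0xFFFF,             # 16-bit sequence number
        timestamp & 0xFFFFFFFF,   # 32-bit media timestamp
        ssrc & 0xFFFFFFFF,        # 32-bit synchronization source
    )
    return header + payload


# Illustrative values only: one chunk of 160 payload bytes.
packet = pack_rtp(payload_type=0, seq=1, timestamp=160,
                  ssrc=0x1234ABCD, payload=b"\x00" * 160)
```

Every message carries those 12 bytes of header whether or not the receiver has any use for the fields.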

However, another more serious objection is that RTP is a more complicated design than we need. For the requirements we've identified, aside from the encoded media, our header requirements are minimal (a mux-channel number, and possibly a timestamp). RTP, by contrast, has a number of header fields that implementers would need to support to some degree, although we don't currently know of any use for them in the HTML Speech scenarios. For example: optional padding, an optional header extension, an optional list of contributing sources (for a stream that a mixer assembled from several sources), and a marker bit whose meaning is left to the profile. Implementations would need to decide what to do with each of these. It may be okay to just throw an error if any of the optional settings are used. But then one has to ask what additional benefit we're getting from this protocol.

There's also a sequence number in each message, which is useful if you're using UDP, where packets may be dropped or delivered out of order, but is redundant if transported over WebSockets, which has guaranteed in-order delivery.

Multiplexing is also problematic. RTP has no multiplexing of its own and relies on the underlying transport for this, i.e. the IP address and port of the UDP packet. There's no equivalent mechanism in WebSockets, which has no native support for multiplexing. One option is to use RTP's SSRC field (synchronization source): since each stream has its own SSRC, this should serve to mux the streams. The RFC specifically advises against this: "Separate audio and video streams SHOULD NOT be carried in a single RTP session and demultiplexed based on the payload type or SSRC fields." However, the reasons given for this are either outside the scope of our scenarios (mixing), or not applicable because we're using a transport layered over TCP rather than UDP (i.e. guaranteed in-order delivery). The other option is to insert a channel number ahead of each RTP packet, at which point the RTP header becomes completely redundant.
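
As a rough illustration of the difference in receiver-side work, here is a Python sketch that parses the RTP fixed header and throws an error on every setting we have no use for, next to a parser for a minimal header. The minimal layout (1-byte channel number, 4-byte timestamp) is purely an assumption for this example and is not taken from the basic-approach draft. The function names are hypothetical too, and rejecting the marker bit is just one possible policy.

```python
import struct


def parse_rtp_strict(packet):
    """Parse the 12-byte RTP fixed header, rejecting features we don't use."""
    if len(packet) < 12:
        raise ValueError("shorter than the RTP fixed header")
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if byte0 >> 6 != 2:
        raise ValueError("unsupported RTP version")
    if byte0 & 0x20:
        raise ValueError("padding not supported")
    if byte0 & 0x10:
        raise ValueError("header extension not supported")
    if byte0 & 0x0F:
        raise ValueError("CSRC list not supported")
    if byte1 & 0x80:
        raise ValueError("marker bit not supported")
    return {
        "payload_type": byte1 & 0x7F,
        "seq": seq,              # redundant over an in-order transport
        "timestamp": timestamp,
        "ssrc": ssrc,            # would double as the mux channel
        "payload": packet[12:],
    }


def pack_minimal(channel, timestamp, payload):
    """Assumed minimal framing: 1-byte channel + 4-byte timestamp + media."""
    return struct.pack("!BI", channel, timestamp) + payload


def parse_minimal(message):
    """Inverse of pack_minimal; returns (channel, timestamp, payload)."""
    if len(message) < 5:
        raise ValueError("shorter than the minimal header")
    channel, timestamp = struct.unpack("!BI", message[:5])
    return channel, timestamp, message[5:]


# The same media chunk, framed both ways (values are illustrative).
rtp = struct.pack("!BBHII", 0x80, 0, 7, 1120, 0xCAFE) + b"audio"
parsed = parse_rtp_strict(rtp)
msg = pack_minimal(3, 1120, b"audio")
```

Most of the RTP parser is spent rejecting things, and what survives is a channel identifier (the SSRC), a timestamp and the payload, which is exactly what the minimal header carries.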

Bottom line: Although RTP would work, it would result in added complexity for no apparent benefit. Our requirements are apparently much simpler than the problems RTP solves.

What do you think? Am I missing something?

Thx.

Received on Tuesday, 14 June 2011 01:24:31 UTC