RE: Thoughts on RTP in the HTML Speech protocol

These arguments make sense to me, and I agree with your conclusion.

 

I see timeline as another argument in favor of the "basic approach":
native RTP from within the browser is new technology, whereas
WebSockets are relatively mature.  (By "native" RTP I mean a
direct-connect approach, not RTP tunneled within a WebSocket or
similar.  I agree with Robert that RTP within WebSocket doesn't make
sense for our use case.)

 

The one counter-argument I see in favor of RTP is that it's an
established protocol implemented by the MRCP engines.  Inventing our own
audio streaming scheme, no matter how simple, will probably generate
more aggregate work.

 

 

 

________________________________

From: Robert Brown [mailto:Robert.Brown@microsoft.com] 
Sent: Monday, June 13, 2011 6:24 PM
To: Young, Milan; Satish Sampath (Google); Glen Shires
(gshires@google.com); Patrick Ehlen (AT&T); Dan Burnett (Voxeo); Michael
Johnston (AT&T); Marc Schroeder (DFKI)
Cc: HTML Speech XG; Michael Bodell
Subject: Thoughts on RTP in the HTML Speech protocol

 

Protocol folks, 

 

In last week's call it was suggested that we should also consider RTP
(http://www.ietf.org/rfc/rfc3550.txt), given discussions happening in
other groups.  It's a good suggestion, and if we choose something other
than RTP, we'll no doubt need to defend this position.

 

Here are some thoughts on the implications of using RTP instead of the
simpler approach
(http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/att-0008/speech-protocol-basic-approach-01.html).

 

One potential objection to using RTP for HTML Speech is that our
requirements are quite different from RTP's typical use case.  RTP is
typically used for telephone-like applications: strictly metered,
low-latency, real-time transmission of audio between multiple humans
engaged in a live conversation.  UDP is used because packet loss is an
acceptable trade-off for the low latency needed to maintain the
illusion of being in the same room.  It's often accompanied by RTCP,
which helps monitor QoS and provides a participant roster for
multi-party calls.  HTML Speech, on the other hand, just needs to send
audio over a WebSocket between one human and one machine as fast as it
can.  In some cases this will be faster than real time (e.g. rendered
audio from TTS); in others it will be much slower than real time (e.g.
on slow networks), but that's perfectly acceptable for many apps.  In
all cases, packet loss is an undesirable trade-off, because it affects
SR accuracy and TTS fidelity.
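
To make the contrast concrete, here is a minimal sketch of what the
simpler WebSocket path looks like from the browser side.  The endpoint
URL, subprotocol name, and framing are illustrative assumptions on my
part, not something from the draft:

    // Hypothetical sketch: stream encoded audio chunks to a speech
    // service over a WebSocket as fast as the network allows.
    const socket = new WebSocket("wss://speech.example.com/reco", "html-speech");
    socket.binaryType = "arraybuffer";

    function sendAudioChunk(encodedAudio: ArrayBuffer): void {
      // No pacing, jitter buffering, or loss concealment is needed:
      // WebSocket (over TCP) already guarantees ordered, lossless delivery.
      socket.send(encodedAudio);
    }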

 

In theory, these differences don't matter.  RTP is allowed to be
carried over transports other than UDP, and used in applications other
than telephony.  Indeed, the RFC states "RTP may be used with other
suitable underlying network or transport protocols", and "While RTP is
primarily designed to satisfy the needs of multi-participant multimedia
conferences, it is not limited to that particular application".  So we
*could* just send each RTP packet in the body of a WebSocket binary
message, and we wouldn't be violating the *spirit* of the protocol.
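
For illustration only, this is roughly what wrapping RTP in a WebSocket
binary message would involve; the payload type and SSRC values are
arbitrary placeholders, not choices anyone has proposed:

    // Build an RFC 3550 fixed header (12 bytes) plus payload and send it
    // as a single WebSocket binary message.
    function sendRtpOverWebSocket(
      socket: WebSocket,
      payload: Uint8Array,
      sequenceNumber: number,
      timestamp: number,
      ssrc: number
    ): void {
      const packet = new Uint8Array(12 + payload.length);
      const view = new DataView(packet.buffer);
      view.setUint8(0, 0x80);             // V=2, P=0, X=0, CC=0
      view.setUint8(1, 96);               // M=0, PT=96 (dynamic payload type)
      view.setUint16(2, sequenceNumber);  // redundant over an ordered transport
      view.setUint32(4, timestamp);
      view.setUint32(8, ssrc);            // would double as a mux channel id
      packet.set(payload, 12);
      socket.send(packet);
    }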

 

However, a more serious objection is that RTP is a more complicated
design than we need.  For the requirements we've identified, aside from
the encoded media, our header requirements are minimal (a mux-channel
number, and possibly a timestamp).  RTP, by contrast, has a number of
header fields that implementers would need to support to some degree,
even though we don't currently know of any use for them in the HTML
Speech scenarios: optional padding, an optional header extension, an
optional marker bit, and the optional designation of multiple
contributing sources from which the media stream was selected.
Implementations would need to decide what to do with each of these.  It
may be okay to just throw an error if any of the optional settings are
used, but then one has to ask what additional benefit we're getting
from this protocol.

There's also a sequence number in each message, which is useful over
UDP, where packets may be dropped or delivered out of order, but is
redundant over WebSockets, which has guaranteed in-order delivery.

Multiplexing is also problematic.  RTP has no multiplexing of its own
and relies on the underlying transport to provide it, i.e. the IP &
port of the UDP packet; WebSockets has no equivalent mechanism.  One
option is to use RTP's SSRC (synchronization source) field: since each
stream has its own SSRC, it could serve to mux the streams.  The RFC
specifically states: "Separate audio and video streams SHOULD NOT be
carried in a single RTP session and demultiplexed based on the payload
type or SSRC fields."  However, the reasons given for this are either
outside the scope of our scenarios (mixing), or not applicable because
we're using a transport layered over TCP rather than UDP (i.e.
guaranteed in-order delivery).  The other option is to insert a channel
number ahead of each RTP packet, at which point the RTP header becomes
completely redundant.
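
By comparison, a minimal framing along the lines of the basic approach
might look like the sketch below.  The field names and sizes (one-byte
channel id, four-byte timestamp) are my own illustrative assumptions,
not the draft's wire format:

    // Hypothetical minimal framing: mux-channel number plus optional
    // timestamp prepended to the encoded media, one WebSocket binary
    // message per chunk.
    function sendChannelFrame(
      socket: WebSocket,
      channel: number,
      timestampMs: number,
      payload: Uint8Array
    ): void {
      const frame = new Uint8Array(5 + payload.length);
      const view = new DataView(frame.buffer);
      view.setUint8(0, channel);       // mux-channel number
      view.setUint32(1, timestampMs);  // optional timestamp
      frame.set(payload, 5);
      socket.send(frame);
    }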

 

Bottom line: Although RTP would work, it would add complexity for no
apparent benefit.  Our requirements are much simpler than the problems
RTP solves.

 

What do you think?  Am I missing something?

 

Thx.
