The Bitstream Fallacy - an explanation from Harald Alvestrand on 2012-03-15 (public-media-capture@w3.org from March 2012)

From: Harald Alvestrand <harald@alvestrand.no>
Date: Thu, 15 Mar 2012 18:08:46 +0100
To: "public-media-capture@w3.org" <public-media-capture@w3.org>
Message-ID: <4F62221E.1040107@alvestrand.no>

Some time ago, Travis Leithead asked me what I meant by "the bitstream
fallacy" when discussing APIs. I decided to sit down and write some text
about it.

I hope the below may be informative. If not, it may be entertaining.

Harald
----------------------------------------------------------------------------------------------------------------
*When discussing the various APIs involved in audio and video in a Web
browser, we frequently hear statements that, if taken at face value,
translate to "here we pass the data stream from one object to the other".

Well .... no. We don't.

It's easy to fall into that trap, and especially easy to imagine it when
we're using a programming model that looks like "pipes connecting
nodes", with some of the nodes doing things like showing video on a
screen, fetching audio from a file, or passing a signal across a network.

One imagines that each step in the process involves a stream of bits and
bytes flowing between our objects, carrying our sound and pictures
through all the various steps we specify for them.

But it isn't that way in reality, and we'd better be aware of it, even
though our APIs will weave an ever more complete illusion - and even
though that illusion is the one we program against using that API.

Consider a really simple scenario, from a videoconferencing situation: A
video camera on your computer records your face and a microphone records
your voice; it's passed to my computer and presented on my screen, and
I've chosen to record the session to a file. The bytes flow; what could
be simpler?

Except that this is not what's happening.

Look at the wire between the camera and the computer. It's a USB cable,
carrying a complex negotiation protocol, which, inside it, carries one
picture at a time from the camera's CCD sensor to your computer's memory
buffer, in a format called YUV2 - which is very easy to decode, but
takes a *lot* of space.
In your computer, the receiver driver decodes the data and formats it in
a suitable form for your memory buffer - which is NOT YUV2; it's an
internal format.
Every 40 msec, a timing signal comes from the camera, signalling that a
picture is complete; the driver switches its writing to another memory
buffer, leaving the first buffer to the display handlers.
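
The buffer switch described above can be sketched as a two-buffer ("ping-pong") scheme. This is only an illustration, not how any particular driver works: the class name, the 640x480 resolution and the 2 bytes per pixel are assumptions made for the example; only the 40 msec frame interval comes from the text.

```python
# Illustrative sketch only: a two-buffer ("ping-pong") capture scheme.
# FRAME_WIDTH, FRAME_HEIGHT and BYTES_PER_PIXEL are assumed values,
# not taken from the original message.

FRAME_INTERVAL_MS = 40   # from the text: one picture every 40 msec
FRAME_WIDTH = 640        # assumption
FRAME_HEIGHT = 480       # assumption
BYTES_PER_PIXEL = 2      # assumption: 16 bits per pixel in the buffer


class DoubleBuffer:
    """Two frame buffers: the driver fills one while display handlers read the other."""

    def __init__(self, frame_size: int) -> None:
        self._buffers = [bytearray(frame_size), bytearray(frame_size)]
        self._write_index = 0

    @property
    def write_buffer(self) -> bytearray:
        """Buffer the driver is currently filling."""
        return self._buffers[self._write_index]

    @property
    def read_buffer(self) -> bytearray:
        """Buffer holding the most recently completed picture."""
        return self._buffers[1 - self._write_index]

    def frame_complete(self) -> bytearray:
        """Called on the camera's timing signal: swap roles, return the finished frame."""
        self._write_index = 1 - self._write_index
        return self.read_buffer


frame_size = FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL
frames_per_second = 1000 // FRAME_INTERVAL_MS
raw_bytes_per_second = frame_size * frames_per_second
```

Under these assumed numbers, uncompressed video comes to roughly 15 megabytes per second, which gives a sense of why the frames get handed around as whole buffers and compressed before they go anywhere near the network.
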
One display handler does a quick transpose of the buffer to a buffer on
your graphics card, which will then rescale the image using specialized
hardware to display in a corner of your screen as your "self-image".
Another display handler passes the buffer content to a codec encoding
routine, which will carefully compare the image to the previous image
transferred, and pick the most efficient mechanism for signalling the
differences between the image and the previous one - packing this into a
series of smaller "packet" buffers, and equipping each packet with a
header that says where it's coming from, where it's supposed to go, and
a timestamp that says which picture frame it belongs to.
Once the encoding process is complete, control of the packet buffers is
handed to the network card, which takes care of sending them across the
network - well before the codec is asked to start encoding the next frame.
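
As a rough sketch of the packetization step, the encoded picture can be cut into small buffers, each carrying a header with source, destination and frame timestamp. The `Packet` type, the `packetize` function, the sequence number field and the 1200-byte payload limit are all assumptions for illustration; real media transport formats carry considerably more than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet layout: a small header plus a slice of the encoded frame."""
    source: str           # where the packet is coming from
    destination: str      # where it is supposed to go
    frame_timestamp: int  # which picture frame the payload belongs to
    sequence: int         # assumption: a running counter so the receiver can spot gaps
    payload: bytes


def packetize(encoded_frame: bytes, source: str, destination: str,
              frame_timestamp: int, first_sequence: int,
              max_payload: int = 1200) -> List[Packet]:
    """Split one encoded frame into packets that all share its timestamp."""
    packets: List[Packet] = []
    for offset in range(0, len(encoded_frame), max_payload):
        packets.append(Packet(
            source=source,
            destination=destination,
            frame_timestamp=frame_timestamp,
            sequence=first_sequence + len(packets),
            payload=encoded_frame[offset:offset + max_payload],
        ))
    return packets
```

Note that what leaves this step is a list of discrete, self-describing buffers tied to one frame, not a continuous flow of bytes.
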
The connection between the computers is not a simple bit pipe either.
Sometimes packets get lost; the logic in the pipe has to deal with
figuring out whether the loss matters, and asking the sender to do
something about it if it mattered - either send it again (causing delay)
or decide to send the next picture in such a way that it can be decoded
without reference to the lost packet; the latter option requires the
network component to reach back into the codec component and tell it to
behave differently than it otherwise would.
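
The loss-handling choice above might look something like the following sketch. Everything here is hypothetical: the `Encoder` class, the `handle_loss` function and the idea of comparing a round-trip time against a delay budget are assumptions chosen to illustrate the two options (resend, or ask the codec for a picture that does not depend on the lost data), not a description of any real protocol.

```python
from typing import Dict, Optional


class Encoder:
    """Stand-in for the codec: normally sends differences, unless told otherwise."""

    def __init__(self) -> None:
        self._independent_requested = False

    def request_independent_frame(self) -> None:
        """The network component reaching back into the codec."""
        self._independent_requested = True

    def next_frame_kind(self) -> str:
        """Kind of picture the encoder will produce next; the request is one-shot."""
        kind = "independent" if self._independent_requested else "delta"
        self._independent_requested = False
        return kind


def handle_loss(lost_sequence: int, sent: Dict[int, bytes], encoder: Encoder,
                loss_matters: bool, round_trip_ms: int,
                delay_budget_ms: int) -> Optional[bytes]:
    """Return the payload to resend, or None if nothing is resent."""
    if not loss_matters:
        return None  # the receiver can cope without this packet
    payload = sent.get(lost_sequence)
    if payload is not None and round_trip_ms <= delay_budget_ms:
        return payload  # resending arrives in time, at the cost of some delay
    # Too late (or no longer buffered): change what the codec does next instead.
    encoder.request_independent_frame()
    return None
```

The second branch is the interesting one for the argument here: a component that nominally sits "downstream" alters the behaviour of one that sits "upstream".
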

Once the packets arrive at my machine, the inverse process happens: The
stream of packets gets decoded into a memory buffer, my machine's
display functions blit the buffer into my graphics card memory for
transformation and display, and somewhere along the way - which may be
either in the decoder or at the memory buffer - some function picks up
the incoming stream of data, possibly reencoding it into another codec's
format, decorates it with the necessary markers for saying which pieces
belong where (often the Matroska or AVI container file formats),
combines it with the similarly-processed stream of data from the audio,
and writes it to a file.
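
A very reduced sketch of the "combine with audio" step: interleaving two already time-ordered sequences of chunks by timestamp before writing them out. The `(timestamp_ms, kind, payload)` tuple layout and the `interleave` function are assumptions for illustration; real container formats such as the ones named above add their own headers, indexes and framing on top of this.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical chunk layout: (timestamp in msec, "video" or "audio", encoded bytes)
Chunk = Tuple[int, str, bytes]


def interleave(video: Iterable[Chunk], audio: Iterable[Chunk]) -> List[Chunk]:
    """Merge two timestamp-ordered chunk sequences into one timestamp-ordered list."""
    return list(heapq.merge(video, audio, key=lambda chunk: chunk[0]))


video_chunks = [(0, "video", b"v0"), (40, "video", b"v1")]
audio_chunks = [(0, "audio", b"a0"), (20, "audio", b"a1"), (40, "audio", b"a2")]
merged = interleave(video_chunks, audio_chunks)
```

Only after this marking and interleaving is there something that can reasonably be called a byte stream again.
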

There are a few steps along this very simplified picture of a processing
pipeline where we can talk about a stream of bytes: At the USB cable and
at the file interface.
At all other points on the processing pipeline, there is complex
interaction, timers, buffers, packets and logic that completely confound
the "stream".

And I've completely ignored the process of negotiation among the various
parties that precedes the transmission, and sometimes renegotiates in
the middle; when I resize my display window to be able to see your face
better, it's entirely possible that signalling will go all the way back
to your camera and tell it to change its resolution - without any
intervention from the controllers.

We need to view this process as a pipeline because that's a model we can
usefully deal with - and the software beneath the surface is capable of
transforming that model into a useful set of configurations of
components to perform the functions we want.

But we should not forget that it's only a useful model. It's not the truth.
*

Received on Thursday, 15 March 2012 17:09:35 UTC