
The Bitstream Fallacy - an explanation

From: Harald Alvestrand <harald@alvestrand.no>
Date: Thu, 15 Mar 2012 18:08:46 +0100
Message-ID: <4F62221E.1040107@alvestrand.no>
To: "public-media-capture@w3.org" <public-media-capture@w3.org>
Some time ago, Travis Leithead asked me what I meant by "the bitstream 
fallacy" when discussing APIs. I decided to sit down and write some text 
about it.

I hope the below may be informative. If not, it may be entertaining.

When discussing the various APIs involved in audio and video in a Web 
browser, we frequently hear statements that, if taken at face value, 
translate to "here we pass the data stream from one object to the other".

Well .... no. We don't.

It's easy to fall into that trap, and especially easy to imagine it when 
we're using a programming model that looks like "pipes connecting 
nodes", with some of the nodes doing things like showing video on a 
screen, fetching audio from a file, or passing a signal across a network.

One imagines that each step in the process involves a stream of bits and 
bytes flowing between our objects, carrying our sound and pictures 
through all the various steps we specify for them.

But it isn't that way in reality, and we'd better be aware of it, even 
though our APIs will weave an ever more complete illusion - and even 
though that illusion is the one we program against using that API.

Consider a really simple scenario, from a videoconferencing situation: A 
video camera on your computer records your face and a microphone records 
your voice; it's passed to my computer and presented on my screen, and 
I've chosen to record the session to a file. The bytes flow; what could 
be simpler?

Except that this is not what's happening.

Look at the wire between the camera and the computer. It's a USB cable, 
carrying a complex negotiation protocol, which, inside it, carries one 
picture at a time from the camera's CCD sensor to your computer's memory 
buffer, in a format called YUY2 - which is very easy to decode, but 
takes a *lot* of space.
In your computer, the receiver driver decodes the data and formats it in 
a suitable form for your memory buffer - which is NOT YUY2; it's an 
internal format.
Every 40 msec, a timing signal comes from the camera, signalling that a 
picture is complete; the driver switches its writing to another memory 
buffer, leaving the first buffer to the display handlers.
One display handler does a quick transpose of the buffer to a buffer on 
your graphics card, which will then rescale the image using specialized 
hardware to display in a corner of your screen as your "self-image".
Another display handler passes the buffer content to a codec encoding 
routine, which will carefully compare the image to the previous image 
transferred, and pick the most efficient mechanism for signalling the 
differences between the image and the previous one - packing this into a 
series of smaller "packet" buffers, and equipping each packet with a 
header that says where it's coming from, where it's supposed to go, and 
a timestamp that says which picture frame it belongs to.
Once the encoding process is complete, control of the packet buffers is 
handed to the network card, which takes care of sending them across the 
network - well before the codec is asked to start encoding the next frame.
The connection between the computers is not a simple bit pipe either. 
Sometimes packets get lost; the logic in the pipe has to deal with 
figuring out whether the loss matters, and asking the sender to do 
something about it if it mattered - either send it again (causing delay) 
or decide to send the next picture in such a way that it can be decoded 
without reference to the lost packet; the latter option requires the 
network component to reach back into the codec component and tell it to 
behave differently than it otherwise would.
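
The packetizing and loss-handling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real codec or network stack does it: the names Packet, packetize, missing_sequence_numbers and recovery_action, the sequence-number field, the 1200-byte payload size and the delay-budget rule are all hypothetical and do not come from any specification.

```python
from dataclasses import dataclass
from typing import List

MAX_PAYLOAD = 1200  # assumed payload size in bytes; an illustrative value only


@dataclass
class Packet:
    source: str        # where the packet is coming from
    destination: str   # where it is supposed to go
    frame_ts_ms: int   # timestamp saying which picture frame it belongs to
    seq: int           # hypothetical sequence number, used to detect gaps
    payload: bytes


def packetize(encoded_frame: bytes, source: str, destination: str,
              frame_ts_ms: int, first_seq: int) -> List[Packet]:
    """Split one encoded frame into smaller packets, each with its own header."""
    chunks = [encoded_frame[i:i + MAX_PAYLOAD]
              for i in range(0, len(encoded_frame), MAX_PAYLOAD)]
    return [Packet(source, destination, frame_ts_ms, first_seq + n, chunk)
            for n, chunk in enumerate(chunks)]


def missing_sequence_numbers(received: List[Packet],
                             first_seq: int, last_seq: int) -> List[int]:
    """Return the sequence numbers in [first_seq, last_seq] that never arrived."""
    seen = {p.seq for p in received}
    return [s for s in range(first_seq, last_seq + 1) if s not in seen]


def recovery_action(missing: List[int], round_trip_ms: int,
                    delay_budget_ms: int) -> str:
    """Pick a reaction to loss (assumed policy, for illustration).

    "retransmit" asks the sender to send the packet again, which costs delay.
    "request_keyframe" asks the sender's codec to encode the next picture
    so that it can be decoded without reference to the lost packet.
    """
    if not missing:
        return "none"
    if round_trip_ms <= delay_budget_ms:
        return "retransmit"
    return "request_keyframe"
```

Note that the "request_keyframe" branch is exactly the place where the network component reaches back into the codec: the decision is made at the receiving end of the "pipe", but it changes what the encoder at the sending end does next.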

Once the packets arrive at my machine, the inverse process happens: The 
stream of packets gets decoded into a memory buffer, my machine's 
display functions blit the buffer into my graphics card memory for 
transformation and display, and somewhere along the way - which may be 
either in the decoder or at the memory buffer - some function picks up 
the incoming stream of data, possibly reencoding it into another codec's 
format, decorates it with the necessary markers for saying which pieces 
belong where (often the Matroska or AVI container file formats), 
combines it with the similarly-processed stream of data from the audio, 
and writes it to a file.
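
To make the recording step a little more concrete, here is a rough Python sketch of combining the decoded video and audio into one sequence and tagging each piece. The record layout (a one-byte track id, a 32-bit timestamp and a 32-bit length) is a toy format invented for this illustration; it is not Matroska or AVI, and the names Sample, interleave and serialize are hypothetical.

```python
import heapq
import struct
from dataclasses import dataclass
from typing import Iterable, Iterator

TRACK_IDS = {"video": 1, "audio": 2}  # assumed numbering for the toy format


@dataclass(frozen=True)
class Sample:
    ts_ms: int   # presentation timestamp in milliseconds
    track: str   # "video" or "audio": the marker saying which piece belongs where
    data: bytes


def interleave(video: Iterable[Sample],
               audio: Iterable[Sample]) -> Iterator[Sample]:
    """Merge two tracks, each already ordered by timestamp, into one sequence."""
    return heapq.merge(video, audio, key=lambda s: s.ts_ms)


def serialize(samples: Iterable[Sample]) -> bytes:
    """Write each sample as a 9-byte header followed by its payload."""
    out = bytearray()
    for s in samples:
        # Header: track id (1 byte), timestamp (4 bytes), payload length (4 bytes).
        out += struct.pack(">BII", TRACK_IDS[s.track], s.ts_ms, len(s.data))
        out += s.data
    return bytes(out)
```

Only the bytes returned by serialize resemble a "stream" in the everyday sense; everything before that point is timestamps, buffers and bookkeeping.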

There are a few steps along this very simplified picture of a processing 
pipeline where we can talk about a stream of bytes: At the USB cable and 
at the file interface.
At all other points on the processing pipeline, there is complex 
interaction, timers, buffers, packets and logic that completely confound 
the "stream".

And I've completely ignored the process of negotiation among the various 
parties that precedes the transmission, and sometimes renegotiates in 
the middle; when I resize my display window to be able to see your face 
better, it's entirely possible that signalling will go all the way back 
to your camera and tell it to change its resolution - without any 
intervention from the controllers.

We need to view this process as a pipeline because that's a model we can 
usefully deal with - and the software beneath the surface is capable of 
transforming that model into a useful set of configurations of 
components to perform the functions we want.

But we should not forget that it's only a useful model. It's not the truth.
Received on Thursday, 15 March 2012 17:09:35 UTC
