- From: Harald Alvestrand <harald@alvestrand.no>
- Date: Thu, 15 Mar 2012 18:08:46 +0100
- To: "public-media-capture@w3.org" <public-media-capture@w3.org>
- Message-ID: <4F62221E.1040107@alvestrand.no>
Some time ago, Travis Leithead asked me what I meant by "the bitstream fallacy" when discussing APIs. I decided to sit down and write some text about it. I hope the below may be informative. If not, it may be entertaining.

Harald

----------------------------------------------------------------------------------------------------------------

When discussing the various APIs involved in audio and video in a Web browser, we frequently hear statements that, if taken at face value, translate to "here we pass the data stream from one object to the other". Well .... no. We don't.

It's easy to fall into that trap, and especially easy to imagine it when we're using a programming model that looks like "pipes connecting nodes", with some of the nodes doing things like showing video on a screen, fetching audio from a file, or passing a signal across a network. One imagines that each step in the process involves a stream of bits and bytes flowing between our objects, carrying our sound and pictures through all the various steps we specify for them. But it isn't that way in reality, and we'd better be aware of it, even though our APIs will weave an ever more complete illusion - and even though that illusion is the one we program against using that API.

Consider a really simple scenario, from a videoconferencing situation: a video camera on your computer records your face and a microphone records your voice; the sound and picture are passed to my computer and presented on my screen, and I've chosen to record the session to a file. The bytes flow; what could be simpler?

Except that this is not what's happening. Look at the wire between the camera and the computer. It's a USB cable, carrying a complex negotiation protocol, which, inside it, carries one picture at a time from the camera's CCD sensor to your computer's memory buffer, in a format called YUV2 - which is very easy to decode, but takes a *lot* of space.

In your computer, the receiver driver decodes the data and formats it in a suitable form for your memory buffer - which is NOT YUV2; it's an internal format. Every 40 msec, a timing signal comes from the camera, signalling that a picture is complete; the driver switches its writing to another memory buffer, leaving the first buffer to the display handlers. One display handler does a quick transpose of the buffer to a buffer on your graphics card, which will then rescale the image using specialized hardware to display in a corner of your screen as your "self-image".

Another display handler passes the buffer content to a codec encoding routine, which will carefully compare the image to the previous image transferred, and pick the most efficient mechanism for signalling the differences between the image and the previous one - packing this into a series of smaller "packet" buffers, and equipping each packet with a header that says where it's coming from, where it's supposed to go, and a timestamp that says which picture frame it belongs to. Once the encoding process is complete, control of the packet buffers is handed to the network card, which takes care of sending them across the network - well before the codec is asked to start encoding the next frame.

The connection between the computers is not a simple bit pipe either.
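(As an aside, before following those packets across the network: here is roughly what "equipping each packet with a header" amounts to, sketched in TypeScript for concreteness. The field names are illustrative, loosely in the spirit of an RTP-style header, and not any particular wire format.)

    // Illustrative only: a per-packet header carrying the information
    // described above - where the data comes from, where it sits in the
    // stream, and which captured frame the payload belongs to.
    interface PacketHeader {
      sourceId: number;           // which media source this packet came from
      sequenceNumber: number;     // position of this packet in the stream
      timestamp: number;          // which picture frame the payload belongs to
      lastPacketOfFrame: boolean; // lets the receiver know the frame is complete
    }

    interface Packet {
      header: PacketHeader;
      payload: Uint8Array;
    }

    // Split one encoded frame into packets small enough for the network path.
    function packetize(encodedFrame: Uint8Array, sourceId: number,
                       timestamp: number, firstSeq: number,
                       maxPayload = 1200): Packet[] {
      const packets: Packet[] = [];
      for (let offset = 0; offset < encodedFrame.length; offset += maxPayload) {
        packets.push({
          header: {
            sourceId,
            sequenceNumber: firstSeq + packets.length,
            timestamp,
            lastPacketOfFrame: offset + maxPayload >= encodedFrame.length,
          },
          payload: encodedFrame.subarray(offset, offset + maxPayload),
        });
      }
      return packets;
    }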
Sometimes packets get lost; the logic in the pipe has to deal with figuring out whether the loss matters, and asking the sender to do something about it if it did - either send the packet again (causing delay) or decide to send the next picture in such a way that it can be decoded without reference to the lost packet; the latter requires the network component to reach back into the codec component and tell it to behave differently than it otherwise would.

Once the packets arrive at my machine, the inverse process happens: the stream of packets gets decoded into a memory buffer, my machine's display functions blit the buffer into my graphics card memory for transformation and display, and somewhere along the way - which may be either in the decoder or at the memory buffer - some function picks up the incoming stream of data, possibly re-encoding it into another codec's format, decorates it with the necessary markers for saying which pieces belong where (often the Matroska or AVI container file formats), combines it with the similarly-processed stream of data from the audio, and writes it to a file.

There are a few steps along this very simplified picture of a processing pipeline where we can talk about a stream of bytes: at the USB cable and at the file interface. At all other points on the processing pipeline, there is complex interaction, timers, buffers, packets and logic that completely confound the "stream". And I've completely ignored the process of negotiation among the various parties that precedes the transmission, and sometimes renegotiates in the middle; when I resize my display window to be able to see your face better, it's entirely possible that signalling will go all the way back to your camera and tell it to change its resolution - without any intervention from the controllers.

We need to view this process as a pipeline because that's a model we can usefully deal with - and the software beneath the surface is capable of transforming that model into a useful set of configurations of components to perform the functions we want. But we should not forget that it's only a useful model. It's not the truth.
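(For concreteness, here is roughly what that pipeline model looks like from the JavaScript side. The sketch uses the getUserMedia and PeerConnection surface as it has since taken shape - this note predates the finished spec - so treat the exact names as illustrative; the point is how little of the machinery above is visible from here.)

    // The "pipes connecting nodes" view: get a stream from the camera and
    // microphone, plug it into a local <video> element, and plug the same
    // stream into the network. Three "connect A to B" steps; none of the
    // buffers, timers, codecs or feedback loops described above show up.
    async function startCall(selfView: HTMLVideoElement, pc: RTCPeerConnection) {
      const stream = await navigator.mediaDevices.getUserMedia({
        video: true,
        audio: true,
      });
      selfView.srcObject = stream;   // "show my own picture"
      await selfView.play();
      for (const track of stream.getTracks()) {
        pc.addTrack(track, stream);  // "send it to the other side"
      }
    }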
Received on Thursday, 15 March 2012 17:09:35 UTC