
Re: video use-case

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Wed, 15 Oct 2008 05:05:11 +1100
Message-ID: <2c0e02830810141105i22a40a94md7f82d42402a23b6@mail.gmail.com>
To: "Raphaël Troncy" <Raphael.Troncy@cwi.nl>
Cc: "Yves Lafon" <ylafon@w3.org>, "Jack Jansen" <Jack.Jansen@cwi.nl>, "Yannick Prié" <yannick.prie@liris.cnrs.fr>, "Media Fragment" <public-media-fragment@w3.org>

Hi Raphael,

On Tue, Oct 14, 2008 at 11:07 PM, Raphaël Troncy <Raphael.Troncy@cwi.nl> wrote:
>> The biggest problem here being that
>> time is inherently inaccurate (for being essentially a floating point
>> number) while bytes are accurate (for being an integer). So, if you
>> are asking for times 1:23-1:45.32 and then 1:45.33-1:56 in two
>> fragment requests, it is somewhat impossible for the Web proxy to know
>> whether that is enough data to compose 1:23-1:56 or whether it has
>> accidentally missed or duplicated a few dozen bytes because they fell
>> into the gap between the two segments because the time resolution
>> cannot be made completely accurate for media resources.
>
> I (and others) did propose a while ago an internal representation of time
> using the least common multiple of the usual sound sample rates (96000
> and its sub-multiples, or 44100 and its sub-multiples) and video frame rates
> (30, 25, 24), which is 14112000. This integer then defines a universal common
> sample rate (i.e. 14112000 ticks correspond to 1 second), and any temporal
> point in audio-visual content can be represented as an integer on this
> temporal basis. You can see the details in the paper:
>
> Raphaël Troncy, Jean Carrive, Steffen Lalande and Jean-Philippe Poli.
> "A Motivating Scenario for Designing an Extensible Audio-Visual Description
> Language". In International Workshop on Multidisciplinary Image, Video, and
> Audio Retrieval and Mining (CoRIMedia), Sherbrooke, Canada, October 25-26,
> 2004.
> http://www.cwi.nl/~troncy/Publications/Troncy-corimedia04.pdf
>
> Does that help to solve the problem?

Interesting idea. In fact, I really enjoyed reading the paper. :-)

However, this won't help. Let me explain.

In Annodex we went even further: we defined that you can choose an
appropriate temporal resolution for your video by providing a 64-bit
unsigned integer as the denominator for temporal resolutions
(http://annodex.net/TR/draft-pfeiffer-annodex-02.html, section 4.1). In
your case, it would store 14112000. This helps in accurately
specifying, e.g., the exact time offset at which the video starts. Thus,
the video itself can specify at a high resolution where it starts and
ends.
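To illustrate the idea (a minimal sketch with hypothetical helper names, not
the Annodex or paper's actual API): any time offset can be stored exactly as
an integer count of ticks over the common denominator, which by construction
divides evenly into each of the usual sample and frame rates:

```python
from math import lcm

# Least common multiple of the common audio sample rates and video
# frame rates mentioned in the paper: 14112000 ticks per second.
TICKS_PER_SECOND = lcm(96000, 44100, 30, 25, 24)

def seconds_to_ticks(seconds):
    """Represent a time offset as an integer tick count (exact for any
    instant that falls on a sample or frame boundary)."""
    return round(seconds * TICKS_PER_SECOND)

# One frame of 25 fps video is an exact integer number of ticks:
frame_25fps = TICKS_PER_SECOND // 25   # 564480 ticks, no remainder
# So is one audio sample at 44100 Hz:
sample_44k1 = TICKS_PER_SECOND // 44100  # 320 ticks
```

This makes points *within* a single resource exact, which is precisely why it
does not solve the composition problem described below.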

However, a randomly picked temporal fragment request for a video can
almost never be *composed* such that it starts and ends exactly at the
requested times.

So, I may be requesting 1:23-1:45.32. The file that is being retrieved
for this time interval is a compressed audio/video file and we do not
want to re-encode it. Also, we want to make sure that the video
fragment that is returned is still a complete video that can be
decoded by the receiver. The encoding packets of the video may
therefore only allow us to reply, e.g., with the fragment 1:22.30 -
1:46.00. The client video player won't mind this, because to a human
that is close enough in accuracy. But to the proxy in the middle it
makes a big difference, since e.g. the next request for 1:45.33-1:56
may only be satisfiable with 1:45.00 - 1:56.30. Thus, the proxy may
have a duplicated second in the middle that it does not know about.
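To make the mismatch concrete (a sketch using the illustrative numbers from
the example above, converted to seconds):

```python
# Requested time ranges (in seconds) vs. what the server can actually
# return without re-encoding, snapped to decodable packet boundaries.
# 1:23-1:45.32 is answered with 1:22.30-1:46.00, and
# 1:45.33-1:56 is answered with 1:45.00-1:56.30.
req1, got1 = (83.0, 105.32), (82.30, 106.00)
req2, got2 = (105.33, 116.0), (105.00, 116.30)

def overlap(a, b):
    """Length in seconds by which two intervals overlap (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# The two *requests* are disjoint...
assert overlap(req1, req2) == 0.0
# ...but the *delivered* fragments share a full second of media, which a
# proxy that only sees time ranges cannot detect or deduplicate.
assert overlap(got1, got2) == 1.0
```

The same snapping can just as easily leave a gap between two fragments, so
the proxy can never be sure whether concatenating cached fragments yields the
exact requested interval.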


Having said all this, and having thought about the problem over the last
few weeks and discussed it with people here, there may still be a
chance to use time ranges and # fragments as part of the client-server
communication. I need another internal discussion here before I'm
ready to describe it. Bear with me.

Cheers,
Silvia.
Received on Tuesday, 14 October 2008 18:05:47 GMT
