[GGIE] Embedded video metadata

Dear GGIE folks,

Here’s an expanded discussion of some of the topics that came up on the previous call.

Many second-screen applications depend on having rich metadata associated with video. But how is the metadata connected to the video?

Different kinds of metadata have different temporal scopes. For example, metadata describing which actors are in a movie applies to the entire content. You could have metadata describing the latitude and longitude where each chapter of a video was filmed. Or you could synchronize the positions of cars on a racetrack, which requires metadata synced to a precise timecode. At Cisco, we’ve made apps that demonstrate each of these use cases.
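
Just to make the scoping idea concrete, here is a rough sketch of what those three scopes might look like as data, in Python. The field names are purely illustrative, not taken from any existing spec:

    # Illustrative only: three temporal scopes for video metadata.
    whole_content = {"scope": "content", "actors": ["Alice", "Bob"]}

    per_chapter = [
        {"scope": "range", "start": "00:00:00", "end": "00:12:30",
         "lat": 51.5007, "lon": -0.1246},
    ]

    per_timecode = [
        {"scope": "timecode", "tc": "00:01:02:15",
         "cars": [{"id": 7, "lat": 52.0786, "lon": -1.0169}]},
    ]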

Second-screen applications require a way to know what video is being displayed. For example, you could tell an application that you are watching channel 4, and then the app could check the schedule to find out what show is currently playing. Even better, the app could call an API that talks directly to a set-top box to find out what show is playing, even if the user is watching a time-delayed or recorded program. An API could also let the app get the current timecode and playback speed, or even change them for two-way synchronization. The Cisco Videoscape Open API is one example of this kind of interface.
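
To sketch what such an API might look like from the app’s side (everything below is hypothetical; these are not the actual Videoscape endpoints or data formats):

    # Hypothetical second-screen client talking to a set-top box on the LAN.
    import json
    import urllib.request

    STB = "http://192.168.1.50:8080"  # made-up address of the set-top box

    def get_playback_state():
        """Ask the set-top box what it is playing right now."""
        with urllib.request.urlopen(STB + "/playback/state") as resp:
            # e.g. {"contentId": "...", "timecode": 1234.5, "speed": 1.0}
            return json.load(resp)

    def set_timecode(seconds):
        """Two-way sync: tell the set-top box to seek to a new timecode."""
        req = urllib.request.Request(
            STB + "/playback/timecode",
            data=json.dumps({"timecode": seconds}).encode(),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req)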

Once the application knows the content, timecode and playback speed, it can do something interesting with time-based metadata… but today’s second-screen apps need a separate backchannel to get that metadata. An app has to recognize a particular piece of content, go fetch the metadata from somewhere else, and then display it to the user. Anything that interferes with the application’s ability to recognize the video, such as downloading, re-encoding, or retransmission, may break this mechanism.

There are two approaches to making metadata “stick” to the video. One is to give the video a persistent identifier, which can be used to look up a metadata source. From a programming perspective, this is passing the data by reference. You only need to add a short identifier string to the video, but any app that wants to do something with that video has to recognize the string and go off somewhere else to find the relevant metadata.
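
In code, the by-reference model looks roughly like this (the registry URL and identifier are made up for illustration):

    # Hypothetical "by reference" model: the video carries only a short ID string.
    import json
    import urllib.request

    METADATA_REGISTRY = "https://metadata.example.org/videos/"  # made-up service

    def resolve_metadata(video_id):
        """Exchange the identifier embedded in the video for metadata fetched elsewhere."""
        with urllib.request.urlopen(METADATA_REGISTRY + video_id) as resp:
            return json.load(resp)

    # The only thing that travels inside the video file itself:
    embedded_id = "abc123"
    metadata = resolve_metadata(embedded_id)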

Another approach would be to embed the metadata directly into the video, in the form of extra data blocks (text, key/value store, XML, binary data, whatever). This is the equivalent of passing the data by copy. The advantage is that the data is tightly coupled to the video, so any application that knows how to handle a particular type of metadata can use it directly, without having to know where else on the network to find it.

For example, you could add an XML blurb to a video file with information about the actors, add subtitles that apply to particular segments of the video, or attach data specifying the position of each race car to every frame.
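
By contrast with the by-reference sketch above, the by-copy approach would carry blocks like the following inside the container itself (hand-waving the actual muxing; this just shows the shape of the data an app would read straight out of the file):

    # Illustrative "by copy" metadata blocks that travel inside the video file.
    embedded_blocks = {
        # Applies to the whole content.
        "actors.xml": "<cast><actor>Alice</actor><actor>Bob</actor></cast>",
        # Applies to one segment of the video.
        "subtitles": [{"start": 12.0, "end": 15.5, "text": "Hello there."}],
        # Applies to a single frame.
        "car_positions": [{"frame": 1502,
                           "cars": [{"id": 7, "lat": 52.0786, "lon": -1.0169}]}],
    }

    # An app that understands "car_positions" uses it directly; no network lookup needed.
    print(embedded_blocks["car_positions"])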

With JPEG photos, EXIF has basically solved this problem. At the time a photo is taken, the camera can write various kinds of information into a key/value store embedded inside the photo file. For example, every digital camera has a clock, so it can embed the time each photo was taken. Many also have GPS, so they can embed the location where each photo was taken as well. Even when the photo is copied to a computer, edited in Photoshop, and uploaded to the web, the EXIF metadata lives on inside the photo.
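
Reading that key/value store back out is easy; for example, with the Pillow library in Python:

    # Dump the EXIF key/value store embedded in a JPEG (requires Pillow).
    from PIL import Image, ExifTags

    img = Image.open("photo.jpg")
    exif = img.getexif()

    for tag_id, value in exif.items():
        tag_name = ExifTags.TAGS.get(tag_id, tag_id)  # map the numeric tag to a readable name
        print(tag_name, value)  # e.g. DateTime, Make, Model, ...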

Now imagine this use case: at a sporting event in a stadium, hundreds of people are capturing video of the same event from different angles, and each spectator may film different portions of it. Suppose they have all uploaded their footage to various websites, and a video producer wants to find all these videos, download them, and edit them into one continuous narrative.

Now suppose each person’s video camera (which is often a smartphone) captures the precise time at which the user hit the record button, and the exact latitude and longitude of their location. The producer could then query a website for all videos shot within a given time period (regardless of when they were edited or uploaded) and within a given geofence, say within 500 meters of the center of the stadium.
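
A back-of-the-envelope version of that query, assuming each uploaded video exposes its recorded start time (as a datetime) and capture location:

    # Sketch of a time-window + geofence filter over uploaded videos.
    import math

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance in meters between two lat/lon points."""
        r = 6371000.0
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def find_clips(videos, start, end, center_lat, center_lon, radius_m=500):
        """Keep videos shot inside the time window and within radius_m of the center."""
        return [v for v in videos
                if start <= v["recorded_at"] <= end
                and haversine_m(v["lat"], v["lon"], center_lat, center_lon) <= radius_m]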

Now imagine they could download and drop all of these videos into an editing program. Like magic, all the videos would appear in the timeline as different tracks according to when they were shot, allowing the producer to pick the best video for any given timecode. They wouldn’t have to spend any time aligning them to get different views of the same action. And now imagine that at any given timecode, the producer could see a map of the stadium with an icon showing where each spectator was sitting when they shot their video. With the location embedded in each video, all of this would be possible.
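
The alignment step itself becomes trivial once the recording start time travels with each file; roughly:

    # Place clips on a shared timeline using their embedded recording start times.
    def place_on_timeline(videos):
        """Return (video, offset_in_seconds) pairs relative to the earliest clip."""
        t0 = min(v["recorded_at"] for v in videos)
        return [(v, (v["recorded_at"] - t0).total_seconds()) for v in videos]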

Just to add an extra dimension to the story, imagine the video is being shot by a moving camera. Rather than simply recording the latitude and longitude of the spot where the user was standing when they started filming, the camera could record its current location with every frame. So if a camera were mounted inside a car as it was going around a track, the car’s location could be embedded right inside the video.
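
In the same illustrative terms as above, a per-frame location track is just the frame-rate version of the EXIF idea (assuming the camera keeps a sorted list of time-stamped GPS samples alongside the frames):

    # Illustrative per-frame location track for a camera mounted in a moving car.
    # frames: sorted (frame_number, seconds); gps: sorted (seconds, lat, lon).
    def location_track(frames, gps):
        """Tag every frame with the latest GPS sample taken at or before that frame."""
        track, i = [], 0
        for frame_no, t in frames:
            while i + 1 < len(gps) and gps[i + 1][0] <= t:
                i += 1
            _, lat, lon = gps[i]
            track.append({"frame": frame_no, "lat": lat, "lon": lon})
        return track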

Surely a lot of smart folks have been working on these topics already, so it would be interesting to see what has been done in this space.

Cheers,
Andrew

Received on Wednesday, 3 June 2015 14:38:17 UTC