Copyright © 2008 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document specifies the use cases and requirements that have motivated the development of the Media Ontology 1.0. The Media Ontology is a simple ontology to support cross-community data integration of information related to media objects on the Web.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is an Editorial Draft of a possible future W3C Recommendation. [Editorial note: is it? I thought it was just a deliverable for the Working Group.]
Comments on this document may be sent to the public mailing list public-media-annotation@w3.org (archived at http://lists.w3.org/Archives/Public/public-media-annotation/).
Publication as an Editorial Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document is published as part of the W3C Video in the Web Activity by the Media Annotation Working Group. It is a deliverable as defined in the Charter of that group.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
Anticipating the increase in online video and audio in the coming years, we can foresee that it will become progressively more difficult for viewers to find content using current search tools. Unlike with hypertext documents, it is complex and sometimes impossible to deduce metadata about a medium, such as its title, author, or creation date, from its content alone. In response, a proliferation of media metadata formats has emerged for authors to express this information. For example, an image could potentially contain EXIF, IPTC and XMP information. There are also several metadata solutions for media-related content, including MPEG-7, IPTC, iTunes XML, Yahoo! MediaRSS, Video sitemaps, CableLabs VOD Metadata Content, TV-Anytime (ETSI TS 102 822 series), EBU Core Metadata Set, and XMP.
The Media Ontology 1.0 will address the underlying interoperability problem by providing a common set of terms to define the basic metadata needed for media objects, together with links between their values in different existing vocabularies. It will help circumvent the current proliferation of video metadata formats by providing full or partial translation and mapping between the existing formats.
This document specifies the use cases and requirements that have motivated the development of the Media Ontology 1.0. The purpose of this ontology is to support cross-community and cross-media data integration of information related to media objects on the Web. Before presenting the use cases themselves, the following section describes media objects according to three orthogonal dimensions, which will be used to characterise the use cases at a meta level. This section represents a view of the ontology to be created from a theoretical perspective, while the use cases are the basis of a concrete, problem-oriented requirements definition. The two approaches are considered complementary and were developed in parallel, in order to define an ontology that is as complete and as relevant as possible, but at the same time as simple as possible. The theoretical description of the viewpoints on media objects is called the Top-Down Modelling Approach; the use cases and requirements are described in the sections of the same name.
There are three dimensions along which we can look at the media annotation problem:
1. The media: the question here is which particular aspects of a medium need to be described to facilitate actions performed by the user. Media differ in their expressive strength (e.g. visuals are strong in their denotative power, whereas audio or haptics are better at stimulating feelings, and text is strong in paradigmatic processes). Taking into consideration the cognitive power of a medium might help us distil the basics to be described to achieve the widest coverage. Media also differ in the content dimensions they support, e.g. time, 2D space, 3D space. If time is the important dimension, then description schemes designed for timed text or audio might easily be taken into account and extended to describe audio documents; if an audiovisual document is considered in its visual dimension, a description scheme could be common to video and image description. Defining a document or a task along this media-related dimension helps group document types under a common (sub-)description scheme.
2. The context: this describes the circumstances under which the media is accessed, e.g. presentation generation, pure search, mobile environment (i.e. display), etc., and the combination/embedding with other media items, e.g. inclusion in a Web page, text/images/video clips in an EPG, etc. The relevant question here is: which information elements are necessary and/or relevant to achieve the (description of the) correct context? For example, in a scenario where media documents are displayed on a mobile device (see the "Mobile" use case for more details), in connection with the place where the user of the device stands, the essential attributes are those of ‘location’. Once they are clearly defined, we have to determine how they can be minimally described so that a larger variety of processes/actions can be performed. Our assumption here is that we do not model the processes but rather design metadata that allow applications to handle the material appropriately; we also do not intend to model process-related metadata, e.g. processing applied to a medium.
3. The actual tasks performed by the user: these require particular information to be performed correctly. The relevant questions are: how should whatever we design support the tasks users perform on and with media? Which tasks (e.g. search, manipulation, generation, etc.) would we like to support? Do we make a distinction between general and specific tasks (general tasks are those that can work alone, such as search, whereas specific tasks are those that need others to be functional, e.g. for manipulating a video, it first needs to be found, retrieved and displayed)? Finally, what are the essential terms/tags/description structures we have to come up with?
For example, in the context of a search task, we can consider the requirements described in the following paragraphs for a complex annotation schema.
There are two main viewpoints that can be considered for media documents: their physical content and their semantic content. The first is indexed with the objective of describing parts of the content to be reused in other contexts (re-used in new documents), and the second is indexed for its message: the first focuses on the physical level of the media (optionally with subjective comments about feelings conveyed by the described piece), the second on the content level. The scope of the Media Ontology 1.0 is limited to content description.
The content level of a media document is often described in a particular schema, which separates the description into fields, giving a more or less defined type to the annotation value in question. But these fields are not interconnected: the annotation is global to the document, whatever document unit is selected. The fields usually represent the people seen on screen or mentioned in the document, names (brands, companies, band names, etc.), locations and generic content description keywords, but without explicit associations between them: what event is related to what person and what location, for example, is not clearly defined. Some schemas, like CIDOC-CRM in the museum world [CIDOC], the structured textual annotation scheme in MPEG-7 [MPEG-7], the annotation schema of the MultimediaN E-Culture project [WHOS], or that of the MuseumFinland project [MF], make an explicit relationship between the different elements of one description to be attached to an object. Descriptions are then event-centred as well as document-centred, some elaborating on the scheme sketched by [TL] of the "what, where, who, when, how", with the different parts of the graph filled in by values coming from ontologies.
In large archives, and particularly for those with a very homogeneous collection, making links or graphs to connect the different pieces of the annotation that belong together is very important for the precision of, and for enhancing, the search. Without such explicit links, even when using keywords for the annotation, searching the archives raises the same problems as making a complex query on the Web without using double quotes: there is no guarantee that the elements searched for belong together even if they stand in the same annotation. For example, if one document describes three heads of state meeting at a conference and two of them shaking hands, the countries of the different heads of state are likely to be mentioned, as well as their names and the action "shaking hands", but it is unlikely that this action will be connected with its actual actors. Searching for any of these heads of state shaking hands will retrieve the document.
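As an illustration only, the following Python sketch (with hypothetical field names, not properties of the Media Ontology) contrasts a flat keyword annotation with an event-centred annotation that links actors to actions, showing how the flat form produces the false positive described above.

# Minimal sketch (hypothetical field names) contrasting flat keyword
# annotation with an event-centred annotation that links actors to actions.

flat_annotation = {
    "persons": ["Head of State A", "Head of State B", "Head of State C"],
    "locations": ["Conference Centre X"],
    "keywords": ["summit", "shaking hands"],
}

linked_annotation = {
    "events": [
        {"action": "meeting",
         "agents": ["Head of State A", "Head of State B", "Head of State C"],
         "location": "Conference Centre X"},
        {"action": "shaking hands",
         "agents": ["Head of State A", "Head of State B"]},
    ]
}

def flat_match(annotation, person, action):
    # Over-matches: only checks co-occurrence within the same document.
    return person in annotation["persons"] and action in annotation["keywords"]

def linked_match(annotation, person, action):
    # Precise: the person must be an agent of the action itself.
    return any(e["action"] == action and person in e["agents"]
               for e in annotation["events"])

print(flat_match(flat_annotation, "Head of State C", "shaking hands"))      # True (false positive)
print(linked_match(linked_annotation, "Head of State C", "shaking hands"))  # False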
What we aim for with the Media Ontology 1.0 is a minimal set of properties that allow us to describe a medium in a way that covers all three dimensions (media, context and task). The general question for each dimension is: how granular do we intend to become? This in turn raises the question: are we aiming for tags only, for clustered tags (a cluster is a dimension and the associated terms are the tags), or for structured descriptions, i.e. classes with attributes and relations between them (e.g. each dimension is a class; subclasses are e.g. the media types, each of which might have attributes)?
We consider the following media: text, image, video, audio, multi-source audiovisual content (e.g. multi-view video, surround audio), 3D models and scenes, haptics, and olfactory media. This statement gathers what was defined at first as "the audio use case" and "the image use case": models describing these media will be taken into account when defining our ontology as an interlingua between the large variety of models. A more detailed list of formats judged "in scope" and "out of scope" has been discussed in the document [REF in scope/out of scope].
For each medium we have to consider modes, such as: static or interactive, fixed or mobile, realistic or abstract. For audio also: voice or melodic.
For each medium, the type-specific metadata necessary to access media of that type need to be supported.
Example: if we have a news video, and we wish to support the combination of videos into a personalized news show (for more details, see the "personalisation" use case), some aggregation of material needs to be performed. Queries need to be enabled to search on the following dimensions:
Each use case addresses a certain context, which requires particular tasks to be performed. Below is a list of example use cases which do not represent a medium or mode only:
Our use cases (see the "use case" section) provide a vast number of tasks that can be performed on media, besides the ones mentioned above. Here is the list of (simple) tasks extracted from the current use cases, which represent complex combinations of these:
Search, tag, adapt, personalize, filter (collaboratively), reuse, exchange, map, merge, extract, maintain, listen, read, watch, mix, generate, summarize, present, interact, query, retrieve.
The following is a list of basic tasks derived and abstracted from the above:
Tasks like "interact" are complex compositions of the above: they involve a query, retrieval, display, possibly editing and distributing. Such complex tasks can be used as test beds to be sure that the modelling of the different individual somple tasks keeps them compatible between easch other.
While it seems promising to build on basic or generic tasks (such as those defined in the Canonical Processes papers) not bound to specific use cases, these tasks might be very different depending on media (e.g. the watch/listen tasks, although otherwise well suited to define a generic consumption task), context (e.g. annotation in end user scenarios like personal image/music collection vs. archive documentation) and other aspects (e.g. there are many types of search requiring different metadata, even when searching the same media type in the same application context).
For all these basic tasks the question remains: in comparison with the other dimensions, what are the basic attributes that need to be described? Therefore, we will try to describe the use cases, presented in detail in the following section, according to these different dimensions.
As already mentioned, there are a number of tasks that require task chains:
Adapt => search for relevant material, generate new context, present
Mix => analyze media, combine material, present
Summarize => analyze the material, extract, present
Inform => observe user behavior, identify related material, generate info block, distribute
These task chains (could we call them processes as well?) seem to be interesting for our purpose, as the answer to the question “which metadata items need to be passed from one task to a subsequent one to make the chain work” helps define the minimum set of attributes.
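As a rough illustration of this idea, the following Python sketch uses the "Adapt" chain above with purely hypothetical attribute names; it is not a proposal for the ontology's vocabulary, only a way of seeing how the union of the attributes each task reads approximates the minimum set the chain needs.

# Minimal sketch, with hypothetical attribute names, of the "Adapt" task chain:
# search for relevant material, generate a new context, present.
# The union of the attributes read by each task approximates the minimum
# attribute set that must be passed along the chain.

REQUIRED = {
    "search":   {"title", "keywords", "language"},
    "generate": {"duration", "format"},
    "present":  {"format", "rights"},
}

def attributes_needed(chain):
    """Union of the attributes read by every task in the chain."""
    needed = set()
    for task in chain:
        needed |= REQUIRED[task]
    return needed

print(attributes_needed(["search", "generate", "present"]))
# e.g. {'title', 'keywords', 'language', 'duration', 'format', 'rights'}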
The analysis so far is rather process oriented. It would be necessary, though, to come up with an example where the user asks for particular facts that can then be provided in any medium. It is important to figure out how much direct content access we have to support.
There are many video service sites on the Web, such as YouTube, Yahoo! Video, etc. These web sites allow users to upload videos, and want to display selected information about the media documents they receive. This use case asks not only for metadata but also for web/DOM-level APIs that give uniform access to selected metadata across a variety of file formats.
Search engines are, in some sense, in scope, since we would like search and/or index engines to be able to do uniform indexing of selected metadata across a variety of formats, so again we need some level of semantic match for those metadata elements across a variety of formats. The current semantic (mis)match problem can be illustrated as follows: consider two metadata systems. System A has tags for the following pieces of information: Title, Artist; system B has tags for these: Title, Sub-Title, Artist, Composer. We find the same work in these two formats:
A: Title="Dvorak Symphony 6, II Adagio", Artist="BBC Symphony Orchestra"
B: Title="Symphony 6", Sub-Title="II Adagio", Artist="BBC Symphony Orchestra", Composer="Dvorak, Antonin"
What does the DOM API return when the script asks for "Composer"? Does the composer get included in the results returned by system B's indexing, even though in system A this information has been included in the title (by default)? Does the first file, given its schema, ever get indexed under the name of Dvorak? And so on.
Note: In this Working Group, we do not aim to solve the semantic mismatch problem but leave that to the application that creates the annotation or performs the retrieval. [Is this really what was decided? I thought that we would not make links between "Dvorak" and "Dvorak, Antonin" as annotation contents, but would aim at bridging the gap between different schemas, making links between one property stating "ArtistName" and two properties stating "ArtistFirstName" and "ArtistFamilyName", for example.]
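To make the property-level bridging idea concrete, here is a Python sketch using the Dvorak example above. The common property names ("title", "performer", "composer") and the mapping tables are purely hypothetical, not the vocabulary the Working Group will define; the point is only that mapping operates on schema properties, not on annotation values.

# Minimal sketch of property-level bridging (not content-level reconciliation),
# using hypothetical common property names.

record_a = {"Title": "Dvorak Symphony 6, II Adagio",
            "Artist": "BBC Symphony Orchestra"}

record_b = {"Title": "Symphony 6", "Sub-Title": "II Adagio",
            "Artist": "BBC Symphony Orchestra",
            "Composer": "Dvorak, Antonin"}

# Per-schema mapping from a common property to the local field(s) that carry it.
MAPPINGS = {
    "schema_a": {"title": ["Title"], "performer": ["Artist"]},
    "schema_b": {"title": ["Title", "Sub-Title"], "performer": ["Artist"],
                 "composer": ["Composer"]},
}

def get(record, schema, common_property):
    """Uniform lookup: returns the local values, or None if the schema has no
    field for the requested common property (as in schema A for 'composer',
    where the information is buried in the title)."""
    fields = MAPPINGS[schema].get(common_property)
    if not fields:
        return None
    return [record[f] for f in fields if f in record]

print(get(record_a, "schema_a", "composer"))  # None: not recoverable by mapping alone
print(get(record_b, "schema_b", "composer"))  # ['Dvorak, Antonin']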
There are several standards involved in describing video documents: CableLabs (1.1, 2.0) [VODCSV], MS TV, TV-Anytime, MPEG-7, XMLTV [XMLTV], Open IPTV, iTunes Podcast, YouTube, MS IMM, OMA BCAST, etc. The goal of this WG is to make our ontology support the bridging of commonly used properties for describing video content across these different standards.
Some properties that will describe the content of these videos and enhance their sharing and reuse are:
The collections of cultural heritage institutions (libraries, museums, archives, etc.) are increasingly digitised and made available on the Web. For large parts of these collections, comprehensive, professionally created documentation is available; however, it often uses domain-specific or even proprietary metadata models. This hinders accessing and linking these collections.
From the point of view of the media annotation WG, the use case of audiovisual content from cultural heritage collections is a broad one and contains the following technical challenges:
The media types that are archived from a cultural heritage perspective range from image to video, including audio (music and radio collections, for example). Contemporary art museums can also archive multimedia installations, and libraries typically archive hyperdocuments. No type of media can be excluded a priori, but generic models like CIDOC-CRM are designed to cover such a broad range of material types. The tasks relevant to this use case are: annotation, search, exchange and merging. The contexts are: searching for documents for their content itself, for reuse or for integration into a more generic presentation; and browsing through different archives to get a global view on a topic and create new, unexpected knowledge.
A list of requirements for the Media Ontology 1.0 can be derived from the second part of the scenario description: an ontology taking this use case into account would have to bridge the gap between different annotation schemas, whether or not they take the annotation of fragments into account, and would have to enable the display of the metadata together with the content itself. The interoperability between annotation vocabularies (controlled vocabularies used at different CH institutions for the annotation) will not be tackled by this ontology. The interoperability between local schemas will be taken into account by first bridging different generic standard schemas that can apply to CH documents.
This use case covers semantic and technical metadata descriptors for mobile devices and applications. It is motivated by the fact that mobile devices are a service platform for billions of users, and have several unique properties that come with their mobility:
This use case implies being able to take the location in the real world into account and compare it with the location in the annotation, which boils down to having location information encoded in the metadata: the comparison should be done on the system side. The physical constraints related to the physical device on which the document is displayed should also be part of its metadata. In this use case, interoperability is required with formats other than the ones listed in the Video use case: formats for identification on the Web. This last requirement makes a link with the PLING Working Group, which aims at gathering different ways of expressing rights and privacy information on the Web and standardizing them. The approach of our Working Group is to have a slot in the ontology for linking to this type of information, but not to define a particular rights and privacy management schema.
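Purely as an illustration of the location comparison done on the system side, the following Python sketch assumes that the annotation carries a simple latitude/longitude pair (the field names and the 1 km radius are hypothetical) and filters media items by their distance from the device.

# Minimal sketch, assuming the annotation carries a geographic point
# (latitude/longitude); filters media items by distance from the device.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def nearby(items, device_lat, device_lon, max_km=1.0):
    """Items whose 'location' metadata lies within max_km of the device."""
    return [i for i in items
            if haversine_km(device_lat, device_lon,
                            i["location"]["lat"], i["location"]["lon"]) <= max_km]

items = [{"title": "Cathedral tour clip", "location": {"lat": 48.853, "lon": 2.349}},
         {"title": "Harbour panorama",    "location": {"lat": 43.296, "lon": 5.370}}]
print([i["title"] for i in nearby(items, 48.857, 2.351)])  # ['Cathedral tour clip']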
The presence of multimedia resources (video, audio, images, etc.) on the Internet is increasing rapidly. How to use these resources and provide useful services attracts more and more interest. However, different multimedia content providers (YouTube, IPTV content providers, etc.) usually have their own specific or proprietary metadata formats (see the MMSEM multimedia vocabularies deliverable at http://www.w3.org/2005/Incubator/mmsem/XGR-vocabularies/). This situation is one of the key problems faced by every serious service provider.
From the perspective of a service provider, let us look at the following scenario and consider what requirements can be derived from it. IPTV (Internet Protocol Television) will be the next generation of television. In IPTV, the television set is connected to the Internet, which gives the user access not only to traditional TV programs, but also to other multimedia resources available on the Internet (YouTube, etc.). At the same time, how to search for a TV program becomes a problem for end users. A recommendation service provider will make recommendations (e.g. collaborative filtering) based on metadata in various models from different content providers, such as EPG data for TV programs, MPEG-7 content on the Internet, YouTube-like videos, etc. To provide the service effectively, the provider currently has to design its own specific interoperation model for the various formats. Besides, some content providers on the Web do not offer a metadata access API, so the recommendation service provider has to do web page extraction to infer the possible metadata models, which leads to very poor reusability and scalability.
With the outcomes of this group, service providers will be able to design and implement algorithms based on an interlinked vocabulary. Standardized metadata makes the related data mining and recommendation easier. In MPEG-7, there are parts related to this problem; however, it barely appears in other standards.
The main task for this use case is recommendation, as indicated in the title. The context is Internet and TV content metadata access and merging, for providing a combined recommendation.
The requirement related to this use case is interoperability between different media description schemas: the ones listed in the scenario description and potentially others.
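As a rough sketch of what such a provider could do once the metadata is interlinked, the Python example below normalises records from two hypothetical sources (an EPG-like feed and a video-site feed; all field names are invented, not taken from any standard) into one vocabulary and then ranks candidates by genre overlap with what the user has watched.

# Minimal sketch, with hypothetical field names, of normalising programme
# metadata from heterogeneous sources into one vocabulary before running a
# simple content-based recommendation; real EPG/MPEG-7 structures are richer.

def normalise(record, source):
    if source == "epg":
        return {"title": record["programmeTitle"], "genres": set(record["genre"])}
    if source == "video_site":
        return {"title": record["name"], "genres": set(record["tags"])}
    raise ValueError(f"unknown source: {source}")

def recommend(watched, candidates):
    """Rank candidates by genre overlap with what the user has watched."""
    profile = set().union(*(item["genres"] for item in watched))
    return sorted(candidates,
                  key=lambda c: len(profile & c["genres"]), reverse=True)

watched = [normalise({"programmeTitle": "Evening News", "genre": ["news", "politics"]}, "epg")]
candidates = [normalise({"name": "Street interviews", "tags": ["news", "vox-pop"]}, "video_site"),
              normalise({"name": "Cat compilation", "tags": ["pets", "humour"]}, "video_site")]
print([c["title"] for c in recommend(watched, candidates)])
# ['Street interviews', 'Cat compilation']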
The tagging use case is one of the use cases which has been elaborated by the Multimedia Semantics (MMSEM) Incubator Group (XG).
The tagging use case is motivated by the fact that users tag different resource types on different platforms, but cannot exchange or reuse "their" tags across these platforms. Furthermore, a personal folksonomy (personomy) maintained on a desktop computer cannot be reused online, and vice versa. The authors discuss a solution to this problem by representing a personomy using SKOS Core, which can then be exchanged between systems or hosted by a central service provider. More information about the tagging use case can be found in the MMSEM XG Interoperability deliverable.
The requirement that we can draw from this use case is that, when dealing with interoperability, the content of the tags has to be communicated between the systems, to enable a cross-platform query mechanism.
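For illustration, the following Python sketch (assuming the rdflib library is available; the namespace, URI and labels are invented) shows how a single personomy tag could be expressed as a SKOS concept and serialised for exchange between platforms, in line with the MMSEM XG proposal.

# Minimal sketch, assuming rdflib, of representing a fragment of a personal
# folksonomy (personomy) in SKOS Core so it can be exchanged between platforms.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

ME = Namespace("http://example.org/personomy/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)

# One tag as a skos:Concept, with a preferred label and a cross-platform alias.
g.add((ME.holiday2008, RDF.type, SKOS.Concept))
g.add((ME.holiday2008, SKOS.prefLabel, Literal("holiday 2008", lang="en")))
g.add((ME.holiday2008, SKOS.altLabel, Literal("vacation08")))

# The serialised graph can then be uploaded to, or fetched from, another system.
print(g.serialize(format="turtle"))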
A person captures their experience, and indeed their entire life, by creating images, audio and videos on the Web. These are what we call life logs today. Life logs are made of various information such as time, location, the creator's profile, human relations, and even emotion. If the life logs are annotated by means of an ontology, the person can easily and efficiently search his/her personal life log information on the Web whenever necessary. Life logs can also be combined with geolocation information (auto-tagging to Google Maps) for easy search and interaction on the Web.
In this section, we gather the requirements that were derived from the different use cases and list some complementary ones.
Complementary requirements:
[CIDOC] The CIDOC Conceptual Reference Model, http://cidoc.ics.forth.gr/index.html.
[MF] MuseumFinland -- Finnish Museums on the Semantic Web, http://www.museosuomi.fi/
[MPEG-7] Information Technology - Multimedia Content Description Interface (MPEG-7). Standard No. ISO/IEC 15938:2001, International Organization for Standardization(ISO), 2001.
[MR] Metaverse Roadmap, http://www.metaverseroadmap.org/roadmap.html
[TL] Structured Natural-Language Description for Semantic Content Retrieval, A M Tam, C H C Leung, Journal of the American Society for Information Science, 2001
[VODCSV] Video-On-Demand Content Specification Version 2.0, MD-SP-VOD-CONTENT2.0-I02-070105, CableLabs, 2007, http://www.cablelabs.com/specifications/MD-SP-VOD-CONTENT2.0-I02-070105.pdf
[WHOS] Thesaurus-based Search in Large Heterogeneous Collections, Jan Wielemaker, Michiel Hildebrand, Jacco van Ossenbruggen, and Guus Schreiber, ISWC 2008.
[XMLTV] XMLTV Project, http://wiki.xmltv.org/index.php/XMLTVProject