Copyright © 2003 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark and document use rules apply.
Many applications that involve multimedia content make use of some form of metadata that describes this content. The present document aims to provide guidelines for using Semantic Web languages and technologies in order to create, store, manipulate, interchange and process image metadata. It gives a number of use cases to exemplify the use of Semantic Web technology for image annotation, an overview of RDF and OWL vocabularies developed for this task and an overview of relevant tools.
Note that many approaches to image annotation predate Semantic Web technology. Interoperability between these technologies and RDF and OWL-based approaches will be addressed in a future document.
Institutions and organizations with research and standardization activities in the area of multimedia, professional (museums, libraries, audiovisual archives, media production and broadcast industry, image and video banks) and non-professional (end-users) multimedia annotators.
This is a public (WORKING DRAFT) Working Group Note produced by the Multimedia Annotation in the Semantic Web Task Force of the W3C Semantic Web Best Practices & Deployment Working Group, which is part of the W3C Semantic Web activity.
Discussion of this document is invited on the public mailing list public-swbp-wg@w3.org (public archives). Public comments should include "comments: [MM]" at the start of the Subject header.
Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress. Other documents may supersede this document.
TO BE DONE: Corrections, delete and add material
Before starting any image annotation project, one should be aware that image annotation is notoriously difficult, because it involves trade-offs along several dimensions:
Manual versus automatic annotation and the "Semantic Gap". In general, manual annotation can provide image descriptions at the right level of abstraction. It is, however, time consuming and thus expensive. In addition, it proves to be highly subjective: different human annotators tend to "see" different things in the same image. On the other hand, annotation based on automatic feature extraction is relatively fast and cheap, and free of human bias. It tends to result, however, in image descriptions that are too low level for many applications. The difference between the low level feature descriptions provided by image analysis tools and the high level content descriptions required by the applications is often referred to in the literature as the Semantic Gap. In the remainder, we will discuss use cases, vocabularies and tools for both manual and automatic image annotation.
Different vocabularies for different types of metadata. While various classifications of metadata have been described in the literature, every annotator should at least be aware of the difference between annotations describing properties of the image itself, and those describing the subject matter of the image, that is, the properties of the objects, persons or concepts depicted by the image. In the first category, typical annotations provide information about title, creator, resolution, image format, image size, copyright, year of publication, etc. Many applications use a common, predefined and relatively small vocabulary defining such properties. Examples include the Dublin Core and VRA Core vocabularies [add refs]. The second category describes what is depicted by the image, which can vary widely with the type of image at hand. As a result, one sees a large variation in the vocabularies used for this purpose. Typical examples range from domain-specific vocabularies (for example, with terms that are very specific to astronomy images, or sport images, etc.) to domain-independent ones (for example, a vocabulary with terms that are sufficiently generic to describe any news photo). In addition, vocabularies tend to differ in size, granularity, formality, etc. In the remainder, we discuss both metadata categories. Note that in the first category it is not uncommon that a vocabulary only defines the properties and defers the definitions of the values of those properties to another vocabulary. This is true, for example, for both Dublin Core and VRA Core. This means that, typically, one uses terms from multiple vocabularies in order to annotate a single image.
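The combination of vocabularies described above can be illustrated with a minimal sketch. It assumes the Dublin Core element namespace for the properties of the image itself and a hypothetical subject vocabulary (http://example.org/subjects#) for what the image depicts; the image URI, the property values and the helper names triple and annotate are all invented for illustration. The statements are written out as N-Triples using plain string handling rather than an RDF library.

```python
# Dublin Core element namespace; the subject vocabulary below is hypothetical.
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/subjects#"


def triple(subject, predicate, obj, literal=False):
    """Serialize one statement as a single N-Triples line."""
    if literal:
        # Escape backslashes and double quotes inside the literal value.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = '"' + escaped + '"'
    else:
        obj_term = "<" + obj + ">"
    return "<%s> <%s> %s ." % (subject, predicate, obj_term)


def annotate(image_uri, image_properties, depicted_concepts):
    """Combine both metadata categories for one image.

    image_properties: Dublin Core element name -> literal value
                      (properties of the image itself).
    depicted_concepts: local names taken from the hypothetical subject
                       vocabulary (what the image depicts).
    """
    lines = []
    for name, value in sorted(image_properties.items()):
        lines.append(triple(image_uri, DC + name, value, literal=True))
    for concept in depicted_concepts:
        # dc:subject is defined by Dublin Core; its value comes from
        # another vocabulary.
        lines.append(triple(image_uri, DC + "subject", EX + concept))
    return lines


annotation = annotate(
    "http://example.org/images/scan-001.jpg",
    {"title": "Sunset over the harbour",
     "creator": "A. Photographer",
     "format": "image/jpeg"},
    ["Sunset", "Harbour"],
)
print("\n".join(annotation))
```

Here the property dc:subject comes from Dublin Core while its values come from a second vocabulary, which is the situation in which a single image ends up annotated with terms from multiple vocabularies.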
A fine arts museum has asked a specialized company to produce high resolution digital scans of the most important art works in its collections. The museum's quality assurance requires the possibility to track when, where and by whom every scan was made, with what equipment, etc. The museum's internal IT department, which maintains the underlying image database, needs the size, resolution and format of every resulting image. It also needs to know the repository ID of the original work of art. The company developing the museum's website additionally requires copyright information (which varies for every scan, depending on the age of the original work of art and the collection it originates from). It also wants to give the users of the website access to the collection, not only based on the titles of the paintings and the names of their painters, but also based on the topics depicted ('sun sets'), genre ('self portraits'), style ('post-impressionism'), period ('fin de siècle') and region ('west european').
Audiovisual archives manage very large multimedia databases. For instance, INA, the French Audiovisual National Institute, has been archiving TV documents for 50 years and radio documents for 65 years and stores more than 1 million hours of broadcast programs. The image and sound archives kept at INA are either intended for professional use (journalists, film directors, producers, audiovisual and multimedia programmers and publishers, in France and worldwide) or communicated for research purposes (for a public of students, research workers, teachers and writers). In order to allow efficient access to the stored data, most parts of these video documents are described and indexed by their content. The global multimedia information system should therefore be fine-grained and detailed enough to support very complex and precise queries. For example, a journalist or a film director might ask for an excerpt of a previously broadcast program showing the first goal of a given football player for his national team, scored with his head. The query could additionally contain more technical requirements, for instance that the goal action should be available from both the front camera view and the reverse angle camera view. Finally, the client might or might not remember some general information about this football game, such as the date, the place and the final score.
A media production house requires several web services in order to organise and implement its projects. Usually, pre-production and production start with the search and retrieval of locations, people, images and footage, in order to speed up the process and reduce the cost of the production as much as possible. For that reason, several multimedia archives (image and video banks, location management databases, casting houses, etc.) provide this information through the web. Every day, media producers, location managers, casting managers, etc. search these archives in order to find the appropriate resources for their projects. The quality of this search and retrieval process directly affects the quality of the service that the archives provide to their users. In order to facilitate this process, the annotation of image content should make use of Semantic Web technologies, while also following the multimedia standards in order to be interoperable with other archives. Using, for example, the tools described below, the people who archive content in the media production chain can provide all the necessary information (administrative, structural and descriptive metadata) in a standard form (RDF, OWL) that is easily accessible to other people over the web. Using the Semantic Web standards, the archiving, search and retrieval processes will then make use of semantic vocabularies (ontologies) describing the structure of the content, from thematic categories to descriptions of the main objects appearing in the content together with their main visual characteristics, etc. In this way, multimedia archives will make their content easily accessible over the web, providing a unified framework for media production resource allocation.
Recent advances in digital technologies (cameras, computers, storage, communication, etc.) have caused a huge increase in the digital multimedia information captured, stored and distributed by personal users over the web. Digital formats now provide the cheapest, safest and easiest way to capture, store and deliver multimedia content. Most personal users have thousands of photos (from vacations, parties, travelling, conferences, everyday life, etc.), usually stored in several resolutions on the hard disk of their computers in a simple directory structure without any metadata. Ideally, the user wants to easily access this content, view it, create presentations, use it on a homepage, deliver it over the internet to other people, make part of it accessible to other people or even sell part of it to image banks, etc. Unfortunately, the only way to access this content is by browsing the directories, whose names usually provide the date and describe in one or two words the original event captured by the photos. This kind of access becomes more and more difficult as the number of photos increases every day, and the content will soon become practically inaccessible. A solution to this problem that covers almost all present and future uses of the content is to describe and archive each photo with the aid of a tool providing a semantic metadata structure based on Semantic Web technologies (see for example the tools below). Using such tools, users can access their photos through virtual views (taxonomies), keywords describing their content or administrative information (like the date of archiving or the resolution of the photo), etc. Most importantly, the standardisation of the metadata format ensures that the content is accessible to other people and can be shared and used over the web.
Link: http://maenad.dstc.edu.au/slittle/mpeg7.owl
Summary: Chronologically the first one, this MPEG-7 ontology was first developed in RDFS [1], then converted into DAML+OIL, and is now available in OWL. This is an OWL Full ontology (note: three small mistakes inside the OWL file need to be corrected first: &xsd;nil should be replaced by &rdf;nil, otherwise the file is not valid OWL).
The ontology covers the upper part of the Multimedia Description Scheme (MDS) part of the MPEG-7 standard. It consists of about 60 classes and 40 properties.
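The correction mentioned in the summary amounts to a plain textual substitution. The following is a minimal sketch, assuming the OWL file has already been read into a string; the function name fix_nil_entities and the sample fragment are hypothetical and are not copied from the actual file.

```python
def fix_nil_entities(owl_text):
    """Replace every '&xsd;nil' entity reference with '&rdf;nil'.

    Returns the corrected text and the number of replacements made,
    so the caller can check how many occurrences were found.
    """
    wrong, right = "&xsd;nil", "&rdf;nil"
    count = owl_text.count(wrong)
    return owl_text.replace(wrong, right), count


# Hypothetical fragment, for illustration only.
sample = '<rdf:rest rdf:resource="&xsd;nil"/>'
fixed, replaced = fix_nil_entities(sample)
print(fixed)
print(replaced)
```

Working on the raw text rather than on a parsed document keeps the entity references intact, since an XML parser would expand them before the substitution could be made.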
References:
Link: http://elikonas.ced.tuc.gr/ontologies/av_semantics.zip.
Summary: Starting from the previous ontology, this MPEG-7 ontology covers the full Multimedia Description Scheme (MDS) part of the MPEG-7 standard. It contains 420 classes and 175 properties. This is an OWL DL ontology.
References:
Link: http://dmag.upf.edu/ontologies/mpeg7ontos/.
Summary: This MPEG-7 ontology has been produced fully automatically from the MPEG-7 standard in order to give it a formal semantics. For such a purpose, a generic mapping XSD2OWL has been implemented. The definitions of the XML Schema types and elements of the ISO standard have been converted into OWL definitions according to the table given in [3]. This ontology could then serve as a top ontology thus easing the integration of other more specific ontologies such as MusicBrainz. The authors have also proposed to transform automatically the XML data (instances of MPEG-7) into RDF triples (instances of this top ontology).
This ontology aims to cover the whole standard and is thus the most complete one (compared with those mentioned previously). In total, it contains 2372 classes and 975 properties. This is an OWL Full ontology since it employs the rdf:Property construct to cope with the fact that there are properties that have both datatype and object type ranges.
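The reason for falling back to rdf:Property can be sketched as a small decision rule. This is only an illustration of the reasoning stated above, not the actual XSD2OWL implementation: the function name, the "datatype" and "object" labels and the example inputs are assumptions.

```python
def property_construct(range_kinds):
    """Choose the construct used to declare a property.

    range_kinds: iterable of "datatype" and/or "object", one entry per
    range the property is used with (labels are hypothetical).
    """
    kinds = set(range_kinds)
    if not kinds or not kinds <= {"datatype", "object"}:
        raise ValueError("expected one or more of 'datatype', 'object'")
    if kinds == {"datatype"}:
        return "owl:DatatypeProperty"
    if kinds == {"object"}:
        return "owl:ObjectProperty"
    # Both kinds of range occur: neither OWL DL property type fits,
    # so the generic rdf:Property is used (which leads to OWL Full).
    return "rdf:Property"


print(property_construct(["datatype"]))
print(property_construct(["object", "object"]))
print(property_construct(["datatype", "object"]))
```

A single property declared with the generic construct is enough to take the whole ontology out of OWL DL, which is why the complete mapping is classified as OWL Full.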
References:
Link: store this ontology on CWI for ease of reference?
Link: http://www.acemedia.org/aceMedia/reference/resource/index.html, the current version is 9.0.
Summary: The Visual Descriptor Ontology (VDO), developed within the aceMedia project for semantic multimedia content analysis and reasoning, contains representations of MPEG-7 visual descriptors and models Concepts and Properties that describe visual characteristics of objects. By the term descriptor we mean a specific representation of a visual feature (color, shape, texture, etc.) that defines the syntax and the semantics of a specific aspect of the feature. For example, the dominant color descriptor specifies, among other things, the number and value of the dominant colors that are present in a region of interest and the percentage of pixels that each associated color value has. Although the construction of the VDO is tightly coupled with the specification of the MPEG-7 Visual Part, several modifications were carried out in order to adapt the XML Schema provided by MPEG-7 to an ontology and to the data type representations available in RDF Schema.
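The kind of information carried by a dominant color descriptor can be illustrated with a simplified sketch. It assumes a region of interest given as a list of RGB tuples and simply counts exact color values; actual MPEG-7 extraction quantizes and clusters colors, which is not reproduced here. The function name dominant_colors and the sample region are hypothetical.

```python
from collections import Counter


def dominant_colors(region_pixels, max_colors=3):
    """Return up to max_colors (color, percentage) pairs for a region.

    region_pixels: list of (r, g, b) tuples for the region of interest.
    The percentage is the share of the region's pixels with that color.
    """
    counts = Counter(region_pixels)
    total = len(region_pixels)
    return [(color, 100.0 * n / total)
            for color, n in counts.most_common(max_colors)]


# A toy region: 6 red, 3 blue and 1 green pixel.
region = [(255, 0, 0)] * 6 + [(0, 0, 255)] * 3 + [(0, 255, 0)] * 1
result = dominant_colors(region, max_colors=2)
print(result)
```

The result holds the number of dominant colors (the length of the list), their values and the percentage of pixels for each, which are the items the summary names for this descriptor.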
References:
Link: http://www.mindswap.org/2005/owl/digital-media.
Summary:
References:
Link: http://www.cs.vu.nl/~laurah/VO/visualWordnetschema2a.rdfs.
Summary:
References:
TO BE DONE: Short description and categorisation of important relevant work
TO BE DONE: Short description and categorisation of important projects and events