Image annotation on the Semantic Web
This version:
N/A
Latest version:
N/A
Previous version:
N/A
Editors:
TO BE REVISED AT THE END
Giorgos Stamou, IVML, National Technical University of Athens, <gstam@softlab.ece.ntua.gr>
Jacco van Ossenbruggen, Center for Mathematics and Computer Science (CWI), <Jacco.van.Ossenbruggen@cwi.nl>
Raphaël Troncy, Center for Mathematics and Computer Science (CWI), <Raphael.Troncy@cwi.nl>
Additional Contributors and Special Thanks to:
TO BE REVISED AT THE END
Jane Hunter, DSTC, <jane@dstc.edu.au>
Guus Schreiber, VU, <schreiber@cs.vu.nl>
Vassilis Tzouvaras, IVML, National Technical University of Athens, <tzouvaras@image.ece.ntua.gr>
Nikolaos Simou, IVML, National Technical University of Athens, <nsimou@image.ece.ntua.gr>
Christian Halaschek-Wiener, UMD, <halasche@cs.umd.edu>
Jeff Pan,
Jeremy Carroll, HP, <jjc@hplb.hpl.hp.com>
John Smith, IBM, <rsmith@watson.ibm.com>
Copyright © 2003 W3C ® (MIT, ERCIM,
Keio),
All Rights Reserved. W3C liability, trademark and document use rules apply.
Many applications that involve multimedia content make use of some form
of metadata that describe this content. This document provides guidelines for using
Semantic Web languages and technologies in order to create, store, manipulate,
interchange and process image metadata. It gives a number of use cases to
exemplify the use of Semantic Web technology for image annotation, an overview
of RDF and OWL vocabularies developed for this task, and an overview of relevant tools.
Note that many approaches to image annotation predate Semantic Web
technology. Interoperability between these technologies and RDF
and OWL-based approaches, however, will be addressed in a future document.
This document is intended for institutions and organizations with research and standardization
activities in the area of multimedia, and for professional (museums, libraries,
audiovisual archives, the media production and broadcast industry, image and video
banks) and non-professional (end-user) multimedia annotators.
This is a public (WORKING DRAFT) Working Group Note produced by the Multimedia
Annotation in the Semantic Web Task Force of the W3C
Semantic Web Best Practices & Deployment Working Group, which is
part of the W3C
Semantic Web activity.
Discussion of this document is invited on the public mailing list public-swbp-wg@w3.org
(public archives). Public comments should
include "comments: [MM]" at the start of the Subject header.
Publication as a Working Group Note does not imply endorsement by the
W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in progress. Other
documents may supersede this document.
The need for annotating digital image data is recognized in a wide
variety of different applications, covering both professional and personal
usage of image data. At the time of writing, most work done in this area is not
based on Semantic Web technology, either because it predates the Semantic Web
or for other reasons. This document explains the advantages of using Semantic
Web languages and technology for image annotation and provides guidelines for
doing so. It is organized around a number of representative use cases, and a
description of Semantic Web vocabularies and tools that could be used to help
accomplish the tasks mentioned in the use cases. The remainder of this
introductory section first gives an overview of image annotation in general,
followed by a short description of the key Semantic Web concepts that are
relevant for image annotation.
Annotating images on a small scale for personal usage can be relatively
simple. The reader should be aware, however, that large-scale, industrial-strength
image annotation is notoriously complex. Trade-offs along several
dimensions make the task difficult:
·
Generic vs. task-specific annotation. Annotating images without having a
specific goal or task in mind is often not cost effective: after the target
application has been developed, it may turn out that
images have been annotated using the wrong type of information, on the wrong
abstraction level, etc. Redoing the annotations is then an unavoidable, but
costly, solution. On the other hand, annotating with only the target application in mind may also not be cost effective.
The annotations may work well with that one application, but if the same
metadata is to be reused in the context of other applications, it may turn out
to be too specific, and unsuited for reuse in a
different context. In most situations the range of applications in which the
metadata will be used in the future is unknown at the time of annotation.
Lacking a crystal ball, the best the annotator can do in practice is to use an
approach that is sufficiently specific for the application under development, while avoiding unnecessary
application-specific assumptions as much as possible.
·
Manual versus automatic annotation and the
"Semantic Gap" . In
general, manual annotation can provide image desciptions
at the right level of abstraction. It is, however, time consuming and thus
expensive. In addition, it proves to be highly subjective: different human
annotators tend to "see" different things in the same image. On the
other hand, annotation based on automatic feature extraction is relatively fast
and cheap, and free of human bias. It tends to result, however, in image
descriptions that are too low level for many applications. The difference
between the low level feature descriptions provided by image analysis tools and
the high level content descriptions required by the applications is often
referred to, in the literature, as the Semantic
Gap. In the remainder, we will discuss use cases, vocabularies and
tools for both manual and automatic image annotation.
·
Different vocabularies for different types
of metadata. While various classifications of metadata have been
described in the literature, every annotator should at least be aware of the
difference between annotations describing properties of the image itself, and
those describing the subject matter of the image, that is, the properties of
the objects, persons or concepts depicted by the image. In the first category,
typical annotations provide information about title, creator, resolution, image
format, image size, copyright, year of publication, etc. Many applications use
a common, predefined and relatively small vocabulary defining such properties.
Examples include the Dublin Core and VRA Core
vocabularies [add refs]. The second category describes what is depicted by the
image, which can vary wildly with the type of image at hand. As a result, one
sees a large variation in vocabularies used for this purpose. Typical examples
vary from domain-specific vocabularies (for example, with terms that are very
specific for astronomy images, or sport images, etc) to domain-independent ones
(for example, a vocabulary with terms that are sufficiently generic to describe
any news photo). In addition, vocabularies tend to differ in size, granularity,
formality, etc. In the remainder, we discuss the above metadata categories. Note
that for the first type it is not uncommon that a vocabulary only defines the
properties and defers the definitions of the values of those properties to
another vocabulary. This is true, for example, for both Dublin Core and VRA
Core. This means that typically, in order to annotate a single image, one needs
terms from multiple vocabularies (see the sketch following this list).
·
Lack of syntactic and semantic
interoperability. Many different file formats and tools for
image annotations are currently in use. Reusing metadata developed for one set of tools in another is often
hindered by a lack of interoperability. First, different tools use different
file formats, so tool A may not be able to read in the metadata provided by
tool B (syntactic interoperability). Solving this problem is relatively easy
if the inner structures of both file formats are known: one can develop a
conversion tool. Second, tool A may assign a different meaning
to the same annotation than tool B does (semantic interoperability). Solving this
problem is much harder and can be done automatically only when the semantics of
the vocabulary used is explicitly defined for both tools.
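To illustrate how terms from multiple vocabularies combine in practice, the sketch below annotates one image with Dublin Core properties for the image itself and a term from a second vocabulary for its subject matter. It is written in Turtle; the example URIs and the ex: vocabulary are hypothetical, while dc: is the real Dublin Core namespace.

  @prefix dc: <http://purl.org/dc/elements/1.1/> .   # Dublin Core (real)
  @prefix ex: <http://example.org/art-vocab#> .      # hypothetical domain vocabulary

  <http://example.org/images/scan042.jpg>
      dc:title   "Self-portrait with straw hat (digital scan)" ;  # metadata about the image itself
      dc:format  "image/jpeg" ;
      dc:subject ex:SelfPortrait .                                # subject matter, from a second vocabulary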
TO BE DONE [By Jacco]
While much of the current work in this area is not (yet) based on
Semantic Web languages and technology, we believe that using them has many potential
advantages.
Recent advances in digital technologies (cameras, computers, storage,
communication, etc.) have caused a huge increase in the amount of digital multimedia
information captured, stored and distributed by personal users over the Web. Digital formats
now provide the cheapest, safest and easiest way to broadly capture, store and
deliver multimedia content. Most personal users have thousands of photos (from
vacations, parties, traveling, conferences, everyday life, etc.), usually stored in
several resolutions on the hard disks of their computers in a simple directory
structure without any metadata. Ideally, users want to easily access this
content, view it, create presentations, use it on their homepages, deliver it over
the Internet to other people, make part of it accessible to other people, or
even sell part of it to image banks. Unfortunately, the only way for
this content to be accessed is by browsing the directories, whose names
usually provide the date and describe in one or two words the original event
captured by the specific photos. Obviously, this access becomes more and more
difficult as the number of photos increases every day, and very
soon the content will become practically inaccessible.
The best solution to this problem, covering almost all present and future
uses of the content, is to describe and archive each photo with the aid
of a tool that provides a semantic metadata structure using Semantic Web
technologies (see, for example, the tools below). Using such tools,
users can access their photos via virtual views (taxonomies), via
keywords describing their content, or via administrative information (such as the date
of archiving or the resolution of the photo). Most importantly, the
standardization of the metadata format ensures that the content remains accessible
to other people and can be shared and used over the Web.
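A minimal sketch of what such a photo description could look like in Turtle, assuming a hypothetical personal vocabulary my: alongside the real Dublin Core namespace:

  @prefix dc: <http://purl.org/dc/elements/1.1/> .  # Dublin Core (real)
  @prefix my: <http://example.org/photo-terms#> .   # hypothetical personal taxonomy

  <file:///photos/2005/beach-017.jpg>
      dc:date        "2005-07-14" ;                  # administrative metadata
      dc:description "Kids building a sand castle" ; # content keyword
      my:event       my:SummerVacation2005 ;         # taxonomy node used as a virtual view
      my:resolution  "1600x1200" .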
A fine-arts museum has asked a specialized company to produce high-resolution
digital scans of the most important art works in its collections.
The museum's quality assurance requires the possibility to track when, where
and by whom every scan was made, with what equipment, etc. The museum's
internal IT department, which maintains
the underlying image database, needs the size, resolution and format of every
resulting image. It also needs to know the repository ID of the original work
of art. The company developing the museum's website additionally requires
copyright information (which varies for every scan, depending on the age of the
original work of art and the collection it originates from). It also wants to
give the users of the website access to the collection, not only based on the
titles of the paintings and the names of their painters, but also based on the
topics depicted ('sunsets'), genre ('self-portraits'), style
('post-impressionism'), period ('fin de siècle') or
region ('West European').
To be done [by Jane].
Audiovisual archives manage very large multimedia databases. For
instance, INA, the French Audiovisual National Institute, has been archiving TV
documents for 50 years and radio documents for 65 years, and stores more than 1
million hours of broadcast programs. The images and sound archives kept at INA
are intended, among others, for professional use (journalists, film directors,
producers, audiovisual and multimedia programmers and publishers).
A media production house requires several web services in order to organise and implement its projects. Usually,
pre-production and production start from location, people, image and footage
search and retrieval, in order to speed up the process and reduce the cost of
the production as much as possible. For that reason, several multimedia
archives (image and video banks, location management databases, casting houses,
etc.) provide the above information through the Web. Every day, media producers,
location managers, casting managers, etc. look in these archives in
order to find the appropriate resources for their projects. The quality of this
search and retrieval process directly affects the quality of the service that
the archives provide to their users. In order to facilitate this process,
the annotation of image content should make use of Semantic Web
technologies, while also following the multimedia standards in order to be
interoperable with other archives. Using, for example, the tools described below,
the people who archive content in the media production chain can provide all
the necessary information (administrative, structural and descriptive metadata)
in a standard form (RDF, OWL) that is easily
accessible to other people over the Web. Using the Semantic Web standards, the
archiving, search and retrieval processes can then make use of semantic
vocabularies (ontologies) describing information about
the content, from thematic categories to descriptions
of the main objects appearing in the content and their main visual
characteristics. In this way, multimedia archives make their content
easily accessible over the Web, providing a unified framework for media
production resource allocation.
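For example, a single footage record published by such an archive might look as follows. This is a sketch only: the loc: vocabulary and all URIs are hypothetical, while dc: is the real Dublin Core namespace.

  @prefix dc:  <http://purl.org/dc/elements/1.1/> .      # Dublin Core (real)
  @prefix loc: <http://example.org/production-terms#> .  # hypothetical production vocabulary

  <http://archive.example.org/footage/0117>
      dc:type    "image" ;                     # administrative metadata
      dc:rights  "Archive X, licensed per use" ;
      loc:theme  loc:CoastalVillage ;          # descriptive: thematic category
      loc:region "Crete, Greece" .             # descriptive: location information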
Many organizations maintain extremely large-scale image collections. The
National Aeronautics and Space Administration (NASA) is such an example, which
has hundreds of thousands of images, stored in different formats, levels of
availability and resolution, and with associated descriptive information at
various levels of detail and formality. Such an organization also generates
thousands of images on an ongoing basis that are collected and cataloged. Thus,
a mechanism is needed to catalog all the different types of image content
across various domains. Information about both the image itself (e.g., its
creation date, dpi, source) and about the specific
content of the image is required. Additionally, the associated metadata must be
maintainable and extensible so that associated relationships between images and
data can evolve cumulatively. Lastly, management functionality should provide
mechanisms flexible enough to enforce restrictions based on content type,
ownership, authorization, etc.
This section needs to be moved to a future section about solutions to
use cases.
One possible solution for such image management requirements is an
annotation environment that enables users to annotate information about images
and/or their regions using concepts in ontologies
(OWL and/or RDFS). More specifically, subject matter experts
will be able to assert metadata elements about images and their specific
content. Multimedia related ontologies can be used to
localize and represent regions within particular images. These regions can then
be related to the image via a depiction/annotation property. This functionality
can be provided, for example, by the MINDSWAP
digital-media ontology (to represent image regions), in conjunction with FOAF (to assert image depictions). Additionally, in order
to represent the low level image features of regions, the aceMedia
Visual Descriptor Ontology can be used.
Existing toolkits, such as PhotoStuff and M-OntoMat-Annotizer, currently provide graphical environments
to accomplish the tasks as defined above. Using such tools, users can load
images, create regions around parts of the image, automatically extract
low-level features of selected regions (via M-OntoMat-Annotizer),
assert statements about the selected regions, etc. Additionally, the resulting
annotations can be exported as RDF/XML, thus allowing
them to be shared, indexed, and used by advanced annotation-based browsing and
search environments.
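A hedged Turtle sketch of the region-depiction pattern described above: the foaf:depicts property is real FOAF, but the dmedia: class and property names are assumptions standing in for the actual MINDSWAP digital-media ontology terms, and all other URIs are hypothetical.

  @prefix foaf:   <http://xmlns.com/foaf/0.1/> .                       # FOAF (real)
  @prefix dmedia: <http://www.mindswap.org/2005/owl/digital-media#> .  # term names below are assumed
  @prefix ex:     <http://example.org/people#> .                       # hypothetical

  <http://example.org/photos/group.jpg> a foaf:Image ;
      dmedia:hasRegion _:r1 .                          # assumed region-linking property

  _:r1 a dmedia:ImageRegion ;                          # assumed region class
      dmedia:coords "120,45 260,45 260,200 120,200" ;  # assumed polygon encoding
      foaf:depicts ex:JaneDoe .                        # the region depicts a person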
The "Multimedia Content
Description" standard, widely known as MPEG-7 aims to be the standard
for describing any multimedia content. MPEG-7 standardizes tools or ways
to define multimedia Descriptors (Ds), Description Schemes (DSs) and the relationships between them. The descriptors
correspond to the data features themselves, generally low-level features such
as visual (e.g. texture, camera motion) or audio (e.g. melody), while the
description schemes refer to more abstract description entities. These tools as
well as their relationships are represented using the Description Definition
Language (DDL), the core part of the language.
The W3C XML Schema recommendation has been adopted as the most appropriate
schema for the MPEG-7 DDL. Note that several
extensions (array and matrix datatypes) have been
added in order to satisfy specific MPEG-7 requirements.
The set of MPEG-7 XML Schemas defines
1182 elements, 417 attributes and 377 complex types, which is usually seen as a
source of difficulty when managing MPEG-7 descriptions. Moreover, several works have
already pointed out the standard's lack of formal semantics, which could
extend the traditional text descriptions into
machine-understandable ones. The attempts that aim to bridge the gap between
the multimedia community and the Semantic Web are detailed below.
Link: http://maenad.dstc.edu.au/slittle/mpeg7.owl
Summary: Chronologically the first one, this MPEG-7 ontology was first
developed in RDFS [1],
then converted into DAML+OIL, and is now available in
OWL. This is an OWL Full ontology (note: except for
the correction of three small mistakes inside the OWL file: the &xsd;nil values
should be replaced by &rdf;nil, otherwise the file is not valid OWL).
The ontology covers the upper part of the Multimedia Description Scheme
(MDS) part of the MPEG-7 standard. It consists of about 60 classes and 40
properties.
References:
Link: http://elikonas.ced.tuc.gr/ontologies/av_semantics.zip.
Summary: Starting from the previous
ontology, this MPEG-7 ontology covers the full Multimedia Description Scheme
(MDS) part of the MPEG-7 standard. It contains 420 classes and 175 properties.
This is an OWL DL ontology.
References:
Link: http://dmag.upf.edu/ontologies/mpeg7ontos/.
Summary: This MPEG-7 ontology has
been produced fully automatically from the MPEG-7 standard in order to give it
a formal semantics. For this purpose, a generic XSD2OWL mapping has been
implemented. The definitions of the XML Schema types and elements of the ISO
standard have been converted into OWL definitions according to the table given
in [3]. This ontology could then serve as a top ontology, thus easing the
integration of other, more specific ontologies such as
MusicBrainz. The authors have also proposed to
transform the XML data (instances of MPEG-7) automatically into RDF triples (instances of this top ontology).
This ontology aims to cover the
whole standard and is thus the most complete one (with respect to those
previously mentioned). It contains 2372 classes and 975 properties. This is an OWL
Full ontology, since it employs the rdf:Property construct to cope with the fact that there are properties that have
both datatype and object ranges.
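The flavour of the XSD2OWL mapping can be sketched in Turtle as follows. The type and element names (VideoSegmentType, MediaTime) come from MPEG-7, but the mpeg7: namespace URI and the exact generated definitions are assumptions, not the actual ontology output.

  @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:   <http://www.w3.org/2002/07/owl#> .
  @prefix mpeg7: <http://example.org/mpeg7#> .  # hypothetical namespace for the generated ontology

  # An XML Schema complexType such as "VideoSegmentType" becomes an OWL class.
  mpeg7:VideoSegmentType a owl:Class .

  # Elements are mapped to rdf:Property (rather than owl:DatatypeProperty or
  # owl:ObjectProperty) because some elements have both datatype and object
  # ranges; this is what pushes the resulting ontology into OWL Full.
  mpeg7:MediaTime a rdf:Property ;
      rdfs:domain mpeg7:VideoSegmentType ;
      rdfs:range  mpeg7:MediaTimeType .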
References:
Link: store this ontology on CWI for ease
of reference?
Summary: This ontology is not really an MPEG-7 ontology,
since it does not cover the whole standard. It is rather a core audio-visual
ontology inspired by several terminologies, either standardized (like MPEG-7
and TV-Anytime) or still under development (ProgramGuideML).
Furthermore, this ontology benefits from the practices of the French INA
institute, the British BBC and the Italian RAI
channels, which have also developed complete terminologies for describing radio
and TV programs.
This core ontology currently contains 1100 classes and 220
properties and is represented in OWL Full.
References:
The MPEG-7 standard is divided into several parts reflecting the various
media one can find in multimedia content. This section focuses
on various attempts to design ontologies that
correspond to the visual part of the standard.
Link:http://www.acemedia.org/aceMedia/reference/resource/index.html,
the current version is 9.0.
Summary: The Visual Descriptor Ontology (VDO),
developed within the aceMedia project for semantic
multimedia content analysis and reasoning, contains representations of MPEG-7
visual descriptors and models concepts and properties that describe the visual
characteristics of objects. By the term descriptor we mean a specific
representation of a visual feature (color, shape, texture, etc.) that defines the
syntax and the semantics of a specific aspect of the feature. For example, the
dominant color descriptor specifies, among other things, the number and value of
the dominant colors present in a region of interest and the percentage of
pixels that each associated color value has. Although the construction of
the VDO is tightly coupled with the specification of the MPEG-7
Visual part, several modifications were carried out in order to adapt the
XML Schema provided by MPEG-7 to an ontology and to the
datatype representations available in RDF Schema.
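A sketch of what a dominant color descriptor instance could look like in Turtle; the vdo: names are assumptions for illustration and may differ from the actual VDO terms.

  @prefix vdo: <http://example.org/vdo#> .          # hypothetical stand-in for the VDO namespace
  @prefix ex:  <http://example.org/annotations#> .  # hypothetical

  ex:region42 vdo:hasDescriptor ex:domColor42 .     # attach a descriptor to an image region

  ex:domColor42 a vdo:DominantColorDescriptor ;
      vdo:colorValue "128 64 32" ;                  # one dominant color (RGB)
      vdo:percentage "0.37" .                       # fraction of region pixels with this color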
References:
Link: http://www.mindswap.org/2005/owl/digital-media.
Summary:
References:
Link: http://www.cs.vu.nl/~laurah/VO/visualWordnetschema2a.rdfs.
Summary:
References:
TO BE DONE: Categorisation of important tools
[by Nikolaos]:
1) Type of content: JPEG images, video, etc.
2) Type of metadata: descriptive, administrative, structural, etc.
3) Format of metadata: OWL, RDF
4) Annotation level: extraction of visual characteristics and association
with domain ontology concepts; operation using ontologies
5) Operation mode: plug-in, stand-alone
6) Open source: yes, no
Suggestions by Jane:
7) Collaborative or individual
8) Granularity - file-based or segment-based (and sub-categories of types of segmentation)
9) Threaded or unthreaded (By threaded I mean the ability to
respond or add to a previous annotation and to stagger/structure
the presentation of annotations to reflect this.)
10) Access controlled or open access
Tool 1
Type of content: Images
Type of metadata: Administrative
Format of metadata: FOAF/RDF
Annotation level: Low
Operation mode: Web based
Open source: No (Flickr API)
Collaborative or individual: Individual
Granularity: Segment based
Threaded or unthreaded: Unthreaded
Access controlled or open access: Open access

Tool 2
Type of content: Images
Type of metadata: Administrative/Structural
Format of metadata: RDF
Annotation level: Low
Operation mode: Stand-alone
Open source: Yes
Collaborative or individual: Individual
Granularity: File based
Threaded or unthreaded: Unthreaded
Access controlled or open access: Open access

Tool 3
Type of content: Images and videos
Type of metadata: All
Format of metadata: RDF
Annotation level: High
Operation mode: Stand-alone
Open source: No
Collaborative or individual: Collaborative
Granularity: Segment based
Threaded or unthreaded: Threaded
Access controlled or open access: Open access

Tool 4
Type of content: Images
Type of metadata: Administrative/Structural
Format of metadata: RDF
Annotation level: Low
Operation mode: Stand-alone
Open source: Yes
Collaborative or individual: Individual
Granularity: File based
Threaded or unthreaded: Unthreaded
Access controlled or open access: Open access

Tool 5
Type of content: Images
Type of metadata: All
Format of metadata: RDF
Annotation level: Low
Operation mode: Stand-alone
Open source: Yes
Collaborative or individual: Individual
Granularity: Segment based
Threaded or unthreaded: Threaded
Access controlled or open access: Open access

Tool 6
Type of content: Images
Type of metadata: Administrative
Format of metadata: RDF
Annotation level: Low
Operation mode: Web based
Open source: No
Collaborative or individual: Individual
Granularity: File based
Threaded or unthreaded: Unthreaded
Access controlled or open access: Open access
TO BE DONE: Rewrite this section in terms of solutions to the use cases
based on the vocabularies and tools described above. Template proposed by
Raphael and Jacco:
a) how to localize some parts of target media content
b) how to characterize the annotation link between the annotation
and the media (distinction between work and representation, à la VRA)
c) how to distinguish the domain specific part and the multimedia
part of the annotation => different ontologies
d) which annotation tools should be used for which purpose
[the work should be done primarily by use case
owners, helped by others]
TO BE DONE: Short description and categorisation of important relevant work
TO BE DONE: Short description and categorisation of important projects and events