Image annotation on the Semantic Web
Editors' Draft $Date: 2005/09/30 15:04:34 $ $Revision: 1.13 $
$Authors: jrvosse, gstam, raphael$
- This version:
- N/A
- Latest version:
- N/A
- Previous version:
- N/A
- Editors:
- TO BE REVISED AT THE END
- Giorgos Stamou, IVML, National Technical University of Athens,
<gstam@softlab.ece.ntua.gr>
- Jacco van Ossenbruggen, Center for Mathematics and Computer
Science (CWI), <Jacco.van.Ossenbruggen@cwi.nl>
- Raphaël Troncy, Center for Mathematics and Computer Science
(CWI), <Raphael.Troncy@cwi.nl>
- Jeff Pan, University of Manchester, <pan@cs.man.ac.uk>
- Additional Contributors and Special Thanks to:
- TO BE REVISED AT THE END
- Jane Hunter, DSTC, <jane@dstc.edu.au>
- Guus Schreiber, VU,<schreiber@cs.vu.nl>
- John Smith, IBM, <rsmith@watson.ibm.com>
- Jeremy Carroll, HP, <jjc@hplb.hpl.hp.com>
- Vassilis Tzouvaras, IVML, National Technical University of
Athens, <tzouvaras@image.ece.ntua.gr>
- Nikolaos Simou, IVML, National Technical University of Athens,
<nsimou@image.ece.ntua.gr>
- Christian Halaschek-Wiener, UMD, <halasche@cs.umd.edu>
Copyright
© 2003 W3C® (MIT, ERCIM,
Keio), All Rights Reserved. W3C liability,
trademark
and document
use rules apply.
Many applications that involve multimedia content make use of some
form of metadata that describe this content. The present
document aims at providing guidelines for using Semantic Web languages
and technologies in order to create, store, manipulate, interchange and
process
image metadata. It gives a number of use cases to exemplify the use of
Semantic Web technology for image annotation, an overview of RDF and
OWL vocabularies developed for this task and an overview of relevant
tools.
Note that many approaches to image annotation predate Semantic Web
technology. Interoperability between these technologies and RDF and
OWL-based approaches will be addressed in a future document.
Institutions and organizations with research and standardization
activities in the area of multimedia, professional (museums, libraries,
audiovisual archives, media production and broadcast industry, image
and video banks) and non-professional (end-users) multimedia
annotators.
- Collect currently used vocabularies for multimedia annotations
(like Dublin Core, VRA, ...)
- Provide use cases with examples of multimedia annotations using
the above vocabularies
- Investigate existing tools and other formats (ID3, EXIF, XMP etc)
Status of this document
This is a public Working Draft of a Working
Group Note produced by the Multimedia
Annotation in the Semantic Web Task Force of the W3C Semantic Web Best
Practices & Deployment Working Group, which is part of the W3C Semantic Web activity.
Discussion of this document is invited on the public mailing list public-swbp-wg@w3.org (public
archives). Public comments should include "comments: [MM]" at the
start of the Subject header.
Publication as a Working Group Note does not imply endorsement by
the W3C Membership. This is a draft document and may be updated,
replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in progress.
Other documents may supersede this document.
TO BE DONE: Corrections, delete and add material
1. Introduction
Before starting any image annotation project, one should be aware
that image annotation is notoriously difficult. Trade-offs along
several dimensions complicate the task:
- Manual versus automatic annotation and the "Semantic Gap".
In general, manual annotation can provide image descriptions
at the right level of abstraction. It is, however, time consuming and
thus expensive. In addition, it proves to be highly subjective:
different human annotators tend to "see" different things in the same
image. On the other hand, annotation based on automatic feature
extraction is relatively fast and cheap, and free of human bias. It
tends to result, however, in image descriptions that are too low level
for many applications. The difference between the low level feature
descriptions provided by image analysis tools and the high level
content descriptions required by the applications is often referred
to in the literature as the Semantic Gap. In the remainder,
we will
discuss use cases, vocabularies and tools for both manual and automatic
image annotation.
- Different vocabularies for different types of metadata.
While various classifications of metadata have been described in the
literature, every annotator should at least be aware of the difference
between annotations describing properties of the image itself, and
those describing the subject matter of the image, that is, the
properties of the objects, persons or concepts depicted by the image.
In the first category, typical annotations provide information about
title, creator, resolution, image format, image size, copyright, year
of publication, etc. Many applications use a common, predefined and
relatively small vocabulary defining such properties. Examples include
the Dublin Core and VRA Core vocabularies [add refs]. The second
category describes what is depicted by the image, which can vary wildly
with the type of image at hand. As a result, one sees a large variation
in vocabularies used for this purpose. Typical examples vary from
domain-specific vocabularies (for example, with terms that are very
specific for astronomy images, or sport images, etc) to
domain-independent ones (for example, a vocabulary with terms that are
sufficiently generic to describe any news photo). In addition,
vocabularies tend to differ in size, granularity, formality etc. In the
remainder, we discuss both metadata categories. Note that for the
first category it is not uncommon for a vocabulary to define only the
properties and to defer the definitions of the values of those
properties to another vocabulary. This is true, for example, for both
Dublin Core and VRA Core. As a consequence, annotating even a single
image typically requires terms from multiple vocabularies.
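As a hedged illustration of mixing vocabularies (the image URI and the ex: subject vocabulary are invented for this sketch), an annotation covering both metadata categories might look as follows in RDF/Turtle: the Dublin Core properties describe the image itself, while the dc:subject value is drawn from a separate, domain-specific vocabulary:

```turtle
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/subjects/> .   # hypothetical subject vocabulary

<http://example.org/images/sunset42.jpg>
    dc:title   "Sunset over the harbour" ;
    dc:creator "Jane Painter" ;
    dc:format  "image/jpeg" ;      # property of the image itself
    dc:subject ex:Sunset .         # what the image depicts
```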
2. Use Cases
Use case: Cultural Heritage
A museum in fine arts has asked a specialized company to produce
high resolution digital scans of the most important art works of their
collections. The museum's quality assurance requires the possibility to
track when, where and by whom every scan was made, with what equipment,
etc. The museum's internal IT department, maintaining the underlying
image database, needs the size, resolution and format of every
resulting image. It also needs to know the repository ID of the
original work of art. The company developing the museum's website
additionally requires copyright information (that varies for every
scan, depending on the age of the original work of art and the
collection it originates from). It also wants to give the users of the
website access to the collection, not only based on the titles of the
paintings and names of their painters, but also based on the topics
depicted ('sun sets'), genre ('self portraits'), style
('post-impressionism'), period ('fin de siècle'), region ('Western
European').
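As a hedged sketch (the URIs and the museum: vocabulary are invented for this example; a real deployment would more likely use VRA Core terms), the provenance and subject requirements in this use case might be captured as:

```turtle
@prefix dc:     <http://purl.org/dc/elements/1.1/> .
@prefix museum: <http://example.org/museum/> .   # hypothetical museum vocabulary

<http://example.org/scans/scan-0042.tiff>
    dc:creator    "ScanCorp Ltd." ;              # who made the scan
    dc:date       "2005-06-15" ;                 # when it was made
    dc:rights     "© Museum of Fine Arts" ;      # copyright information
    dc:format     "image/tiff" ;
    museum:scanOf museum:work-1887-012 ;         # repository ID of the original
    dc:subject    museum:SelfPortrait .          # depicted genre
```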
Use case: Television news archive
Audiovisual archives manage very large multimedia databases. For
instance, INA, the French Audiovisual National Institute, has been
archiving TV documents for 50 years and radio documents for 65 years
and stores more than 1 million hours of broadcast programs. The
images and sound archives kept at INA are either intended for
professional use (journalists, film directors, producers, audiovisual
and multimedia programmers and publishers, in France and worldwide)
or communicated for research purposes (for a public of students,
research workers, teachers and writers). To allow efficient access to
the stored data, most parts of these video documents are described
and indexed by their content. The global multimedia information
system must then be sufficiently fine-grained to support very complex
and precise queries. For example, a journalist or a film director
might ask for an excerpt of a previously broadcast program showing
the first goal a given football player scored with his head for his
national team. The query could additionally contain more technical
requirements, such as that the goal action be available from both the
front camera view and the reverse-angle camera view. Finally, the
client might or might not remember some general information about the
football game, such as the date, the place and the final score.
Use case: Media Production Services
A media production house requires several web services in order to
organise and implement its projects. Usually, the pre-production and
production start from location, people, image and footage search and
retrieval in order to speed up the process and reduce as much as
possible the cost of the production. For that reason, several
multimedia archives (image and video banks, location management
databases, casting houses etc) provide the above information through
the web. Every day, media producers, location managers, casting managers
etc, are looking in the above archives in order to find the appropriate
resources for their project. The quality of this search and retrieval
process directly affects the quality of the service that the archives
provide to their users. To facilitate this process, the annotation of
image content should make use of Semantic Web technologies and follow
the relevant multimedia standards, so as to be interoperable with
other archives and thus provide a unified framework for media
production resource allocation.
3. Vocabularies
The "Multimedia Content Description"
standard, widely known as MPEG-7 aims to be the standard for describing
any multimedia content. MPEG-7 standardizes tools or ways to
define multimedia Descriptors (Ds), Description Schemes
(DSs) and the relationships between them. The descriptors correspond to
the data features themselves, generally low-level features such as
visual (e.g. texture, camera motion) or audio (e.g. melody), while the
description schemes refer to more abstract description entities. These
tools as well as their relationships are represented using the
Description Definition Language (DDL), the core part of the standard.
The W3C XML Schema recommendation has been adopted as the basis of
the MPEG-7 DDL, with several extensions (array and matrix datatypes)
added to satisfy specific MPEG-7 requirements.
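As a schematic illustration of the DDL in use (the media URI and coefficient values are invented, and the structure is simplified), an MPEG-7 description of a still image carrying one low-level visual Descriptor might look like:

```xml
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="ImageType">
      <Image>
        <MediaLocator>
          <MediaUri>http://example.org/images/sunset42.jpg</MediaUri>
        </MediaLocator>
        <!-- a low-level visual Descriptor (D) -->
        <VisualDescriptor xsi:type="ScalableColorType" numOfCoeff="16"
                          numOfBitplanesDiscarded="0">
          <Coeff>1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16</Coeff>
        </VisualDescriptor>
      </Image>
    </MultimediaContent>
  </Description>
</Mpeg7>
```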
The set of MPEG-7 XML Schemas defines 1182
elements, 417 attributes and 377 complex types, which is usually seen
as a difficulty when managing MPEG-7 descriptions. Moreover, several works
have already pointed out the lack of formal semantics of the standard
that could extend the traditional text descriptions into
machine-understandable ones. These attempts, which aim to bridge the
gap between the multimedia community and the Semantic Web, are
detailed below.
Link: http://maenad.dstc.edu.au/slittle/mpeg7.owl
Summary: Chronologically the first, this MPEG-7 ontology was
initially developed in RDFS [1], then converted into DAML+OIL, and is
now available in OWL. It is an OWL Full ontology (note: except for
the correction of three small mistakes inside the OWL file: each
&xsd;nil should be replaced by &rdf;nil, otherwise the file is not
valid OWL).
The ontology covers the upper part of the Multimedia Description
Scheme (MDS) part of the MPEG-7 standard. It consists of about 60
classes and 40 properties.
References:
Link: http://elikonas.ced.tuc.gr/ontologies/av_semantics.zip
Summary: Starting from the previous ontology, this MPEG-7 ontology
covers the full Multimedia Description Scheme (MDS) part of the
MPEG-7 standard. It contains 420 classes and 175 properties. This is an
OWL DL ontology.
References:
Link: http://dmag.upf.edu/ontologies/mpeg7ontos/
Summary:
This MPEG-7 ontology has been produced fully automatically from the
MPEG-7 standard in order to give it a formal semantics. For such a
purpose, a generic mapping XSD2OWL has been implemented. The
definitions of the XML Schema types and elements of the ISO standard
have been converted into OWL definitions according to the table given
in [3]. This ontology can then serve as a top ontology, thus easing
the integration of other more specific ontologies such as
MusicBrainz. The authors have also proposed to transform the XML data
(instances of MPEG-7) automatically into RDF triples (instances of
this top ontology). This ontology aims to cover the whole standard
and is thus the most complete one (with respect to the ones
previously mentioned). It contains 2372 classes and 975 properties.
This is an OWL Full ontology, since it employs the rdf:Property
construct to cope with the fact that some properties have both
datatype and object type ranges.
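As a hedged sketch of why such a mapping lands in OWL Full (the names below are hypothetical and not taken from the actual ontology), an XML Schema element that can carry both simple content and element content ends up as a plain rdf:Property, since neither owl:DatatypeProperty nor owl:ObjectProperty alone can express both ranges:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/mpeg7owl#> .   # hypothetical namespace

# An XSD complexType is mapped to an OWL class ...
ex:TextAnnotationType a owl:Class .

# ... but an element used with both text content and element content
# must become a plain rdf:Property (hence OWL Full), as it needs both
# datatype and object values in its range.
ex:FreeTextAnnotation a rdf:Property .
```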
References:
- [3] Semantic Integration and Retrieval of Multimedia Metadata: R.
Garcia and O. Celma. In Proc. of the 5th International Workshop on
Knowledge Markup and Semantic Annotation (SemAnnot 2005) to be held
with ISWC 2005, Galway, Ireland, 7 November 2005.
Link: TO BE DONE (store this ontology on CWI for ease of reference?)
Summary: This
ontology is not really an MPEG-7 ontology since it does not cover the
whole standard. It is rather a core audio-visual ontology inspired by
several terminologies, either standardized (like MPEG-7 and TV Anytime)
or still under development (ProgramGuideML). Furthermore, this ontology
benefits from the practices of the French INA institute, the English
BBC and the Italian RAI channels, which have also developed a complete
terminology for describing radio and TV programs.
This core ontology currently contains
1100 classes and 220 properties, and it is represented in OWL Full.
References:
- [4] Designing and Using an
Audio-Visual Description Core Ontology: A. Isaac and R. Troncy. In Workshop
on Core Ontologies in Ontology Engineering held in conjunction with the
14th International Conference on Knowledge Engineering and Knowledge
Management (EKAW'04), Whittlebury Hall, Northamptonshire, UK, 8
October 2004.
- [5] Integrating
Structure and Semantics into Audio-visual Documents: R. Troncy. In Proc.
of the 2nd International Semantic Web Conference (ISWC'03), LNCS
2870, pages 566-581, Sanibel Island, Florida, USA, 21-23 October 2003.
The MPEG-7 standard is divided into several
parts
reflecting the various media one can find in multimedia content. This
section focuses on various attempts to design ontologies that
correspond to the visual part of the standard.
Link: http://www.acemedia.org/aceMedia/reference/resource/index.html,
the current version is 9.0.
Summary:
The Visual Descriptor Ontology (VDO) developed within the aceMedia
project for semantic multimedia content analysis and reasoning,
contains representations of MPEG-7 visual descriptors and models
Concepts and Properties that describe visual characteristics of
objects. By the term descriptor we mean a specific representation of a
visual feature (color, shape, texture etc) that defines the syntax and
the semantics of a specific aspect of the feature. For example, the
dominant color descriptor specifies, among other things, the number
and value of the dominant colors present in a region of interest and
the percentage of pixels associated with each color value. Although
the construction of the VDO is tightly coupled with the specification
of the MPEG-7 Visual part, several modifications were carried out in
order to adapt the XML Schema provided by MPEG-7 to an ontology and
to the datatype representations available in RDF Schema.
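To make the idea concrete (the class and property names below are invented for illustration and do not come from the actual VDO), a dominant-color annotation of an image region could be expressed in RDF along these lines:

```turtle
@prefix vdo: <http://example.org/vdo#> .   # hypothetical namespace
@prefix ex:  <http://example.org/> .

ex:region-17 a vdo:StillRegion ;
    vdo:hasDescriptor [
        a vdo:DominantColor ;
        vdo:colorValue      "128 64 32" ;   # RGB value of one dominant color
        vdo:colorPercentage "0.42"          # fraction of pixels with this color
    ] .
```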
References:
- [6] Semantic
Annotation of Images and Videos for Multimedia Analysis:
S. Bloehdorn, K. Petridis, C. Saathoff, N. Simou, V. Tzouvaras, Y.
Avrithis, S. Handschuh, I. Kompatsiaris, S. Staab, and M. G. Strintzis.
In Proc. of the 2nd European Semantic Web Conference (ESWC 2005),
Heraklion, Greece, May 2005.
Link: http://www.mindswap.org/2005/owl/digital-media.
Summary:
References:
- [7]
A Flexible Approach for Managing Digital Images on the Semantic Web: C.
Halaschek-Wiener, A. Schain, J. Golbeck, M. Grove, B. Parsia and J.
Hendler. In Proc. of the 5th International Workshop on Knowledge
Markup and Semantic Annotation (SemAnnot 2005) to be held with ISWC 2005,
Galway, Ireland, 7 November 2005.
Link: http://www.cs.vu.nl/~laurah/VO/visualWordnetschema2a.rdfs.
Summary:
References:
- [8] Building a Visual Ontology for Video Retrieval: L. Hollink,
M. Worring and G. Schreiber. In Proc. of the ACM Multimedia,
Singapore, November 2005.
4. Examples of image annotations on the
Semantic Web
TO BE DONE: Short description and categorisation of the image
annotations
- Photo annotation and social networking
- Introduction & background reading (see Easy
Image Annotation for the Semantic Web, ILRT Tech report)
- FOAF
co-depiction "Co-depiction is simply the state of being depicted in
the same picture as someone else. We're cataloguing this using FOAF RDF
documents, sharing and collecting these in the Web, as a way of
documenting in a visual way some connections between people."
- w3photo "envisions a
royalty-free archive of conference pictures from WWW1 to Today --
searchable by the Semantic Web and ready for your tools". It uses
various vocabularies, including Dublin Core, FOAF, CYC, Creative Commons,
FotoNotes etc.
- CONFOTO "is an
experimental sharing and annotation service for conference photos. It
utilizes common RDF vocabularies (dc, foaf, rev, cc, ical, w3photo) to
combine simple tagging with rich annotations (e.g. depicted persons,
related events, ratings). RDF data is accessible via SPARQL, URIQA, or
a link at the bottom of each page."
- FotoNotes "The goal of
the Fotonotes specification is to make it significantly easier for
individuals and groups to share meaningful information about (a) what
is visually depicted within the photograph and (b) what is contextually
(and/or personally) significant about what is (or is not) visually
represented."
- Other image annotation projects
- Combining RDF and MPEG7
- Introduction and background reading (IEEE Multimedia papers,
Part I
and Part
II)
- Jane Hunter et al on annotation of fusion cell images (see SWWS
2001 paper, ISWC04
paper, etc)
- Troncy et al on combining XML and RDF for audio visual
archiving within INA (see ISWC2003
paper) (note: AV-annotation is also relevant for images)
- Chrisa Tsinaraki et al on MPEG-7 and OWL (Coupling OWL with
MPEG-7 and TV-Anytime for Domain-specific Multimedia Information
Integration and Retrieval, see
RIAO 2004 paper)
- Using RDF for describing visual resources in the art domain
- Embedding RDF image annotations in other formats
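The co-depiction idea above can be sketched in a few triples (the person names and image URI are invented for this example): an image resource simply foaf:depicts more than one person, and co-depiction is the resulting shared link.

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

ex:alice a foaf:Person ; foaf:name "Alice" .
ex:bob   a foaf:Person ; foaf:name "Bob" .

# Alice and Bob are co-depicted: the same image depicts both of them.
<http://example.org/photos/www2004-dinner.jpg>
    foaf:depicts ex:alice , ex:bob .
```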
5. Tools
TO BE DONE: Short description and categorisation of important tools
- flickr2rdf
This Web-based service, by Masahide Kanzaki, uses the Flickr API to
extract metadata from Flickr's photo repository and generates an RDF
description. Flickr (http://www.flickr.com/) is an online photo
management and sharing application in which users can upload their
photos and also annotate them. The flickr2rdf service mainly converts
the tags of a Flickr image into FOAF (Friend of a Friend) RDF.
- jpegRDF
Open source Java tool by Norman Walsh. JpegRDF reads and manipulates
RDF metadata stored in the comment section of JPEG images. It can
extract, query, and augment the data. Manipulating JPEG images with
jpegrdf does not modify the actual image data or any other sections
of the file. JpegRDF can also be used to convert EXIF to RDF.
- M-OntoMat-Annotizer
(M
stands for Multimedia) is a user-friendly tool developed inside the aceMedia project that allows the
semantic annotation of images and videos for multimedia analysis and
retrieval. It is an extension of the CREAM (CREAting Metadata for the
Semantic Web) framework and its reference implementation, OntoMat-Annotizer.
The Visual Descriptor Extraction Tool (VDE) was developed as a
plug-in to OntoMat-Annotizer and is the core component for extending
its capabilities and supporting the initialization and linking of
RDF(S) domain ontologies with low-level MPEG-7 visual descriptors.
- PHP JPEG
Metadata Toolkit The PHP JPEG Metadata Toolkit is a library of
functions which allows manipulation of many types of metadata that
reside in a JPEG image file. Its main advantages are that it has been
tested with over 450 popular digital cameras, it provides access to a
lot of metadata for which PHP has no built-in support, it works with
many files that have corrupted metadata, and it also works with PHP 4
and does not require the EXIF extension to be enabled.
- PhotoStuff.
This image annotation tool for the Semantic Web allows users to
annotate images and the contents of specific regions in images
according to several OWL ontologies of any domain. Moreover, the
metadata embedded inside the JPEG files is converted into RDF. The
annotations can then be published and
shared on the Web.
- SCHEMA This
tool integrates a non-normative part of the MPEG-7 standard and the
MPEG-7 XM software for extracting, coding and storing standardized
descriptors in the database, based on the output of the analysis
modules. The system can also support high-level (semantic)
descriptors and the integration of visual media indexing and retrieval
with other modalities (like text and audio based indexing and
retrieval).
- SWAD
The tool is written in Javascript and uses RESTful web services to
access remote information. It is designed to be a quick and easy means
of creating structured information about images, including who or what
is depicted in the image; where and when it was created; creator and
licensing information. The aim is to create and enable the reuse of
alternative formats for both text and images for use in an
accessibility context, although the potential application is much wider.
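Several of the tools above (jpegRDF, the PHP JPEG Metadata Toolkit) rely on the fact that a JPEG file is a sequence of tagged segments, so RDF stored in a comment (COM, marker 0xFFFE) segment can be located with a simple marker scan. The following Python sketch, which is not taken from any of these tools, illustrates the idea:

```python
import struct

def extract_jpeg_comments(data: bytes) -> list:
    """Return the payloads of all COM (0xFFFE) segments in JPEG bytes."""
    comments = []
    i = 2  # skip the SOI marker (0xFFD8)
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            break                      # not a marker: malformed stream
        marker = data[i + 1]
        if marker in (0xD9, 0xDA):     # EOI, or SOS (entropy-coded data follows)
            break
        # segment length is big-endian and includes its own two bytes
        length = struct.unpack(">H", data[i + 2:i + 4])[0]
        if marker == 0xFE:             # COM segment: keep the payload
            comments.append(data[i + 4:i + 2 + length])
        i += 2 + length
    return comments
```

For example, a minimal in-memory JPEG consisting of SOI, one COM segment and EOI yields exactly that comment's payload; a tool like jpegRDF would then parse the extracted bytes as RDF.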
TO BE DONE: Short description and categorisation of important
relevant work
- flickr
- EXIF "stands for
Exchangeable Image File Format, and is a standard for storing
interchange information in image files, especially those using JPEG
compression. Most digital cameras now use the EXIF format. The format
is part of the DCF standard created by JEITA to encourage
interoperability between imaging devices."
- Getty images collection,
annotations and vocabularies
- IconClass iconographic
classification system, thesaurus for describing icons and other visual
art
- Mark Davis's work on Media
Streams
- IBM's MPEG-7 work for
TRECvid
TO BE DONE: Short description and categorisation of important
projects and events
- Thanks to ...
Appendix A. Informative References
[Hunter]
[Stamou05]
G. Stamou and S. Kollias (eds).
Multimedia Content and the Semantic Web: Methods, Standards and Tools.
John Wiley & Sons Ltd, 2005.
[Troncy2003]
[Ossenbruggen04]
Jacco van Ossenbruggen, Frank Nack, and Lynda Hardman. That Obscure
Object of Desire: Multimedia Metadata on the Web (Part I). In: IEEE
Multimedia 11(4), pp. 38-48, October-December 2004.
[Ossenbruggen05]
Frank Nack, Jacco van Ossenbruggen, and Lynda Hardman. That Obscure
Object of Desire: Multimedia Metadata on the Web (Part II). In: IEEE
Multimedia 12(1), pp. 54-63, January-March 2005.