Multimedia ontologies will play an important role in the effort to integrate multimedia documents into the Semantic Web. In contrast to usual web ontologies, multimedia ontologies consist of three distinct parts. The first part is the standard ontological part found in all web ontologies, which includes class, property, and restriction definitions. The second part is the visual description part, where multimedia documents are given a visual description based on an MPEG-7 visual ontology. Finally, the third part is the actual raw data of the multimedia document. Clearly, multimedia ontologies introduce new issues in the task of (multimedia) ontology alignment that need to be tackled. The alignment of the standard ontological part of multimedia ontologies can be handled by the usual ontology alignment techniques developed in the literature. These techniques can be classified into the following categories:
\begin{itemize}
\item \textbf{Terminological Methods:} Find similarities among entity names through string metrics, dictionaries, or other linguistic techniques (a minimal sketch is given at the end of this section).
\item \textbf{Internal Structure:} Refine the similarity between two entities by a portion of the similarities between their properties and restrictions.
\item \textbf{External Structure:} Refine the similarity between two entities by a portion of the similarities between their neighboring entities.
\item \textbf{Semantic Methods:} Rely on formal languages and model-theoretic semantics.
\end{itemize}
The visual description of multimedia documents, such as images or videos, is usually performed with the aid of the MPEG-7 ISO standard. This standard defines a number of Descriptors with which the visual characteristics of a multimedia document can be described. For example, the Visual part of the MPEG-7 standard includes Color descriptors, with which one can describe the color layout or structure of a region or specify its dominant color, Shape descriptors, with which one can describe the contour or shape of a region, and many more. In order to match such descriptions, new techniques have to be devised that use the multimedia descriptions to extract similarity degrees between two entities. The proposed way to extract similarities between two visual descriptions is a visual matching algorithm, which takes as input two sets of visual descriptions based on the MPEG-7 visual description standard and returns the similarity of the two descriptions. The proposed way to achieve this is based on a back-propagation neural network with a single hidden layer. The network's input consists of the low-level descriptions of the two multimedia objects, while its output is the normalized distance between the two objects, based on all available descriptors. A training set is constructed from the descriptors of a set of manually labelled atom regions and the descriptors of the corresponding object models. The network is trained under the assumption that the distance of an atom region in the training set is minimal for the associated object and maximal for all others (the second sketch below illustrates this setup).
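As an illustration of the terminological methods listed above, the following is a minimal sketch of a string-metric name matcher based on normalized edit distance. It is not tied to any particular alignment system; the \texttt{edit\_distance} and \texttt{name\_similarity} functions are hypothetical helpers, and a real matcher would additionally consult dictionaries or other linguistic resources.
\begin{verbatim}
# Minimal sketch: terminological matching via normalized edit distance.
def edit_distance(a, b):
    # Standard Levenshtein distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    # Similarity in [0, 1] between two entity names.
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(name_similarity("ColorLayout", "ColourLayout"))  # about 0.92
\end{verbatim}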
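The visual matching network itself might be sketched as follows. This is only an assumption-laden illustration: the descriptor dimensionality, hidden-layer size, sigmoid activations, squared-error loss, learning rate, and the \texttt{VisualMatcher} class name are all choices made here for concreteness and are not fixed by the text.
\begin{verbatim}
# Sketch of the single-hidden-layer back-propagation matcher:
# two concatenated descriptor vectors in, distance in [0, 1] out.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VisualMatcher:
    def __init__(self, desc_dim, hidden=16, lr=0.5):
        self.lr = lr
        self.W1 = rng.normal(0.0, 0.1, (2 * desc_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, hidden)
        self.b2 = 0.0

    def _forward(self, d1, d2):
        x = np.concatenate([d1, d2])
        h = sigmoid(x @ self.W1 + self.b1)
        y = sigmoid(h @ self.W2 + self.b2)
        return x, h, y

    def distance(self, d1, d2):
        return self._forward(d1, d2)[2]

    def train_step(self, d1, d2, target):
        # target = 0.0 for a region paired with its associated object
        # model, target = 1.0 for every other (region, model) pair.
        x, h, y = self._forward(d1, d2)
        dy = (y - target) * y * (1.0 - y)      # squared-error gradient
        dh = dy * self.W2 * h * (1.0 - h)
        self.W2 -= self.lr * dy * h
        self.b2 -= self.lr * dy
        self.W1 -= self.lr * np.outer(x, dh)
        self.b1 -= self.lr * dh

# Hypothetical usage with 4-dimensional toy descriptors.
matcher = VisualMatcher(desc_dim=4)
region = rng.random(4)
model_pos = region + 0.05 * rng.random(4)  # the associated object model
model_neg = rng.random(4)                  # some other object model
for _ in range(1000):
    matcher.train_step(region, model_pos, 0.0)
    matcher.train_step(region, model_neg, 1.0)
print(matcher.distance(region, model_pos))  # should be near 0
print(matcher.distance(region, model_neg))  # should be near 1
\end{verbatim}
Training thus presents each (region, model) descriptor pair with target 0 when the region is labelled with that object and target 1 otherwise, matching the minimum/maximum distance assumption stated above.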