Multimedia ontologies will play an important role in the effort to integrate multimedia documents into the Semantic Web. In contrast to usual web ontologies, multimedia ontologies consist of three distinct parts. The first part is the standard ontological part found in all web ontologies, which includes class, property, and restriction definitions. The second part is the visual description part, where multimedia documents are given a visual description based on an MPEG-7 visual ontology. Finally, the third part is the actual raw data of the multimedia document. Clearly, multimedia ontologies introduce new issues in the task of (multimedia) ontology alignment that need to be tackled. The alignment of the standard ontological part of multimedia ontologies can be handled by the usual ontology alignment techniques developed in the literature. These techniques can be classified into the following categories:
\begin{itemize}
\item \textbf{Terminological Methods:} Find similarities among entity names through string metrics, dictionaries, or other linguistic techniques (a minimal sketch is given at the end of this section).
\item \textbf{Internal Structure:} Refine the similarity between two entities by a portion of the similarities between their properties and restrictions.
\item \textbf{External Structure:} Refine the similarity between two entities by a portion of the similarities between their neighboring entities.
\item \textbf{Semantic Methods:} Rely on formal languages and model-theoretic semantics.
\end{itemize}
The visual description of multimedia documents, such as images or videos, is usually performed with the aid of the MPEG-7 ISO standard. This standard defines a number of Descriptors with which the visual characteristics of a multimedia document can be described. For example, the Visual part of the MPEG-7 standard includes Color descriptors, with which one can describe the color layout or structure of a region or specify its dominant color, Shape descriptors, with which one can describe the contour or shape of a region, and many more. In order to match such descriptions, new techniques have to be devised that use the multimedia descriptions to extract similarity degrees between two entities. The proposed way to extract similarities between two visual descriptions is a visual matching algorithm, which takes as input two sets of visual descriptions based on the MPEG-7 visual description standard and returns the similarity of the two descriptions. The proposed way to achieve this is based on a back-propagation neural network with a single hidden layer. The network's input consists of the low-level descriptions of the two multimedia objects, while its output is the normalized distance between the two objects, based on all available descriptors. A training set is constructed from the descriptors of a set of manually labelled atom regions and the descriptors of the corresponding object models. The network is trained under the assumption that the distance of an atom region in the training set is minimal for the associated object and maximal for all others (the second sketch below illustrates this setup).
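As an illustration of the terminological methods listed above, the following is a minimal sketch of a string-metric name matcher based on normalized edit distance. It is not tied to any particular alignment system; the \texttt{edit\_distance} and \texttt{name\_similarity} functions are hypothetical helpers, and a real matcher would additionally consult dictionaries or other linguistic resources.
\begin{verbatim}
# Minimal sketch: terminological matching via normalized edit distance.
def edit_distance(a, b):
    # Standard Levenshtein distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    # Similarity in [0, 1] between two entity names.
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(name_similarity("ColorLayout", "ColourLayout"))  # about 0.92
\end{verbatim}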
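The visual matching network itself might be sketched as follows. This is only an assumption-laden illustration: the descriptor dimensionality, hidden-layer size, sigmoid activations, squared-error loss, learning rate, and the \texttt{VisualMatcher} class name are all choices made here for concreteness and are not fixed by the text.
\begin{verbatim}
# Sketch of the single-hidden-layer back-propagation matcher:
# two concatenated descriptor vectors in, distance in [0, 1] out.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class VisualMatcher:
    def __init__(self, desc_dim, hidden=16, lr=0.5):
        self.lr = lr
        self.W1 = rng.normal(0.0, 0.1, (2 * desc_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, hidden)
        self.b2 = 0.0

    def _forward(self, d1, d2):
        x = np.concatenate([d1, d2])
        h = sigmoid(x @ self.W1 + self.b1)
        y = sigmoid(h @ self.W2 + self.b2)
        return x, h, y

    def distance(self, d1, d2):
        return self._forward(d1, d2)[2]

    def train_step(self, d1, d2, target):
        # target = 0.0 for a region paired with its associated object
        # model, target = 1.0 for every other (region, model) pair.
        x, h, y = self._forward(d1, d2)
        dy = (y - target) * y * (1.0 - y)      # squared-error gradient
        dh = dy * self.W2 * h * (1.0 - h)
        self.W2 -= self.lr * dy * h
        self.b2 -= self.lr * dy
        self.W1 -= self.lr * np.outer(x, dh)
        self.b1 -= self.lr * dh

# Hypothetical usage with 4-dimensional toy descriptors.
matcher = VisualMatcher(desc_dim=4)
region = rng.random(4)
model_pos = region + 0.05 * rng.random(4)  # the associated object model
model_neg = rng.random(4)                  # some other object model
for _ in range(1000):
    matcher.train_step(region, model_pos, 0.0)
    matcher.train_step(region, model_neg, 1.0)
print(matcher.distance(region, model_pos))  # should be near 0
print(matcher.distance(region, model_neg))  # should be near 1
\end{verbatim}
Training thus presents each (region, model) descriptor pair with target 0 when the region is labelled with that object and target 1 otherwise, matching the minimum/maximum distance assumption stated above.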