LLF UC -> SMR UC

Dear colleagues,



Below you will find an elaboration of the use case currently called "Low-level feature extraction".

I do not feel very comfortable with the title.

I suggest renaming it to "Semantic Multimedia Retrieval".

Any other suggestion is welcome.

My effort focuses on highlighting the required knowledge representation and the subsequent semantic interoperability.

I would be very grateful for your fruitful feedback and, more particularly, for your active participation, which may enhance not only the "motivating examples" but also help identify interesting corresponding "possible solutions".



Best regards,



Ioannis



+======================================================+
Ioannis PRATIKAKIS, Dipl. Eng., Ph.D
Research Scientist
 
Computational Intelligence Laboratory (http://www.iit.demokritos.gr/cil/)
Institute of Informatics and Telecommunications
National Center for Scientific Research "Demokritos"
P.O. BOX   60228
GR-153 10 Agia Paraskevi, Athens, Greece.
Tel:     +30-210-650 3183
Fax:    +30-210-653 2175
E-mail: ipratika@iit.demokritos.gr
+======================================================+






--------------------------------------------------------------------------------






Use Case: Semantic Multimedia Retrieval

 

 

Introduction

 

In multimedia document retrieval, using only low-level features, as in "retrieval by example", has the advantage that the required low-level features can be computed automatically, but it is inadequate for answering high-level queries. For this, an abstraction towards a high-level description of the multimedia content is required. In particular, the MPEG-7 standard, which provides metadata descriptors for the structural and low-level aspects of multimedia documents, needs to be properly linked to domain-specific ontologies that model high-level semantics.
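
To make the kind of linking meant here concrete, the following minimal sketch (in Python, purely for illustration) attaches a hypothetical domain-ontology concept URI to an MPEG-7-style numeric descriptor of an image region. The URIs, field names and values are invented for this example and are not part of MPEG-7 or of any existing ontology.

    # Illustrative sketch only: an MPEG-7-style low-level description of an image
    # region, explicitly linked to a high-level concept in a domain ontology.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class RegionDescription:
        region_id: str                # identifier of the still region within the image
        dominant_color: List[float]   # MPEG-7-like low-level descriptor (numeric vector)
        concept_uri: str              # link to a high-level domain-ontology concept

    desc = RegionDescription(
        region_id="image42#region3",
        dominant_color=[0.81, 0.12, 0.07],
        concept_uri="http://example.org/football#RedTeamShirt",  # hypothetical URI
    )
    print(desc.concept_uri)  # the high-level semantics a query can now be matched against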

 

Furthermore, since cross-linking between different media types, or the corresponding modalities, offers rich scope for inferring a semantic interpretation, interoperability between the different single-media schemes is an important issue.

This is because each single modality (i) can support the inference of particular high-level semantics with different degrees of confidence; (ii) can be supported by a world model (or ontologies) in which different relationships exist, e.g. in an image one can attribute spatial relationships, while in a video sequence spatio-temporal relationships can be captured; and (iii) can play a different role in a cross-modality setting, where one modality triggers the other, e.g. to identify that a particular photo on a Web page depicts person X, we first extract information from the text and thereafter cross-validate it with the information extracted from the image.

 

Both of the above concerns, whether single-modality or cross-modality, require semantic interoperability that supports both a knowledge representation and a multimedia analysis part directed towards the automatic extraction of semantics from multimedia.

 

Motivating example

 

In the following, current pitfalls with respect to the desired semantic interoperability are illustrated through examples.

 

Example 1

Linking low-level features to high-level semantics can be achieved by following two main trends: (i) using machine learning techniques to infer the required mapping, and (ii) using ontology-driven approaches both to guide the semantic analysis and to infer high-level concepts through reasoning.

 

In both of the above trends, it is appropriate, at least for a certain granularity of concepts, to produce concept detectors after a training phase based upon feature sets; a minimal sketch of such a detector is given after the list below. To enable semantic interoperability, however, it is not sufficient simply to allow the exchange of low-level features between different users. This holds because a low-level description (e.g. MPEG-7) is not meaningful on its own, since:

 

- there is a lack of intuitive interpretation: the MPEG-7 descriptors are represented as vectors of numerical values;

- even in the ideal case, such a system would only be able to support content-based matching within a large collection of a single modality;

- if a user needs to store descriptors for particular objects in an image rather than for the complete image content, the user must know how to select the particular object in the image; if the selection is not appropriate (e.g. it includes neighbouring objects with colours different from those of the object), the resulting descriptor will mislead the analysis part.
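
As a concrete illustration of trend (i), the following minimal Python sketch trains a concept detector from low-level feature vectors. The feature values, the example concept "beach" and the choice of scikit-learn are assumptions made purely for illustration; real feature vectors would come from MPEG-7 visual descriptors.

    # Illustrative sketch of trend (i): learn a mapping from low-level features
    # to a high-level concept after a training phase based upon feature sets.
    from sklearn.linear_model import LogisticRegression

    # Each row is a low-level descriptor (e.g. a colour/texture vector) for one image.
    features = [
        [0.9, 0.8, 0.1],
        [0.8, 0.7, 0.2],
        [0.1, 0.2, 0.9],
        [0.2, 0.1, 0.8],
    ]
    labels = [1, 1, 0, 0]  # 1 = concept "beach" present, 0 = concept absent

    detector = LogisticRegression().fit(features, labels)

    # The detector maps a new low-level description to the concept with a confidence value.
    confidence = detector.predict_proba([[0.85, 0.75, 0.15]])[0][1]
    print(f"P(beach) = {confidence:.2f}")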

 

Example 2

 

Single modalities (text, image, video, 3D graphics, audio) can support the inference of particular high-level semantics with different degrees of confidence, ranging from zero to 100%. When an author X would like to express a feeling (happiness, sadness, anger, etc.), it is highly likely that he uses the modalities with the following priority: audio, text, video, image. This priority needs to be imposed in order to enhance the semantic compatibility between the concept and the modality. The need for an imposed priority is further reinforced by the fact that different modalities can be supported by a world model (or ontologies) in which different relationships exist, e.g. in an image one can attribute spatial relationships, while in a video sequence spatio-temporal relationships can be captured.
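
The following minimal Python sketch illustrates how such a priority could be applied when extracting a "feeling" concept. The priority order is the one suggested above, while the confidence values and the acceptance threshold are invented placeholders.

    # Illustrative sketch only: pick the modality used to extract a "feeling" concept
    # according to an imposed priority, weighted by each detector's reported confidence.
    MODALITY_PRIORITY = ["audio", "text", "video", "image"]  # highest priority first

    def select_modality(confidences: dict) -> str:
        """Return the highest-priority modality whose detector is confident enough."""
        for modality in MODALITY_PRIORITY:
            if confidences.get(modality, 0.0) >= 0.5:  # hypothetical acceptance threshold
                return modality
        return "undetermined"

    # Confidence with which each modality's analysis detected the concept "happiness".
    observed = {"audio": 0.86, "text": 0.74, "video": 0.40, "image": 0.15}
    print(select_modality(observed))  # -> "audio"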

 

Example 3

 

A principle similar to the one expressed in Example 2 concerns the cross-modality content that a user has to deal with. It is again motivated by the analysis part: for particular concepts there should be a priority determining the order in which concept extraction is applied to the individual modalities.

For example, to support the recognition of the face of a particular person in an image on a Web page, which is a very difficult task, it seems natural to first draw an inference from the textual content and thereafter validate it against the semantics extracted from the image.
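
The following Python sketch illustrates this text-first, image-validation-second pipeline. The two helper functions stand in for hypothetical text-analysis and face-recognition components; they are not real library calls, and the names, caption and confidence values are invented for illustration.

    # Illustrative sketch only: the text modality triggers and constrains the image analysis.

    def names_mentioned_in(caption: str) -> list:
        """Hypothetical text analysis: return person names found in the caption."""
        known_people = ["Person X", "Person Y"]
        return [name for name in known_people if name in caption]

    def face_matches(image_path: str, name: str) -> float:
        """Hypothetical face verification: confidence that the named person appears."""
        return 0.9 if name == "Person X" else 0.1  # placeholder value

    def who_is_depicted(caption: str, image_path: str, threshold: float = 0.5) -> list:
        # Step 1: infer candidate persons from the textual content.
        candidates = names_mentioned_in(caption)
        # Step 2: cross-validate each candidate against the image content.
        return [n for n in candidates if face_matches(image_path, n) >= threshold]

    print(who_is_depicted("Person X shaking hands with Person Y", "page_photo.jpg"))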

 

 

 

Possible solution

 

Example 1 (The solution)


 

 

 Example 2 (The solution)

 

 

Example 3 (The solution)

 

 

 

Conclusions

 

 

References

 

Received on Thursday, 12 October 2006 13:22:12 UTC