- From: Paola Di Maio <paola.dimaio@gmail.com>
- Date: Sat, 5 Nov 2022 10:06:39 +0800
- To: W3C AIKR CG <public-aikr@w3.org>
- Message-ID: <CAMXe=SpqUOUau7QNtYoFhboznGScZ3D8yrcr=A3x03yJgf+Ehg@mail.gmail.com>
The article pointed out by Dave [1] describes how to detect deepfakes using visemes and phonemes, which are KR constructs, as explained in [2].

[1] Agarwal, Shruti, et al. "Detecting deep-fake videos from phoneme-viseme mismatches." *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops*. 2020.

*We describe a technique to detect such manipulated videos by exploiting the fact that the dynamics of the mouth shape – visemes – are occasionally inconsistent with a spoken phoneme. We focus on the visemes associated with words having the sound M (mama), B (baba), or P (papa), in which the mouth must completely close in order to pronounce these phonemes. We observe that this is not the case in many deep-fake videos. Such phoneme-viseme mismatches can, therefore, be used to detect even spatially small and temporally localized manipulations.*

[2] A. Metallinou, C. Busso, S. Lee and S. Narayanan, "Visual emotion recognition using compact facial representations and viseme information," *2010 IEEE International Conference on Acoustics, Speech and Signal Processing*, 2010, pp. 2474-2477, doi: 10.1109/ICASSP.2010.5494893. https://ieeexplore.ieee.org/abstract/document/5494893

*We derive compact facial representations using methods motivated by Principal Component Analysis and speaker face normalization. Moreover, we model emotional facial movements by conditioning on knowledge of speech-related movements (articulation).*
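The mismatch check from the abstract of [1] can be sketched in a few lines. This is only an illustration of the idea, not the authors' implementation: it assumes some upstream forced-alignment and lip-tracking pipeline has already produced per-frame (phoneme, mouth-openness) pairs, and the threshold value is an arbitrary placeholder.

```python
# Illustrative sketch of the phoneme-viseme mismatch cue from [1].
# Input: per-frame (phoneme, mouth_openness) pairs, with openness
# normalized to [0, 1]. The alignment/tracking step that would
# produce these pairs is not shown and is assumed here.

CLOSURE_PHONEMES = {"M", "B", "P"}  # bilabials: the mouth must fully close

def find_mismatches(frames, openness_threshold=0.2):
    """Return indices of frames where a bilabial phoneme is spoken
    but the mouth is not (nearly) closed -- a possible deep-fake cue."""
    return [
        i for i, (phoneme, openness) in enumerate(frames)
        if phoneme in CLOSURE_PHONEMES and openness > openness_threshold
    ]

# Example: frame 2 claims an "M" but the mouth is clearly open.
frames = [("A", 0.8), ("M", 0.1), ("M", 0.7), ("P", 0.05)]
print(find_mismatches(frames))  # [2]
```

Real detectors aggregate such per-frame cues over time, since isolated tracking noise on a single frame is not reliable evidence of manipulation.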
Received on Saturday, 5 November 2022 02:08:56 UTC