RE: Media synchronization - wiki page from Scott Hollier on 2020-09-30 (public-rqtf@w3.org from September 2020)

From: Scott Hollier <scott@hollier.info>
Date: Wed, 30 Sep 2020 06:40:17 +0000
To: "public-rqtf@w3.org" <public-rqtf@w3.org>
Message-ID: <BN6PR01MB220969FC96C34D4DA1D3C9A6DC330@BN6PR01MB2209.prod.exchangelabs.com>

To the RQTF

Following on from our discussion last week, I thought this excerpt was particularly interesting and might help to build on Steve’s discussion.

Source: https://cdn.ttgtmedia.com/searchUnifiedCommunications/downloads/VideoConf_CH07.pdf


*** BEGIN QUOTE
Understanding Lip Sync Skew

Lip sync is the general term for audio/video synchronization, and literally refers to the fact that visual lip movements of a speaker must match the sound of the spoken words. If the video and audio displayed at the receiving endpoint are not in sync, the misalignment between audio and video is referred to as skew. Without a mechanism to ensure lip sync, audio often plays ahead of video, because the latencies involved in processing and sending video frames are greater than the latencies for audio.

Human Perceptions

User-perceived objection to unsynchronized media streams varies with the amount of skew; for instance, a misalignment of audio and video of less than 20 milliseconds (ms) is considered imperceptible. As the skew approaches 50 ms, some viewers will begin to notice the audio/video mismatch but will be unable to determine whether video is leading or lagging audio. As the skew increases, viewers detect that video and audio are out of sync and can also determine whether video is leading or lagging audio. At this point, the video/audio offset distracts users from the video conference. When the skew approaches one second, the video signal provides no benefit; viewers will ignore the video and focus on the audio.

Human sensitivity to skew differs greatly from person to person. For the same audio/video skew, one person might be able to detect that one stream is clearly leading another stream, whereas another person might not be able to detect any skew at all. A research paper published by the IEEE reveals that most viewers are more sensitive to audio/video misalignment when audio plays before the corresponding video, because hearing the spoken word before seeing the lips move is more “unnatural” to a viewer (Blakowski and Steinmetz 1996). Sensitivity to skew is also determined by the frame rate and resolution: Viewers are more sensitive to skew when watching higher video resolution or higher frame rate.
Report IS-191 issued by the Advanced Television Systems Committee (ATSC) recommends guidelines for maximum skew tolerances for broadcast systems to achieve acceptable quality. The guidelines model the end-to-end path by assuming that a single encoder at the distribution center receives both audio and video streams, digitizes the streams, assigns time stamps, encodes the streams, and then sends the encoded data over a network to a receiver.

The guidelines specify that on the sending side, at the input to the encoder, the audio should not lead the video by more than 15 ms and should not lag the video by more than 45 ms. This possible lead or lag might arise from uncertainty in the latencies through the digitizing/capture hardware and occurs before the encoder assigns time stamps to the digitized media streams.

At the receiving side, the receiver plays the audio and video streams according to time stamps assigned by the encoder. But again, there is an uncertainty in the latency of each stream through the playout hardware. The guidelines stipulate that for each stream, this uncertainty should not exceed ±15 ms; this tolerance is an absolute tolerance that applies to each stream.

Based on these guidelines, two requirements emerge for acceptable lip sync tolerance:

- Criterion for leading audio: In the worst-case-permitted scenario, audio leads video at the input to the encoder by 15 ms. The receiver plays the audio stream too far ahead by 15 ms while playing the video stream too far behind by 15 ms.
*** END QUOTE
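
As a rough illustration of the arithmetic in the quote above, the sketch below adds up the stated tolerances. Only the 15 ms lead limit, the 45 ms lag limit, and the ±15 ms per-stream playout tolerance come from the quoted text. The summed totals, the lagging-audio case (the excerpt as quoted stops after the leading-audio criterion), the sign convention, and all function and constant names are assumptions made for illustration.

```python
# Sketch only: the three constants are the figures quoted above; everything
# else (names, sign convention, summed totals) is an illustrative assumption.

ENCODER_MAX_AUDIO_LEAD_MS = 15  # audio may lead video by at most 15 ms at the encoder input
ENCODER_MAX_AUDIO_LAG_MS = 45   # audio may lag video by at most 45 ms at the encoder input
PLAYOUT_TOLERANCE_MS = 15       # +/-15 ms latency uncertainty per stream at the receiver


def worst_case_audio_lead_ms() -> int:
    """Encoder lead, plus audio played early, plus video played late."""
    return ENCODER_MAX_AUDIO_LEAD_MS + 2 * PLAYOUT_TOLERANCE_MS


def worst_case_audio_lag_ms() -> int:
    """Assumed mirror case: encoder lag, plus audio played late, plus video played early."""
    return ENCODER_MAX_AUDIO_LAG_MS + 2 * PLAYOUT_TOLERANCE_MS


def within_end_to_end_tolerance(skew_ms: float) -> bool:
    """Check a measured skew against the summed worst cases.

    Assumed sign convention: positive means audio leads video,
    negative means audio lags video.
    """
    return -worst_case_audio_lag_ms() <= skew_ms <= worst_case_audio_lead_ms()
```

Under these assumptions the worst permitted audio lead at the viewer is 45 ms and the worst permitted lag is 75 ms, so within_end_to_end_tolerance(40) holds while a 60 ms audio lead does not.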

Also, here’s an article that builds on Janina’s comments a few weeks ago about language interpretation. It’s a PDF and I’m having trouble accessing all its contents, but the abstract looks interesting.
https://www.researchgate.net/publication/257436740_Assessing_the_importance_of_audiovideo_synchronization_for_simultaneous_translation_of_video_sequences


Thanks everyone,

Scott.


Dr Scott Hollier
Digital Access Specialist
Mobile: +61 (0)430 351 909
Web: http://www.hollier.info/

Technology for everyone

Keep up with digital access news by following @scotthollier on Twitter (https://twitter.com/scotthollier) and subscribing to Scott’s newsletter (mailto:newsletter@hollier.info?subject=subscribe).

From: White, Jason J <jjwhite@ets.org>
Sent: Wednesday, 30 September 2020 4:14 AM
To: public-rqtf@w3.org
Subject: Media synchronization - wiki page

Dear colleagues,

I have updated the wiki page to correct markup issues that I unintentionally introduced earlier. Also, as I recall, there are further references to add that were discussed at the meeting last week.
https://www.w3.org/WAI/APA/task-forces/research-questions/wiki/Media_Synchronization_Requirements


We should probably document our observations on that page as well.


Received on Wednesday, 30 September 2020 06:40:34 UTC