2018 CSUN Abstract on Spoken Presentation

From: Hakkinen, Mark T <mhakkinen@ets.org>
Date: Wed, 19 May 2021 14:41:04 +0000
To: Janina Sajka <janina@rednote.net>, "public-pronunciation@w3.org" <public-pronunciation@w3.org>
Message-ID: <0A018086-091D-4422-92D4-3D6A52107340@ets.org>
Integrating SSML in HTML for Improved Spoken Rendering by AT: Next Steps


Markku Hakkinen (mhakkinen@ets.org)

Irfan Ali (iali@ets.org)

Cary Supalo (csupalo@ets.org)

Educational Testing Service, Princeton, New Jersey, USA


Extended Abstract

The W3C Speech Synthesis Markup Language (SSML) (W3C, 2004) is seeing growing use in consumer-oriented products such as the Amazon Echo and Google Home, and in speech-based services such as Microsoft Cortana. A key benefit of SSML is its use to improve the quality of spoken presentation. However, we have yet to see any significant uptake of SSML by the assistive technology (AT) community. SSML enables content authors to control speech characteristics such as prosody and rate, pronunciation via phonetic text strings, pausing, numeric value handling, and other features. Hakkinen et al. (2017) discussed the importance of SSML for ensuring pronunciation accuracy in the context of educational assessment and learning materials, and proposed that it was time for the assistive technology community to look at ways to enhance the quality of content spoken by text-to-speech (TTS) synthesizers. In the absence of standards-based alternatives, SSML remains the only viable technical standard.
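For illustration, the speech controls listed above correspond to SSML 1.0 elements such as prosody, break, say-as, and phoneme. The fragment below is our own illustrative example, not drawn from actual assessment content:

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  The test begins in
  <break time="500ms"/>
  <say-as interpret-as="cardinal">5</say-as> minutes.
  <prosody rate="slow">Read each question carefully.</prosody>
  Pronounce <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme> consistently.
</speak>
```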

Hakkinen (2016) put forward several possible methods to integrate SSML into HTML content for easy consumption by assistive technologies, and Hakkinen et al. (2017) presented arguments for and against these methods. With no practical, standard mechanism for including SSML directly in HTML, the authors concluded that an attribute model held the most promise, and this presentation reports on the continuing research and development effort to refine this model.


We propose a technical method for applying SSML speech controls and properties that uses a JSON structure assigned as the value of a new attribute, currently named data-ssml. This attribute may be applied to an HTML element containing textual content, for example the span element. For assistive technologies that render text into a spoken form using TTS, the AT would be responsible for querying the data-ssml attribute. If the attribute is present, the AT would construct, from the JSON structure, the SSML markup that “wraps” the textual content of the element, and then send the resulting XML string to the TTS engine used by the AT. This assumes that the TTS engine in use by the AT natively supports SSML and can consume it.
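As a minimal sketch of that wrapping step, the function below builds an SSML string from an element's text and a parsed data-ssml value. The JSON schema shown (property name mapped to SSML element attributes) and the function name are illustrative assumptions, not the prototype's actual code:

```javascript
// Sketch: build an SSML string from a data-ssml JSON value.
// Example spec: {"say-as": {"interpret-as": "ordinal"}}
function buildSSML(text, spec) {
  // Escape XML-special characters in the element's text content.
  const escaped = String(text).replace(/[&<>]/g,
    (c) => ({ "&": "&amp;", "<": "&lt;", ">": "&gt;" }[c]));
  // Wrap the text in one SSML element per JSON property, e.g.
  // {"say-as": {"interpret-as": "ordinal"}} ->
  // <say-as interpret-as="ordinal">...</say-as>
  let body = escaped;
  for (const [element, attrs] of Object.entries(spec)) {
    const attrString = Object.entries(attrs)
      .map(([name, value]) => ` ${name}="${value}"`)
      .join("");
    body = `<${element}${attrString}>${body}</${element}>`;
  }
  return `<speak>${body}</speak>`;
}
```

An AT could then call buildSSML(span.textContent, JSON.parse(span.getAttribute("data-ssml"))) on each annotated element before handing the resulting string to its TTS engine.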

While data-attributes provide an easy way to incorporate new features into HTML content for use by applications that can query the browser DOM, the question remains as to whether the data-attribute model will be accepted by AT developers. One avenue that remains under exploration is proposing the SSML attribute as a future addition to WAI-ARIA. While not technically necessary, the WAI-ARIA approach has implications for how the SSML structure would be handled in the context of accessibility APIs. We expect this presentation will help spur that discussion.

We have now developed a functional, open-source prototype that implements the JSON processing in JavaScript. The prototype, which provides a functional test interface, runs in modern browsers such as Chrome, Firefox, Safari, and Edge. On platforms where TTS engines natively support SSML, the prototype processes the JSON-specified SSML into XML strings sent to the TTS engine. On platforms with no native SSML support, for example Apple macOS, we developed an initial solution that maps a core set of SSML functions to the native Apple speech commands. This capability allows content to be authored with SSML and rendered both on systems supporting the SSML standard and on systems without it, using functionally equivalent speech markup.
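The mapping idea can be sketched as follows. Apple's synthesizer reads inline embedded commands such as [[rate n]] (speaking rate in words per minute) and [[slnc ms]] (silence in milliseconds); the JSON schema and function below are illustrative assumptions, not the prototype's actual mapping:

```javascript
// Sketch: map a couple of SSML-style JSON properties to Apple
// embedded speech commands, for platforms without native SSML.
function toAppleSpeech(text, spec) {
  let prefix = "";
  if (spec.prosody && spec.prosody.rate) {
    // SSML prosody rate -> [[rate wpm]]; a numeric wpm value assumed.
    prefix += `[[rate ${spec.prosody.rate}]] `;
  }
  if (spec.break && spec.break.time) {
    // SSML <break time="500ms"/> -> leading silence in milliseconds.
    prefix += `[[slnc ${parseInt(spec.break.time, 10)}]] `;
  }
  return prefix + text;
}
```

A fuller mapping would need to handle nested scopes and reset commands after each span; this sketch only prefixes the text.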

The prototype will be demonstrated with a variety of HTML content examples of SSML usage, on both Windows and Mac platforms. We also plan to demonstrate a “read aloud” tool reference prototype that implements the JSON processing. We will then discuss the next steps, and challenges, in seeking adoption of this proposed approach by content authors and the AT community.

References

Hakkinen, M. (2016). SSML In HTML: Issues for Assessment. Retrieved from the Web: https://github.com/mhakkinen/SSML-issues/blob/master/README.md


Hakkinen, M., Supalo, C., Cavalie, C., & Hoffmann, T. (2017). Controlling Pronunciation and Presentation of TTS: It is time to support SSML. Presentation at the CSUN Assistive Technology Conference 2017, San Diego, CA.

W3C (2004). Speech Synthesis Markup Language (SSML) Version 1.0. Retrieved from the Web: https://www.w3.org/TR/speech-synthesis/







Received on Wednesday, 19 May 2021 14:41:36 UTC
