Natural Language Generation, SSML, and Prosody

Voice Browser Working Group,

Greetings.  XML and SSML are more expressive than plain text, so it is often advantageous for a natural language generation system to emit XML output such as SSML.  Each stage of natural language generation has information available that can annotate or extend that SSML output, and the resulting SSML can then facilitate more natural-sounding speech synthesis.  Specific approaches to natural language generation may have additional data with which to annotate the SSML content for synthesis.

For example, anaphoric references can be indicated in XML, for instance with xml:id and ssml:ref attributes.  Such referential content could help synthesizers modulate prosody so as to connect nouns, pronouns, and other referential phrases.
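To make the idea concrete, here is a minimal sketch in Python of how a generator might emit such markup. It rests on assumptions: ssml:ref is the attribute proposed in this message and is not part of SSML 1.1, the "#e1" fragment-style value is only one possible convention, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML-namespace names with an "ssml:" prefix so the
# proposed attribute appears as ssml:ref in the output.
ET.register_namespace("ssml", SSML_NS)


def q(ns, name):
    """Return a namespace-qualified name in ElementTree's {uri}name form."""
    return f"{{{ns}}}{name}"


def build_sentence():
    """Build one sentence in which a pronoun points back to its antecedent."""
    speak = ET.Element(
        q(SSML_NS, "speak"),
        {"version": "1.1", q(XML_NS, "lang"): "en-US"},
    )
    s = ET.SubElement(speak, q(SSML_NS, "s"))

    # The antecedent carries an xml:id so later tokens can refer to it.
    antecedent = ET.SubElement(s, q(SSML_NS, "w"), {q(XML_NS, "id"): "e1"})
    antecedent.text = "Marie"
    antecedent.tail = " said that "

    # Assumption: ssml:ref is hypothetical and points at the antecedent's id.
    pronoun = ET.SubElement(s, q(SSML_NS, "w"), {q(SSML_NS, "ref"): "#e1"})
    pronoun.text = "she"
    pronoun.tail = " would attend."
    return speak


if __name__ == "__main__":
    print(ET.tostring(build_sentence(), encoding="unicode"))
```

A synthesizer that understood the reference could, in principle, treat "she" as given information linked to "Marie" when choosing accents.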

Combined with other data available during the stages of natural language generation, including semantic and pragmatic data, content can be synthesized in more natural-sounding ways.  Another prosody-related topic is concept introduction and concept reference.  In didactic contexts, an introduced concept is often referenced again in subsequent explanatory content.  As my research includes intelligent tutoring systems, the prosody of explanatory, didactic content is of particular interest to me.  The introduction and referencing of concepts and terminology is but one example of how the modulation of voice, prosody, plays a functional role in didactic natural language use.

"Didactic Prosody and Notetaking in L1 and L2" indicates that "occurrence of new information generally related to didactic accents and pauses" and that the "correlation is not constant, some units are more didactically marked than others because of didactic important that each unit contains."

A paper from the Speech Prosody 6th International Conference, "Perception of Spontaneous Narrative Structure", indicates that "one of the most important structuring devices in spoken discourse is prosody" and that "speakers often use prosody to structure the flow of information in discourse."

As one specific idea to get the conversation started, a <section> element could be added to SSML, giving the structural elements <section>, <p>, <s>, and <w>.  Such <section> elements could be nested, following an outline structure.  An extended set of attributes could then be defined for these structural elements, drawing on information from the stages of natural language generation, to facilitate enhanced prosodic synthesis.
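As a rough illustration of the nesting, the following Python sketch turns an outline into markup. It is only a sketch under stated assumptions: <section> is the element proposed here and does not exist in SSML 1.1, the outline dictionary layout and function names are hypothetical, and the SSML namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def section_to_ssml(node):
    """Convert one outline node into a (hypothetical) <section> element.

    A node is a dict with optional keys:
      "paragraphs": a list of paragraphs, each a list of sentence strings
      "sections":   a list of child nodes with the same layout
    """
    section = ET.Element("section")  # proposed element, not in SSML 1.1
    for paragraph in node.get("paragraphs", []):
        p = ET.SubElement(section, "p")
        for sentence in paragraph:
            s = ET.SubElement(p, "s")
            s.text = sentence
    # Child sections nest inside their parent, mirroring the outline.
    for child in node.get("sections", []):
        section.append(section_to_ssml(child))
    return section


def outline_to_ssml(outline):
    """Wrap the top-level outline node in a <speak> root element."""
    speak = ET.Element("speak", {"version": "1.1"})
    speak.append(section_to_ssml(outline))
    return speak


if __name__ == "__main__":
    outline = {
        "paragraphs": [["Prosody structures spoken discourse."]],
        "sections": [
            {"paragraphs": [["New concepts are often marked by accents."]]},
        ],
    }
    print(ET.tostring(outline_to_ssml(outline), encoding="unicode"))
```

Because the outline depth is preserved in the markup, a synthesizer could in principle vary pauses or pitch reset by nesting level, which is the kind of attribute-driven behavior suggested above.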

Kind regards,
Adam Sobieski

Received on Tuesday, 9 October 2012 10:07:49 UTC