HTML Speech XG
Proposed Protocol Approach

EMMA examples etc (M. Johnston)

8/15/11

Abstract

Multimodal interfaces enable users to interact with web applications using multiple different modalities. The HTML Speech protocol, and associated HTML Speech API, are designed to enable a common multimodal user experience combining spoken and graphical interaction across browsers. The specific goal of the HTML Speech protocol is to enable a web application to utilize the same network-based speech resources regardless of the browser used to render the application. The HTML Speech protocol is defined as a sub-protocol of WebSockets [WS-PROTOCOL], and enables HTML user agents and applications to make interoperable use of network-based speech service providers, such that applications can use the service providers of their choice, regardless of the particular user agent the application is running in. The protocol bears some similarity to [MRCPv2]. However, since the use cases for HTML Speech applications are in some places considerably different from those around which MRCPv2 was designed, the HTML Speech protocol is not merely a transcript of MRCP, but shares some design concepts, while simplifying some details, and adding others. Similarly, because the HTML Speech protocol builds on WebSockets, its session negotiation and media transport needs are quite different from those of MRCP.

1. Rationale for mixing media and control in same session

Unlike MRCP, the HTML Speech protocol transports the control signals and the media itself over the same WebSocket connection. This design is motivated by simplicity and by the desire to keep the protocol within HTTP. Earlier implementations used a simple HTTP connection for speech recognition and synthesis; use cases involving continuous recognition motivated the move to WebSockets. One benefit is that limiting the protocol to WebSockets over HTTP should cause fewer problems with firewalls than a separate RTP or other connection for the media transport would.

2. EMMA Examples

2.1 1-best

Example showing a 1-best result with XML semantics within emma:interpretation. The 'interpretation' is contained within the emma:interpretation element. The 'utterance' is the value of emma:tokens and 'confidence' is the value of emma:confidence.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1"
      grammar-type="application/srgs-xml"
      ref="http://acme.com/flightquery.grxml"/>
  <emma:interpretation id="int1"
      emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:confidence="0.75"
      emma:lang="en-US"
      emma:grammar-ref="gram1"
      emma:media-type="audio/x-wav; rate:8000;"
      emma:signal="http://example.com/signals/145.wav"
      emma:tokens="flights from boston to denver"
      emma:process="http://example.com/my_asr.xml">
      <origin>Boston</origin>
      <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>

[ The grammar-type attribute on emma:grammar is from EMMA 1.1. ]
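
As an illustration of the mapping described above, the following is a minimal sketch of how a consumer might pull the utterance, confidence, and interpretation out of a 1-best EMMA result. It uses Python's standard xml.etree.ElementTree module; the function names (emma_attr, read_one_best) and the abbreviated sample document are assumptions made for this example and are not part of the protocol.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated copy of the 1-best example (illustrative only).
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
      emma:confidence="0.75"
      emma:tokens="flights from boston to denver">
    <origin>Boston</origin>
    <destination>Denver</destination>
  </emma:interpretation>
</emma:emma>"""


def emma_attr(element, name, default=None):
    """Read an EMMA-namespaced attribute such as emma:tokens."""
    return element.get(f"{{{EMMA_NS}}}{name}", default)


def read_one_best(emma_xml):
    """Return (utterance, confidence, interpretation) for a 1-best result."""
    root = ET.fromstring(emma_xml)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    utterance = emma_attr(interp, "tokens")
    confidence = float(emma_attr(interp, "confidence"))
    # Application semantics: child elements keyed by their local name.
    interpretation = {child.tag.split("}", 1)[-1]: child.text for child in interp}
    return utterance, confidence, interpretation


utterance, confidence, interpretation = read_one_best(SAMPLE)
print(utterance, confidence, interpretation)
```

Flattening the payload into a dictionary only suits flat semantics such as origin/destination; nested application markup would need a recursive walk.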

Similar example but with a JSON semantic payload rather than XML.

<emma:emma
  version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
  xmlns="http://www.example.com/example"> 
  <emma:grammar id="gram2"
      grammar-type="application/srgs-xml"
      ref="http://acme.com/pizzaorder.grxml"/>
  <emma:interpretation id="int1"
      emma:start="1087995961542"
      emma:end="1087995963542"
      emma:confidence="0.75"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="true"
      emma:function="dialog"
      emma:lang="en-US"
      emma:grammar-ref="gram2"
      emma:media-type="audio/x-wav; rate:8000;"
      emma:signal="http://example.com/signals/367.wav"
      emma:tokens="a medium coke and 3 large pizzas with pepperoni and mushrooms"
      emma:process="http://example.com/my_asr.xml">
      <emma:literal> 
        <![CDATA[
          {
            "drink": {
              "liquid": "coke",
              "drinksize": "medium"
            },
            "pizza": {
              "number": "3",
              "pizzasize": "large",
              "topping": [ "pepperoni", "mushrooms" ]
            }
          }
          ]]>
      </emma:literal> 
  </emma:interpretation> 
</emma:emma>

For EMMA 1.1 there is an attribute to specify the type of interpretation payload: emma:semantic-rep="json".
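
A consumer of a JSON payload could decode it as sketched below. This is a minimal sketch that assumes the content of emma:literal is strict JSON (property names in double quotes); the function name read_json_payload and the abbreviated sample document are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated EMMA result carrying a strict-JSON payload (illustrative only).
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1"
      emma:confidence="0.75"
      emma:tokens="a medium coke and 3 large pizzas with pepperoni and mushrooms">
    <emma:literal><![CDATA[
      {
        "drink": {"liquid": "coke", "drinksize": "medium"},
        "pizza": {"number": "3", "pizzasize": "large",
                  "topping": ["pepperoni", "mushrooms"]}
      }
    ]]></emma:literal>
  </emma:interpretation>
</emma:emma>"""


def read_json_payload(emma_xml):
    """Decode the JSON text held in emma:literal of the first interpretation."""
    root = ET.fromstring(emma_xml)
    literal = root.find(f"{{{EMMA_NS}}}interpretation/{{{EMMA_NS}}}literal")
    # The XML parser has already unwrapped the CDATA section into plain text.
    return json.loads(literal.text)


payload = read_json_payload(SAMPLE)
print(payload["pizza"]["topping"])
```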

2.2 N-best

Example showing multiple recognition results and their associated interpretations.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1"
      grammar-type="application/srgs-xml"
      ref="http://acme.com/flightquery.grxml"/>
  <emma:grammar id="gram2"
      grammar-type="application/srgs-xml"
      ref="http://acme.com/pizzaorder.grxml"/>
  <emma:one-of id="r1"
      emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:lang="en-US"
      emma:media-type="audio/x-wav; rate:8000;"
      emma:signal="http://example.com/signals/789.wav"
      emma:process="http://example.com/my_asr.xml">
    <emma:interpretation id="int1"
        emma:confidence="0.75"
        emma:tokens="flights from boston to denver"
        emma:grammar-ref="gram1">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2"
        emma:confidence="0.68"
        emma:tokens="flights from austin to denver"
        emma:grammar-ref="gram1">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
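
The N-best structure above could be consumed as in the following minimal sketch, which collects each emma:interpretation under emma:one-of and orders the hypotheses by emma:confidence. The function name read_n_best, the dictionary keys, and the abbreviated sample are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated copy of the N-best example (illustrative only).
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1"
        emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2"
        emma:confidence="0.68"
        emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def read_n_best(emma_xml):
    """Return the hypotheses under emma:one-of, highest confidence first."""
    root = ET.fromstring(emma_xml)
    one_of = root.find(f"{{{EMMA_NS}}}one-of")
    hypotheses = []
    for interp in one_of.findall(f"{{{EMMA_NS}}}interpretation"):
        hypotheses.append({
            "id": interp.get("id"),
            "utterance": interp.get(f"{{{EMMA_NS}}}tokens"),
            "confidence": float(interp.get(f"{{{EMMA_NS}}}confidence")),
            "interpretation": {c.tag.split("}", 1)[-1]: c.text for c in interp},
        })
    # Sort explicitly rather than relying on document order.
    hypotheses.sort(key=lambda h: h["confidence"], reverse=True)
    return hypotheses


results = read_n_best(SAMPLE)
for hypothesis in results:
    print(hypothesis["confidence"], hypothesis["utterance"])
```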

2.3 No-match

In the case of a no-match the EMMA result returned must be annotated as emma:uninterpreted="true".

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
    http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="interp1"
      emma:uninterpreted="true"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:process="http://example.com/my_asr.xml"/>
</emma:emma>

2.4 No-input

In the case of no-input, the EMMA interpretation returned must be annotated as emma:no-input="true" and the <emma:interpretation> element must be empty.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="int1"
      emma:no-input="true"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:process="http://example.com/my_asr.xml"/>
</emma:emma>
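
A client might distinguish the no-match and no-input cases from an ordinary result by checking the two annotations, as in this minimal sketch. The function names (make_result, classify) and the returned labels are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"


def make_result(extra_attr):
    """Build a small EMMA document whose interpretation carries one extra attribute."""
    return f"""<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1" {extra_attr}
      emma:medium="acoustic" emma:mode="voice"/>
</emma:emma>"""


def classify(emma_xml):
    """Label a single-interpretation result as 'no-input', 'no-match' or 'match'."""
    root = ET.fromstring(emma_xml)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    if interp is None:
        return "other"  # e.g. an emma:one-of container; not handled in this sketch
    if interp.get(f"{{{EMMA_NS}}}no-input") == "true":
        return "no-input"
    if interp.get(f"{{{EMMA_NS}}}uninterpreted") == "true":
        return "no-match"
    return "match"


no_match = classify(make_result('emma:uninterpreted="true"'))
no_input = classify(make_result('emma:no-input="true"'))
match = classify(make_result('emma:tokens="zoom in here"'))
print(no_match, no_input, match)
```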

2.5 Multimodal

Example showing a multimodal interpretation resulting from the combination of speech input with a mouse event passed in through a control metadata message.

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="multimodal1"
      emma:medium="acoustic tactile"
      emma:mode="voice touch"
      emma:lang="en-US"
      emma:start="1087995963542"
      emma:end="1087995964542"
      emma:process="http://example.com/myintegrator.xml">
    <emma:derived-from resource="#voice1" composite="true"/>
    <emma:derived-from resource="#touch1" composite="true"/>
    <command>
       <action>zoom</action>
       <location>
         <point>42.1345 -37.128</point>
        </location>
     </command>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="voice1"
        emma:medium="acoustic"
        emma:mode="voice"
        emma:lang="en-US"
        emma:start="1087995963542"
        emma:end="1087995964542"
        emma:media-type="audio/x-wav; rate:8000;"
        emma:tokens="zoom in here"
        emma:signal="http://example.com/signals/456.wav"
        emma:process="http://example.com/my_asr.xml">
      <command>
        <action>zoom</action>
        <location/>
      </command>
    </emma:interpretation>
    <emma:interpretation id="touch1"
        emma:medium="tactile"
        emma:mode="touch"
        emma:start="1087995964000"
        emma:end="1087995964000">
      <point>42.1345 -37.128</point>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>
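
To trace a composite interpretation back to the single-mode inputs it was built from, a consumer can follow each emma:derived-from reference into emma:derivation. The following is a minimal sketch; the function name derivation_sources and the abbreviated sample are assumptions, and the resource value is accepted with or without a leading "#".

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated copy of the multimodal example (illustrative only).
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns="http://www.example.com/example">
  <emma:interpretation id="multimodal1"
      emma:medium="acoustic tactile" emma:mode="voice touch">
    <emma:derived-from resource="#voice1" composite="true"/>
    <emma:derived-from resource="#touch1" composite="true"/>
    <command><action>zoom</action></command>
  </emma:interpretation>
  <emma:derivation>
    <emma:interpretation id="voice1" emma:medium="acoustic" emma:mode="voice"
        emma:tokens="zoom in here"/>
    <emma:interpretation id="touch1" emma:medium="tactile" emma:mode="touch">
      <point>42.1345 -37.128</point>
    </emma:interpretation>
  </emma:derivation>
</emma:emma>"""


def derivation_sources(emma_xml):
    """Return (id, mode) for each input the top-level interpretation derives from."""
    root = ET.fromstring(emma_xml)
    interp_tag = f"{{{EMMA_NS}}}interpretation"
    # Index every interpretation in the document, including those in emma:derivation.
    by_id = {el.get("id"): el for el in root.iter(interp_tag) if el.get("id")}
    composite = root.find(interp_tag)  # first direct child of emma:emma
    sources = []
    for ref in composite.findall(f"{{{EMMA_NS}}}derived-from"):
        target = by_id[ref.get("resource").lstrip("#")]
        sources.append((target.get("id"), target.get(f"{{{EMMA_NS}}}mode")))
    return sources


sources = derivation_sources(SAMPLE)
print(sources)
```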

2.6 Lattice

As an example of a word lattice, in a travel application where the destination is either "boston" or "austin" and the origin is either "portland" or "oakland", the possibilities might be represented in a lattice as follows:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
     http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:grammar id="gram1" 
    grammar-type="application/srgs-xml" 
    ref="http://acme.com/flightquery.grxml"/>
  <emma:interpretation id="interp1"
      emma:start="1087995961542"
      emma:end="1087995963542"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:confidence="0.75"
      emma:lang="en-US"
      emma:grammar-ref="gram1"
      emma:signal="http://example.com/signals/123.wav"
      emma:media-type="audio/x-wav; rate:8000;"
      emma:process="http://example.com/my_asr.xml">
     <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>
      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>
      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
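
To make the lattice concrete, the following minimal sketch expands it into the word sequences it encodes by walking arcs from the initial state to a final state. It assumes the lattice is acyclic and treats the final attribute as a space-separated list of states; the function name lattice_paths and the abbreviated sample are hypothetical.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Abbreviated copy of the lattice example (illustrative only).
SAMPLE = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="interp1" emma:medium="acoustic" emma:mode="voice">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>
      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>
      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>"""


def lattice_paths(emma_xml):
    """Enumerate every word sequence from the initial state to a final state."""
    root = ET.fromstring(emma_xml)
    lattice = root.find(f".//{{{EMMA_NS}}}lattice")
    initial = lattice.get("initial")
    finals = set(lattice.get("final").split())
    # Adjacency list: state -> [(next state, word), ...] in document order.
    arcs = {}
    for arc in lattice.findall(f"{{{EMMA_NS}}}arc"):
        arcs.setdefault(arc.get("from"), []).append((arc.get("to"), arc.text))
    paths = []

    def walk(state, words):
        # Depth-first traversal; terminates because the lattice is assumed acyclic.
        if state in finals:
            paths.append(" ".join(words))
        for next_state, word in arcs.get(state, []):
            walk(next_state, words + [word])

    walk(initial, [])
    return paths


paths = lattice_paths(SAMPLE)
for path in paths:
    print(path)
```

The number of paths grows multiplicatively with each branching point, which is why a lattice is a more compact representation than an explicit N-best list.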
