Feedback on "Semantic Interpretation for Speech Recognition" from the Multimodal Interaction Working Group from Deborah Dahl on 2005-02-26 (www-voice@w3.org from January to March 2005)

From: Deborah Dahl <dahl@conversational-technologies.com>
Date: Sat, 26 Feb 2005 12:07:16 -0500
To: <www-voice@w3.org>
Message-ID: <00e501c51c25$a30c57f0$3a7ba8c0@chimaera>
Dear Voice Browser Working Group,

The Multimodal Interaction Working Group has reviewed the 
Semantic Interpretation for Speech Recognition Last Call
Working Draft [1] in response to the WG's request 
and has prepared the following feedback. 
Thanks to the EMMA [2] subgroup and to Michael Johnston in
particular for preparing the feedback.

best regards,

Debbie Dahl, MMI WG Chair

Summary:

The EMMA subgroup have reviewed the SI specification in some
detail. Overall we find SI to be very well suited for the production of
application semantics in XML for inclusion in EMMA documents.

We have a number of comments, broken down into three
sections. The first contains suggestions for simple editorial
changes. The second contains some suggestions for an informative
appendix exemplifying how SI can be used to construct EMMA
annotations on elements within the application semantics. The
third contains a proposal for extending the SI specification
in order to provide access to word/phrase timing information
within SI rules in order to support multimodal integration.

[1] http://www.w3.org/TR/semantic-interpretation/

[2] http://www.w3.org/TR/emma/

==========================================================
Feedback from MMI EMMA subgroup on the VB SI specification
==========================================================

Michael Johnston, AT&T
Wu Chou, Avaya
Debbie Dahl, Conversational Technologies
Gerry McCobb, IBM
Paolo Baggia, Loquendo
Dave Raggett, Canon, W3C


The EMMA subgroup have reviewed the SI specification in some
detail. Overall we find SI to be very well suited for the production of
application semantics in XML for inclusion in EMMA documents.

We have a number of comments, broken down below into three
sections. The first contains suggestions for simple editorial
changes. The second contains some suggestions for an informative
appendix exemplifying how SI can be used to construct EMMA
annotations on elements within the application semantics. The
third contains a proposal for extending the SI specification
in order to provide access to word/phrase timing information
within SI rules in order to support multimodal integration.


1. Editorial changes
====================

1.1 Descriptions of EMMA
========================

It would be good if we could clarify the role of EMMA in standardizing
the containers and annotation for semantic representation rather
than the semantic representation of user utterances itself.
How about the following changes:

Third para of abstract:
-----------------------
change last sentence:

"The W3C Multimodal Interaction Activity is defining a data format
(EMMA) for representing the information contained in user utterances"

-->

"The W3C Multimodal Interaction Activity is defining an XML data format
(EMMA) for containing and annotating the information in user utterances"


Section 1.1
-----------
likewise,
second to last para:

suggest change the first sentence to

"The W3C Multimodal Interaction Activity is defining an XML data format
(EMMA) for containing and annotating the information in user utterances"


In the examples in section 3.3.2
--------------------------------
Typo: 4th last sentence, "of the for the" --> "of the"

Section 5, fifth paragraph
--------------------------
second sentence, "can not" --> "cannot"



2. Suggestion for Informative Appendix
======================================

The critical cases for compatibility of SI and EMMA are
situations where EMMA annotations appear on elements within
the application semantics generated by SI rules.

The creation of emma elements such as emma:one-of will need
to be carried out by the processor which applies the SI rules.
These elements cannot be built within the SI scripts
themselves, since they contain the results of multiple different
parses, possibly of different recognition results.

<emma:emma ..>
	<emma:one-of>
		<emma:interpretation>
			APPLICATION NAMESPACE ELEMENTS BUILT USING
			SI SCRIPTS FOR FIRST RESULT STRING
		</emma:interpretation>
		<emma:interpretation>
			APPLICATION NAMESPACE ELEMENTS BUILT USING
			SI SCRIPTS FOR SECOND RESULT STRING
		</emma:interpretation>
		....
	</emma:one-of>
</emma:emma>
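
As a rough sketch of that division of labour, the processor-side wrapping
could look like the following. The function name wrapInOneOf is hypothetical,
and the sketch assumes each parse has already been serialized to an XML
string by the SI processing; the enclosing emma:emma element is left to the
caller.

```javascript
// Hypothetical processor-side helper (not part of SI or EMMA): wraps one
// serialized application-semantics string per parse in
// emma:interpretation elements inside a single emma:one-of.
function wrapInOneOf(serializedResults) {
  const interpretations = serializedResults.map(
    (xml) => "  <emma:interpretation>" + xml + "</emma:interpretation>"
  );
  return ["<emma:one-of>", ...interpretations, "</emma:one-of>"].join("\n");
}

// Two competing results for the same utterance.
const oneOf = wrapInOneOf([
  "<destination>JFK</destination>",
  "<destination>SFO</destination>",
]);
console.log(oneOf);
```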
		
In the review of the SISR document by the EMMA subgroup we identified
three main situations in which EMMA annotations would appear within the
application semantics and be generated using SI scripts.

The first two of these can be handled with the existing
mechanisms of SI and we would like to propose the
inclusion of an informative appendix in the SI specification
showing how SI scripts can be used to build these EMMA
annotations. The EMMA specification itself contains an appendix showing
how emma:hook annotations can be built using SRGS/SI.

2.1 emma:hook
=============

emma:hook is used to indicate that a piece of semantic content
needs to be combined with content from another mode. The mode
required is indicated as the value of emma:hook.

This can be readily handled with the existing SI specification
using the _nsprefix property.

As an example, to create the emma:hook="ink" annotation
in the semantics of "zoom in here" the following rule
could be used:

<rule id="zoom">
    zoom in here
    <tag>
      $.command = new Object();
      $.command.action = "zoom";
      $.command.location = new Object();
      $.command.location._attributes = new Object();
      $.command.location._attributes.hook = new Object();
      $.command.location._attributes.hook._nsprefix = "emma";
      $.command.location._attributes.hook._value = "ink";
      $.command.location.type = "area";
    </tag>
</rule>

The resulting ECMAScript object would be as follows:

{
command: {
          action: "zoom",
          location: {
            _attributes: {
               hook: {
                 _nsprefix: "emma",
                 _value: "ink"
                 }
               },
            type: "area"
           }
   }
}

SI processing in an XML environment would generate the following document:

<command>
    <action>zoom</action>
    <location emma:hook="ink">
        <type>area</type>
    </location>
</command>
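
To make the mapping from the ECMAScript object to the XML document concrete,
here is a simplified sketch of a serializer. It is an illustration only, not
the normative SI serialization: the function name serialize is hypothetical,
it handles only the _attributes, _nsprefix and _value properties used in this
message, ignores an element-level _nsprefix, and does no escaping of special
characters.

```javascript
// Simplified, non-normative sketch of turning an SI result object into XML.
function serialize(name, node, indent = "") {
  // Plain (non-object) values become simple text elements.
  if (typeof node !== "object" || node === null) {
    return indent + "<" + name + ">" + String(node) + "</" + name + ">";
  }
  // Build attributes such as emma:hook="ink" from the _attributes property.
  let attrs = "";
  for (const [attrName, attr] of Object.entries(node._attributes || {})) {
    const qname = attr._nsprefix ? attr._nsprefix + ":" + attrName : attrName;
    attrs += " " + qname + '="' + attr._value + '"';
  }
  const open = "<" + name + attrs + ">";
  const close = "</" + name + ">";
  // A _value property becomes the text content of the element.
  if ("_value" in node) {
    return indent + open + String(node._value) + close;
  }
  // Every other property becomes a child element.
  const children = Object.entries(node)
    .filter(([key]) => key !== "_attributes" && key !== "_nsprefix")
    .map(([key, child]) => serialize(key, child, indent + "  "));
  return [indent + open, ...children, indent + close].join("\n");
}

const result = {
  command: {
    action: "zoom",
    location: {
      _attributes: { hook: { _nsprefix: "emma", _value: "ink" } },
      type: "area",
    },
  },
};
console.log(serialize("command", result.command));
```

Run against the "zoom in here" object, this prints a command element whose
location child carries emma:hook="ink".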

We will submit a separate CR to update appendix 9.1 of the EMMA
working draft so that it uses _nsprefix as above.


2.2 emma:tokens
===============

A second common use of emma annotations within the
application semantics is to annotate the specific
words/tokens associated with some part of the
semantics. These can be used by later stages of
dialog processing and generation in order to construct
confirmation questions. For example:

If the user says

"show flights from kennedy airport"

the system might want to respond using the
actual words that the speaker used, in
making a confirmation:

"did you say you want to leave from 'kennedy airport'"

This can also be achieved with the existing specification,
using the .text property associated with rules.

We discussed the possibility of a general mechanism
which assigned an emma:tokens value on the basis of every
rule application, but this faces two problems: there will generally
not be a one-to-one relationship between the
derivation tree of the parse and the XML elements in the
resulting semantics, and adding emma:tokens to every
part of the semantics would be verbose.

We agreed that while some more general mechanism could be
explored in future, for now the method would be to
explicitly create the emma:tokens attribute.

As a simple example, consider

"flights to kennedy airport"

With the following SRGS/SI rules:

<rule id="flight">
    flights to
    <ruleref uri="#city"/>
    <tag>
      $.command = new Object();
      $.command.action = "flight";
      $.command.destination = new Object();
      $.command.destination._attributes = new Object();
      $.command.destination._attributes.tokens = new Object();
      $.command.destination._attributes.tokens._nsprefix = "emma";
      $.command.destination._attributes.tokens._value = meta.city.text;
      $.command.destination._value = $city;
    </tag>
</rule>

<rule id="city">
	<one-of>
		<item>kennedy airport<tag>$="JFK"</tag></item>
		<item>san francisco<tag>$="SFO"</tag></item>
		<item>john f kennedy <tag>$="JFK"</tag></item>
	</one-of>
</rule>

The resulting ECMAScript object would be as follows:

{command: {
        action: "flight",
        destination: {
          _attributes: {
             tokens: {
               _nsprefix: "emma",
               _value: "kennedy airport"
               }
             },
          _value: "JFK"
        }
}}

SI processing in an XML environment would generate the following document:

<command>
    <action>flight</action>
    <destination emma:tokens="kennedy airport">JFK</destination>
</command>


Note that while it is not possible to determine what the user
actually said from the semantics JFK, it is possible to determine what they
said from the emma:tokens, "kennedy airport".
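
A later dialog stage could then use the annotation when building a
confirmation question. The following is a minimal sketch under our own
assumptions: the helper name confirmDestination and the prompt wording are
hypothetical, and the input is the ECMAScript object for the "flight" rule.

```javascript
// Hypothetical dialog-side helper: builds a confirmation question from the
// emma:tokens annotation rather than from the normalized value ("JFK").
function confirmDestination(command) {
  const attrs = command.destination._attributes || {};
  const tokens = attrs.tokens;
  // Fall back to the semantic value when no tokens annotation is present.
  const spoken = tokens ? tokens._value : command.destination._value;
  return "did you say you want to fly to '" + spoken + "'";
}

const flightCommand = {
  action: "flight",
  destination: {
    _attributes: { tokens: { _nsprefix: "emma", _value: "kennedy airport" } },
    _value: "JFK",
  },
};
console.log(confirmDestination(flightCommand));
```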



3. Addition of temporal metadata to SI/SRGS to support the
creation of EMMA timestamps and multimodal integration
==========================================================

Another kind of EMMA annotation which is needed within the application
semantics, and which is particularly important for multimodal applications
involving multimodal integration, is the annotation
of the timespan associated with a particular piece of the
semantic representation. For example, if the other modality involves
computer vision or gaze, the multimodal integration component will have to
determine where the user was pointing or looking when certain words
were said, e.g. when they say "zoom in here" the component needs to know where
they were pointing so it can use that location as the place to zoom in on.

In order to enable the use of SI for applications with temporal
constraints on multimodal integration, two additional associated variables
would be needed in the SI specification, one to indicate the
start time of the time interval associated with the words parsed by a rule
and one to indicate the end time. For consistency with the EMMA
annotation emma:start and emma:end these could be called .start and .end.
If timing information was required on a particular element in the
semantic representation it could be determined by accessing these.

The values of the .start and .end associated variables should be
absolute timestamps in milliseconds.

Extending the "zoom in here" example from 2.1 above, this would work as
follows, assuming that the word "here" starts at 1087995961542 and ends at
1087995961642:

<rule id="zoom">
    zoom in
    <ruleref uri="#here"/>
    <tag>
      $.command = new Object();
      $.command.action = "zoom";
      $.command.location = new Object();
      $.command.location._attributes = new Object();
      $.command.location._attributes.hook = new Object();
      $.command.location._attributes.hook._nsprefix = "emma";
      $.command.location._attributes.hook._value = "ink";
      $.command.location._attributes.start = new Object();
      $.command.location._attributes.start._nsprefix = "emma";
      $.command.location._attributes.start._value = meta.here.start;
      $.command.location._attributes.end = new Object();
      $.command.location._attributes.end._nsprefix = "emma";
      $.command.location._attributes.end._value = meta.here.end;
      $.command.location.type = $here.type;
    </tag>
</rule>

<rule id="here">
      here <tag>$.type = "area"</tag>
</rule>

The resulting ECMAScript object would be as follows:


{command: {
        action: "zoom",
        location: {
          _attributes: {
             hook: {
               _nsprefix: "emma",
               _value: "ink"
               },
             start: {
               _nsprefix: "emma",
               _value: "1087995961542"
               },
             end: {
               _nsprefix: "emma",
               _value: "1087995961642"
               }
             },
          type: "area"
        }
}}


SI processing in an XML environment would generate the following document:

<command>
    <action>zoom</action>
    <location emma:hook="ink"
              emma:start="1087995961542"
              emma:end="1087995961642">
        <type>area</type>
    </location>
</command>
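
To illustrate how a multimodal integration component might use these
timestamps, here is a small sketch that selects the gestures overlapping the
interval of the word "here". The gesture records and the function name
gesturesDuring are hypothetical, and strict interval overlap is an assumption
made for brevity; a real integrator might also accept gestures shortly before
or after the word.

```javascript
// Hypothetical gesture records carry absolute start/end times in
// milliseconds, on the same clock as emma:start and emma:end.
function gesturesDuring(gestures, start, end) {
  // Two intervals overlap when each one starts before the other ends.
  return gestures.filter((g) => g.start <= end && g.end >= start);
}

// Interval of the word "here" from the example above.
const hereSpan = { start: 1087995961542, end: 1087995961642 };

const gestures = [
  { id: "g1", start: 1087995960000, end: 1087995960500 }, // ends before "here"
  { id: "g2", start: 1087995961500, end: 1087995961700 }, // spans "here"
];

const matches = gesturesDuring(gestures, hereSpan.start, hereSpan.end);
console.log(matches.map((g) => g.id));
```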


Other more specific names could be used if it is undesirable to reserve
'start' and 'end' for this metadata.

This example could be reworked so that the contents of the
location object are defined in the 'here' rule. In that
case the start and end of the word would be accessed through
$.start and $.end.

EMMA also supports relative timestamps through the attributes
emma:time-ref-uri, emma:offset-to-start, and emma:duration.
If absolute timing information is not available, an alternative
would be to make the information needed to build a relative
timestamp available to the SI scripts. The three pieces required
would be a URI pointing to the start of the speech input, an offset
in milliseconds from the start of the speech to the start of the
phrase covered by the current rule, and the duration of the
phrase. For example, the following properties could be used:
.timerefuri, .offset_ms, and .duration_ms.
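
The relationship between the relative and absolute forms is simple
arithmetic, sketched below. The function name toAbsoluteSpan is hypothetical,
and referenceStartMs is an assumed absolute time for the start of the speech
input that the reference URI identifies; a consumer would need to obtain it
from elsewhere.

```javascript
// Converts a relative timestamp (offset and duration in milliseconds) into
// an absolute start/end pair, given the absolute time of the reference point.
function toAbsoluteSpan(referenceStartMs, offsetMs, durationMs) {
  const start = referenceStartMs + offsetMs;
  return { start: start, end: start + durationMs };
}

// With the speech input assumed to start at 1087995961000, an offset of
// 542 ms and a duration of 100 ms give the interval used for "here" above.
const span = toAbsoluteSpan(1087995961000, 542, 100);
console.log(span.start, span.end);
```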
Received on Saturday, 26 February 2005 17:07:58 UTC