- From: <johnston@research.att.com>
- Date: Sun, 1 Apr 2007 16:59:55 -0400
- To: <www-multimodal@w3.org>
- Message-ID: <0C50B346CAD5214EA8B5A7C0914CF2A4290BDA@njfpsrvexg3.research.att.com>
The multimodal working group greatly appreciates the detailed feedback on the EMMA specification from the Voice Browser working group. As a result of this feedback, and of feedback from other groups, a number of substantive revisions have been made to the EMMA specification, along with a reorganization and numerous editorial changes. The multimodal working group will shortly release a second last call working draft incorporating these revisions. The points below provide formal responses to the feedback from the VB working group.

Michael Johnston
AT&T
Editor-in-Chief, EMMA Specification

Formal response to feedback from the Voice Browser working group on the EMMA last call:

===============================================

VB-A1: Clarification / Typo / Editorial
Request of EMMA profile for VoiceXML 2.0/2.1

Resolution: Rejected

The Multimodal working group sees significant benefit in the creation of an EMMA profile for VoiceXML 2.0/2.1. However, the group rejects the request to include this work within the EMMA specification itself. The request might best be resolved by a W3C Note on these issues, or perhaps more broadly on the whole chain that connects a VoiceXML page to SRGS+SISR grammars and then to EMMA for returning speech/DTMF results to VoiceXML. We suggest that this document be edited by the VBWG with some support from the MMIWG.

==================

VB.A1.1: Change to Existing Feature
Profile: Values of emma:mode

Resolution: Accepted

The MMIWG agrees that the values of emma:mode of specific relevance to VoiceXML should be revised in EMMA.
For the current editor's draft, and for the candidate recommendation, we will change the emma:mode values in Section 4.2.11 and throughout the document as follows:
- from "dtmf_keypad" to "dtmf"
- from "speech" to "voice"

==================

VB.A1.2: Clarification / Typo / Editorial
Profile: Optional/Mandatory

Resolution: Accepted with modifications

The specification of what is optional and mandatory for the profile should be part of the EMMA VoiceXML profile, which we propose be edited within the VBWG (see VB.A1). As regards the optional/mandatory status of EMMA features separate from any specific profile, we have reviewed them in detail for the whole EMMA specification, and this will be reflected in the next draft.

==================

VB.A3: Clarification / Typo / Editorial
Profile: DTMF/speech

Resolution: Rejected

See VB.A1 for the profile. The MMIWG agrees that this should be made clear in the EMMA VoiceXML profile document, but as specified in VB.A1 above, we propose that the profile be edited within the VBWG.

==================

VB.A4: Clarification / Typo / Editorial
Profile: Record results

Resolution: Accepted

The MMIWG agrees that, for VoiceXML and for more general use, it is important that EMMA can be used to annotate recorded inputs. The specification already contains an attribute that can be used to provide the URI of the recorded signal; this is the function of emma:signal. Additionally, the specification provides the annotation emma:function="recording" to indicate that an input is a recording rather than a command to an interactive system. One issue that arose in our review of this feedback is that in the case of recordings there would be no content within the emma:interpretation element. Currently this is only possible if emma:no-input="true" or emma:uninterpreted="true". The issue is whether a recording can be marked as emma:uninterpreted="true".
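For illustration, a bare recording result using the existing annotations described above might look like the following sketch (the signal URI and media type are hypothetical); note the empty emma:interpretation element, which is exactly the case at issue:

```xml
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- A recorded input: no semantic content, only a pointer to the signal -->
  <emma:interpretation id="rec1"
      emma:function="recording"
      emma:signal="http://example.com/signals/rec1.wav"
      emma:media-type="audio/x-wav"/>
</emma:emma>
```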
To resolve this issue, the MMIWG proposes to revise the EMMA specification to make clear that emma:uninterpreted="true" means that no interpretation was produced, with no implication that an attempt was made to produce one. Recordings should then be marked as emma:uninterpreted="true".

For record results, the duration of the recording can be determined either from the emma:duration annotation or, using absolute timestamps, by subtracting the value of emma:start from emma:end. Potentially, the size could be determined using a combination of emma:duration and emma:media-type. However, to avoid burdening the EMMA consumer with this additional calculation, and to ensure that the exact size can be indicated, the MMIWG proposes the addition of a new annotation, emma:signal-size, which indicates the size of a recording (referred to with emma:signal) in 8-bit octets. This facilitates the integration of EMMA with VoiceXML.

==================

VB.A5: Change to Existing Feature
Profile: Informative Examples

Resolution: Rejected

See VB.A1 for the profile. Informative examples regarding the specific use of EMMA for VoiceXML 2.0/2.1 are best handled in a separate note or specification. As proposed in VB.A1, this work should be edited within the VBWG.

==================

VB.B: EMMA and evolutions of VoiceXML

Resolution: Deferred

These comments are extremely useful for future versions of EMMA but go beyond the goals and requirements of the current specification.

==================================================
VBWG Comments on EMMA LCWD [1]
==================================================

EMMA represents a good attempt to extend and complete the annotation of input from different modalities. This is an interesting evolution to be evaluated in the context of the current recommendation and also for the advancement of Voice Browser standards.
We acknowledge that the EMMA LCWD [1] is improved and more complete, but we suggest a few extensions that are important to us, along with some general comments.

==================================================
A. EMMA and VoiceXML 2.0/2.1

In today's speech market, VoiceXML 2.0 [2] is the reference specification, with a large presence in the industry. VoiceXML 2.0 has been a W3C Recommendation since March 2004. Moreover, VoiceXML 2.1 [3] extends VoiceXML 2.0 by adding a restricted number of new features, none of which has a direct relationship with the EMMA specification. VoiceXML 2.1 is close to becoming a Proposed Recommendation. In this context we see great value in having an EMMA profile for VoiceXML 2.0/2.1, whose goal is to enable quick adoption of EMMA in the current voice browser industry. EMMA might be conveyed through the adoption of protocols such as IETF MRCPv2, which offers the option of adopting EMMA as the format for speech results.

VoiceXML 2.0/2.1 specifies a limited number of annotations to be collected for each speech or DTMF input (see Table 10 [4] of the VoiceXML 2.0 specification). Moreover, multiple ASR results may be returned at the application's request; they are represented as an N-best list (see [5] of VoiceXML 2.0 for details). Here are some practical proposals for possible extensions to the EMMA specification.

----------

Proposal 1 - EMMA profile

Describe in the EMMA specification a VoiceXML 2.0/2.1 profile, either in an appendix or in a section of the specification. This profile should describe the mandatory annotations needed to allow complete integration in a VoiceXML 2.0/2.1 compliant browser.

VoiceXML 2.0/2.1 requires four annotations related to an input. They are described normatively in [4], as shadow variables related to a form input item. The same values are also accessible from the application.lastresult$ variable; see [4].
The annotations are the following:
- name$.utterance, which might be conveyed by the "emma:token" attribute ([6])
- name$.confidence, which might be conveyed by the "emma:confidence" attribute ([7]); the range of values (0.0 - 1.0) seems to be fine, but some checks could be made in the schemas of both specifications
- name$.inputmode, which might be conveyed by the "emma:mode" attribute ([8]); see Proposal 1.1 for a discussion of its values
- name$.interpretation, an ECMAScript value containing the semantic result, which has to be derived from the content of "emma:interpretation"

As regards the N-best results (see [5] for details), the emma:one-of element should be suitable for conveying them to the voice browser.

---

Proposal 1.1 - Values of emma:mode

Some clarification is needed to explain how to map the values of "emma:mode" ([8]) to the expected values of the "inputmode" variable (Table 10 in [4]). VoiceXML 2.0/2.1 prescribes two values: "speech" and "dtmf". Another option is to adopt in EMMA the exact values expected by VoiceXML 2.0/2.1, to simplify the mapping. Other, finer-grained EMMA mode annotations are not possible in VoiceXML 2.0/2.1.

---

Proposal 1.2 - Optional/mandatory

The profile should clarify which annotations are mandatory and which are optional. For instance, N-best results are an optional feature for VoiceXML 2.0/2.1, while the other annotations are mandatory.

----------

Proposal 2 - Consider 'noinput' and 'nomatch'

Besides user input from a successful recognition, there are several other types of results that VoiceXML applications deal with that should be part of a VoiceXML profile for EMMA, in addition to the ones suggested in Proposal 1. 'noinput' and 'nomatch' situations are mandatory for VoiceXML 2.0/2.1. Since EMMA can also represent these, the EMMA annotations for 'noinput' and 'nomatch' should be part of the VoiceXML EMMA profile. Note that in VoiceXML, 'nomatch' may carry recognition results as described in Proposal 1, to be inserted in the application.lastresult$ variable only.
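As a sketch of how the mapping in Proposal 1 might look in practice, an N-best result could be conveyed with emma:one-of as below. The utterance text, confidence values, and semantic content are hypothetical; emma:mode uses VoiceXML's inputmode value "speech" as suggested in Proposal 1.1, and the token annotation is spelled emma:tokens in recent EMMA drafts:

```xml
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest">
    <!-- maps to name$.utterance / .confidence / .inputmode / .interpretation -->
    <emma:interpretation id="int1" emma:tokens="flights to boston"
        emma:confidence="0.82" emma:mode="speech">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:tokens="flights to austin"
        emma:confidence="0.43" emma:mode="speech">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```

Each emma:interpretation in the emma:one-of would then populate one element of the application.lastresult$ array.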
----------

Proposal 3 - DTMF/speech

It is very important that EMMA be usable for either speech or DTMF input results, because VoiceXML 2.0/2.1 allows both of these inputmode values. We expect that the VoiceXML profile in EMMA will make this clear, to enable full use of EMMA in voice browser applications.

----------

Proposal 4 - Record results

EMMA can represent the results of a record operation (see the description of the record element of VoiceXML [9]), so the EMMA annotations for recordings should also be part of a VoiceXML profile. This feature is optional in VoiceXML 2.0/2.1.

----------

Proposal 5 - Add informative examples

We think that some informative examples would improve the description of the profile. These might include an SISR grammar for DTMF/speech and the expected EMMA results compliant with the profile. The examples should include both a single returned result and N-best results. We think that an additional example of lattices would also be very interesting; even though lattices are not representable in VoiceXML 2.0/2.1, such an example would be useful for the evolution of VoiceXML, see point B below.

==================================================
B. EMMA and the evolution of VoiceXML

For the evolution of VoiceXML it is too early to give precise feedback, but the clear intention is to provide for extended usage of EMMA inside a future VoiceXML application. This includes, but is not limited to:
- providing access to the whole EMMA document inside the application.lastresult$ variable (both as a raw EMMA document and as a processed one, i.e. in ECMA-262 format)
- including proper media types to give a clear indication of whether the raw results are expressed in EMMA or in other formats (e.g. NLSML), and likewise for the processed results

Another possible evolution would be a simple way to pass EMMA results from VoiceXML to other modules to allow further processing.
A last point is that EMMA should also be usable to return the results of Speaker Identification and Verification (SIV). The Voice Browser SIV subgroup is working to create a few examples to circulate to you for feedback. We would greatly appreciate your comments on these ideas, to better address this subject in the context of the evolution of Voice Browser standards.

==================================================

[1] - http://www.w3.org/TR/2005/WD-emma-20050916/
[2] - http://www.w3.org/TR/2004/REC-voicexml20-20040316/
[3] - http://www.w3.org/TR/2005/CR-voicexml21-20050613/
[4] - http://www.w3.org/TR/2004/REC-voicexml20-20040316/#dml2.3.1
[5] - http://www.w3.org/TR/2004/REC-voicexml20-20040316/#dml5.1.5
[6] - http://www.w3.org/TR/emma/#s4.2.1
[7] - http://www.w3.org/TR/emma/#s4.2.8
[8] - http://www.w3.org/TR/emma/#s4.2.11
[9] - http://www.w3.org/TR/2004/REC-voicexml20-20040316/#dml2.3.6
Received on Sunday, 1 April 2007 21:01:12 UTC