- From: Alan Goldschen <alang@mitre.org>
- Date: Sun, 16 Jan 2000 23:14:38 -0500 (EST)
- To: www-voice@w3.org
Dear W3C:

This review is from the following MITRE members of the DARPA Communicator team:

  Sam Bayer       sam@mitre.org
  Sasha Caskey    sasha@mitre.org
  Alan Goldschen  alang@mitre.org
  Dan Loehr       loehr@mitre.org

I have collected the comments from these members and have integrated them into one response from MITRE. I have chosen not to omit any member's response and have presented them in their entirety. When there was overlap among responses for a section, I have integrated the responses.

Thank you,

Alan Goldschen
MITRE
703-883-6005
alang@mitre.org

-----------------------------------------------------------
Introduction

A comment from one reviewer:

"I think that the name of the group is extremely misleading, and its mission is questionable. In particular, the discussion of the definition of "voice browser" makes it clear that the relationship between the objects this group is discussing and what we think of as "Web browsers" is conceptual, to be charitable. In fact, this working group seems to be discussing voice-driven dialogue systems, which bear a very tenuous relationship to what the W3C traditionally does. It seems to have nothing to do with protocol-level discussions, such as X or HTTP; it seems to have very little to do with document-level discussions, such as standards for XML or HTML; rather, it seems to be a discussion about software engineering for speech-enabled systems. The attempt to make this relevant by virtue of appealing to the notion of a "voice markup language" is, I think, beside the point; the fact that dialogues can be encoded in structured documents does not automatically bring dialogue under the purview of the W3C. The only possible justification I can think of for this group to be discussing these sorts of systems is that the information they access may be accessed via a protocol running over TCP/IP. This isn't enough of a justification for me. One might argue in response that once you start trying to tackle the problem of accessing Web information without visual output, a radically different approach ends up justifying itself, because attempting to replicate in audio the visual presentation of information is clearly foolish. I certainly agree with this contention; however, it's not at all clear to me that it justifies what I perceive as a radical distension of the W3C mission."

Another comment:

"The hard work in defining a system such as this is in the details, and the details can swamp the effort. As I've noted, the Communicator community is having trouble converging on a simple domain model for even one domain, and the semantics of the data are crucial at every step in the representation. The Communicator community would love to pull on standards promulgated by the W3C, but I'm not sure that the hard questions are being faced; that is, it would be possible for a set of components to meet all these requirements and still fail to work together in a given domain."

-----------------------------------------------------------
Architecture

The architecture is much in line with what we have in DARPA Communicator, but I thought that they overloaded the Dialog Manager a little too much. In particular, I believe that NL generation should be a separate module from the Dialog Manager.
Here are some of the tasks mentioned for which the DM is responsible:

  * Anaphora resolution
  * Implied expressions
  * Responding to the input statement
  * Resolving ambiguity
  * Issuing instructions and/or queries to the task manager
  * Collecting output from the task manager
  * Formation of a natural language expression (overload)
  * Visual presentation of the task manager output (overload)
  * Coordination of recognizer context

This architecture is described as a representative model, not a recommendation, so it is hard to know how much of it to treat in detail. It is unclear what is required in this very important section because there are no requirements of "must have", "should have", etc. I would strongly recommend that this section be revisited in light of the other sections having been written, with special focus on the requirements.

Figure 1 is not very clear about which components are part of the server and which are part of the client. I would like to see a "should have" or "must have" by each of the components. The section does not discuss security or privacy issues. How do the components in Figure 1 communicate with each other? Is TCP used? What are the protocols? What are the issues for different architectures and different languages? How would general errors be reported? Is there a central component? How would logging be done? How about replay?

I am confused about the placement of the "ASR, keyboard, pointing" box in Figure 1. Why does the telephone input eventually lead to the "ASR, keyboard, pointing" box? I would recommend that separate boxes be specified for each input device (as well as each output device). The multi-media integration should be broken into two components: one for input and one for output. These separate boxes would make some of the other components easier to follow in Figure 1. For example, the NL interpreter receives input from the multi-media integration component; I can follow this if we are dealing with multi-media input, but I have a hard time understanding how the NL interpreter works with multi-media output.

One clear omission is some sort of notion of session. The diagram has audio entering the input synthesis module via different protocols (telephone, IP). It also refers to keyboard and pointing, but fails to take into consideration the fact that these devices may be distinct from the audio channel. For instance, MIT's DARPA Communicator demo has a telephone + Web interface, which one can imagine starting up either by having the user log in to the Web interface and have the system call the user, or by having the user log in to the Web interface and be given a phone number to call, or by the user calling in on the phone and being given an ID to type into the Web browser, etc. The key observation is that there are multiple streams coming into the system, and they must be "bundled" in such a way that different users' streams remain distinct. The Communicator architecture has this idea built deep into the system, and I believe that this is correct. This issue arises in the dialogue requirements document, section 1. My experience with Communicator suggests that this capability needs to be planned for at the start, and that it is not easy or elegant to insert later. The designations for sections 1.3 ("should have") and 1.4 ("nice to have") in this latter document suggest that consideration of this feature will be postponed, which I think is unwise.
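To make the session point concrete, here is a minimal sketch of the kind of session registry we have in mind. This is purely illustrative and is not taken from any W3C or Communicator specification; all class and parameter names are hypothetical.

    # Illustrative only: a hypothetical registry that bundles multiple
    # input streams (telephone audio, Web keyboard/pointing) under one
    # session so that different users' streams remain distinct.

    class Session:
        def __init__(self, session_id):
            self.session_id = session_id
            self.streams = {}          # e.g. {"telephone": ..., "web": ...}
            self.dialogue_state = {}   # per-user dialogue context

        def attach_stream(self, kind, stream):
            self.streams[kind] = stream


    class SessionRegistry:
        def __init__(self):
            self._sessions = {}

        def get_or_create(self, session_id):
            if session_id not in self._sessions:
                self._sessions[session_id] = Session(session_id)
            return self._sessions[session_id]


    # Usage: the user logs in to the Web interface and is given an ID to
    # speak or type when calling in; both streams end up in one session.
    registry = SessionRegistry()
    web_session = registry.get_or_create("user-42")
    web_session.attach_stream("web", object())         # placeholder Web stream
    phone_session = registry.get_or_create("user-42")  # same ID from the phone call
    phone_session.attach_stream("telephone", object())
    assert web_session is phone_session

The point of the sketch is only that the bundling key (the session ID) has to exist at the architecture level, where the streams first enter the system, which is why we believe it must be planned for at the start.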
-----------------------------------------------------------
Dialog Requirements

1.0 The goals of this section are not clear because it is not clear how this group differs from the multi-modality group. Does the multi-modal group subsume the dialog requirements group? I would assume this to be the case, since dialogue requirements for W3C voice browsers are multi-modal by definition. It is not at all clear to me that the complexities of dialogue can be reduced successfully to the sort of finite-state model implied by the kind of encoding represented in this document and specifications like VoxML. It is not even clear to me that the human factors of simple tasks can be addressed. I do not have any concrete recommendations here, only an indication of skepticism.

1.2 I do not understand the meaning of the word "interpreted" in the first sentence (as well as in other sections of this document). This section is labeled as a "must have", yet the definition of "interpreted" is left unspecified. I would recommend that the group be very clear on the precise meaning of "interpreted". What exactly is the system to do? What are the requirements?

1.3 I think this should be upgraded to a "must have". I see no reason not to; the technology is well understood. I do not think the W3C should underestimate the necessity of multi-modal input, even in its first draft. Also, is "co-ordinated" a word?

1.4 Upgrade to "should have". There are systems that achieve this today (QuickSet from OGI).

1.6 Again, this should be upgraded to at least a "should have". It is extremely important to allow different modes of output.

2.1.2 "Note confirmation sub-dialogs take precedence over clarification sub-dialogs." I am not sure why the precedence of confirmation vs. clarification sub-dialogs needs to be specified in the requirements. They could each simply be specified, and the precedence strategy could be decided by the individual application (browser).

2.1.4 Over-informative Input (nice to have): We believe that this should be upgraded to a "should have" for two reasons. First, throwing away information provided by the user will lead to frustration for the user; keeping the information is certainly doable and will be expected in a good dialog system. Second, I consider additional information in user input to be a form of mixed initiative, and it should therefore be upgraded to at least a "should have".

2.3 Help Behavior (should have): I think this is a "must have". Every good user interface has one, and it is certainly doable.

2.6 This section is considered a "should have", but 2.9 is a "must have"; isn't that inconsistent? If the infrastructure is there for 2.9, why not just make 2.6 a "must have" as well?

2.7 Misspelling of "bargein"; it should be "barge in". The barge-in described in this section pertains only to system output, but it seems to me that barge-in capability is relevant to input as well as output. If barge-in were used for input, I would recommend that this section increase its priority from "should have" to "must have".

3.5/3.6 Do the requirements for this section indicate that we need to have some sort of compiler or grammar checker for the developer?

4.1 How does the event handling in this section differ from the event handling specified in 3.7 and 3.8? Sections 4.1 and 3.7 are a "must have", while section 3.8 is a "should have".
4.2 Logging (nice to have): Of our reviewers, one wants this upgraded to a "should have" and another wants it upgraded to a "must have". It is hard for us to understand why logging would not be included in any system built; the developers must have some form of logging in order to build the system.

-----------------------------------------------------------
Grammar Representation

General comments on the Grammar Representation requirements: A nice-to-have (but only a nice-to-have) would be some way to capture the F0 contour of the input. This is beyond the state of the art, but future end-to-end systems might make use of the intonation to glean additional meaning. This is only worth adding if the group wants to include "next-generation" requirements in the current document. (Please note that this mirrors the NL document, "Input Requirements" #5, which states "Accept prosodic notations on the ASR output" (nice to specify). Why isn't this in the grammar representation document?)

In general, on the Dialogue Requirements: that document talks about sub-dialogues, etc., when the application determines that more information is needed. But where is the logic to decide this? Where is the interface with the back-end? If it is assumed to be inside this "dialog" module (if it is a module), they should explicitly say so; or they should say that the application itself (the browser) takes care of it. (I am just looking for the functionality provided by the Turn Manager.)

1.0 The grammar representation should include the definition of statistical grammars, i.e., those not defined by finite state or context free grammars. (Not all speech recognizers use finite state or context free grammars; additionally, multi-modality-related grammars might not fall into these categories either.) If section 1.3 were to support statistical grammars, then I recommend that 1.3 be upgraded from "should specify" to "must specify".

The grammar representation should include a definition of the alphabet mapping. Furthermore, the developer should have a means to map the language alphabet to speech acoustic models. Without a mapping to acoustic models, it is difficult to imagine how words are correctly recognized and how out-of-vocabulary words are supported. The grammar definition should also provide support for languages and sub-language dialects. I assume that the grammar defined in this section is for voice input and output, and that the multi-modal section provides for the definition of non-speech grammars such as images/graphics.

1.4 Out-of-Vocabulary Words (must specify): How would out-of-vocabulary words be detected? Would a threshold or confidence score be used? Why wouldn't the recognizer just fail to recognize the word (with a presumably low confidence score)? Should we add some confidence weights? Or does this section reflect a requirement for a mechanism to introduce an OOV word? One idea is to use a meta-command such as "Accept new word" followed by "foo" or "eff-oh-oh". This technique would have to follow the phone definitions as well as the phonetic-to-acoustic mapping.

4.0 I would like to see more explanation and definition of the word "semantics". A lack of consistent meaning could lead to inconsistent interpretations by various developers.

4.2 Semantic tagging: One criterion possibly missing here is some distinguished way of indicating the ontological type of the interpretation. This may be accomplished by identifying a distinguished attribute in the AV representation returned.
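As a purely illustrative sketch of what we mean by a distinguished attribute (this is our own example, not taken from the requirements document or from the MITRE frame-exchange proposal; the attribute and frame names are hypothetical), an attribute-value interpretation might carry its ontological type in a reserved "type" slot:

    # Illustrative only: an attribute-value frame in which a distinguished
    # "type" attribute carries the ontological type of the interpretation.
    interpretation = {
        "type": "flight_query",          # distinguished ontological type
        "origin":      {"type": "city", "name": "Boston"},
        "destination": {"type": "city", "name": "Denver"},
        "depart_date": {"type": "date", "month": 1, "day": 16, "year": 2000},
    }

    def ontological_type(frame):
        """Return the distinguished type attribute of an AV frame."""
        return frame.get("type")

    assert ontological_type(interpretation) == "flight_query"

Whatever convention is chosen, the point is that consumers of the NL output need one agreed-upon place to look for the type, rather than inferring it from the other attributes.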
5.0 It is not clear what post-processing means for dialogue. I assume that post-processing means that the recognizer can retrieve the grammar rules and specifications that applied for a particular utterance. Is this the case?

6.1 I am not sure of the meaning of "native format". I would have assumed native format to mean a particular format from a vendor or from computer hardware, or it could be implying a binary format. Would a better title be "machine- or vendor-specific format"?

7.0 This section seems to imply that the dialogue must support XML. Would a specification such as "a developer must provide a translator that translates from/to the XML format" be appropriate?

8.6 Add a sentence saying that if weights are not specified, then the default is that all alternatives are equally likely.

-----------------------------------------------------------
Natural Language Processing

Some general comments. First, this section is open to many interpretations, and many developers could have different strategies. The specifications are vague on the exact contents of the messages. (I can understand why; the problem is hard.) However, by not specifying the exact content, it becomes extremely difficult to build this component. Is this group intertwined with the "applications" group?

Second, we in the Communicator project have a number of groups working on the same task, with very similar (intended to be common) data, and we cannot agree on the contents of a frame representation for that single domain. Agreeing on a "syntactic" encoding of the output of NL is a small first step; the real problem is the domain modeling, which you do not even have a group for.

Third, at the very least, you should consider defining the interpretations of common elements like names, money, dates, time expressions, etc. Communicator participants can find one such attempt at http://fofoca.mitre.org/members/CAC/frame_exchange_proposal_v1.html. If the W3C is interested, MITRE would be happy to make this document available, with the understanding that this is a MITRE proposal to the Communicator project which has not yet been accepted (again, an indication of how difficult it is to converge on even the simplest ontological issues).

Fourth, under "Task-specific information", sub-heading "Represent task information", requirements 3 and 4 (handle anaphora/ellipsis, and deixis) are hard to do and are not solved problems. They might be labeled "should have" instead of "must have".

Fifth, the NL requirements make no mention of context tracking (including dialogue history). Are we to assume that context-tracking functionality is buried under the above requirements to handle anaphora/ellipsis/deixis?

Sixth, many of the requirements throughout this document are intertwined with those of the dialog document (like asking for help, or repeating the last statement). They do not appear to have been cross-checked with the dialog document so that both are in sync (that is, so that both documents have the same requirements at the same level of desirability: must have, should have, etc.). There are many examples where the dialog requirements give one priority and the NLP requirements another (for example, "2. Represent a request for help" is "must specify" in NL and "should have" in Dialog).

Input Requirements:

1. This is considered a "must have"; however, there is no standard for acoustic scores. How would NLU units account for, say, different syntax/meaning of acoustic scores from different recognizers?
We should either restrict the "must have" to a list of sentences, with optional acoustic scores as a "should have", or remove acoustic scores from here altogether.

3. I see no reason why this should not be a "must have". It is simple enough and should be included in the first draft (e.g., keyboard input, or a really good recognizer).

4. Again, it worries me that a word lattice (which is not a standard) would be required as a "must have".

-----------------------------------------------------------
Speech Synthesis

General comment: Where is natural language generation, or any output generation? There is something like it in the Synthesis document (section 3.11), but that is a "future revision".

2.5 Do you REALLY want to require multi-modal synthesis? This is awfully ambitious.

3.8 Special text constructs: I would require that these markup elements be mirrored in the grammar specifications. Since they imply a simple ontology, it would be appropriate for this ontology to be consistently applied throughout the system.

4.3 Intonation control (should have): This should be a "must have", especially since the next requirement (Acoustic Prosodics) is a "must have". Intonation control is as important as acoustic prosodics, and it is just as doable: "Acoustic Prosodics" already specifies pitch and pitch range, which is closely related to what is needed for "Intonation Control".
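To illustrate why we see the two requirements as close in difficulty (this is our own sketch, with hypothetical parameter names, not anything from the Synthesis requirements document), an intonation request can be expressed as a trajectory over the very pitch parameters that Acoustic Prosodics already exposes:

    # Illustrative only: both "acoustic prosodics" and "intonation control"
    # expressed over the same hypothetical prosodic parameters.
    acoustic_prosodics = {
        "pitch_hz": 180,        # baseline pitch
        "pitch_range_hz": 60,   # allowed excursion around the baseline
        "rate_wpm": 160,
    }

    # An intonation request (e.g., a rising contour for a yes/no question)
    # is a sequence of targets over the same pitch parameter, indexed by
    # relative position within the utterance.
    rising_contour = [
        {"position": 0.0, "pitch_hz": 170},
        {"position": 0.8, "pitch_hz": 180},
        {"position": 1.0, "pitch_hz": 230},  # final rise
    ]

If a synthesizer can honor the first structure, honoring the second requires little additional machinery, which is why we recommend upgrading Intonation Control to a "must have".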