- From: Alan Goldschen <alang@mitre.org>
- Date: Sun, 16 Jan 2000 23:14:38 -0500 (EST)
- To: www-voice@w3.org
Dear W3C:

This review is from the following MITRE members of the DARPA Communicator team:

  Sam Bayer       sam@mitre.org
  Sasha Caskey    sasha@mitre.org
  Alan Goldschen  alang@mitre.org
  Dan Loehr       loehr@mitre.org

I have collected the comments from these members and have integrated them into one response from MITRE. I have chosen not to omit any member's response and have presented them in their entirety. When there was overlap among responses for a section, I have integrated the responses.

Thank you,

Alan Goldschen
MITRE
703-883-6005
alang@mitre.org

-----------------------------------------------------------
Introduction

A comment from one reviewer:

"I think that the name of the group is extremely misleading, and its mission is questionable. In particular, the discussion of the definition of "voice browser" makes it clear that the relationship between the objects this group is discussing and what we think of as "Web browsers" is conceptual, to be charitable. In fact, this working group seems to be discussing voice-driven dialogue systems, which bear a very tenuous relationship to what the W3C traditionally does. It seems to have nothing to do with protocol-level discussions, such as X or HTTP; it seems to have very little to do with document-level discussions, such as standards for XML or HTML; rather, it seems to be a discussion about software engineering for speech-enabled systems. The attempt to make this relevant by virtue of appealing to the notion of a "voice markup language" is, I think, beside the point; the fact that dialogues can be encoded in structured documents does not automatically bring dialogue under the purview of the W3C. The only possible justification I can think of for this group to be discussing these sorts of systems is that the information they access may be accessed via a protocol running over TCP/IP. This isn't enough of a justification for me. One might argue in response that once you start trying to tackle the problem of accessing Web information without visual output, a radically different approach ends up justifying itself, because attempting to replicate in audio the visual presentation of information is clearly foolish. I certainly agree with this contention; however, it's not at all clear to me that it justifies what I perceive as a radical distension of the W3C mission."

Another comment:

"The hard work in defining a system such as this is in the details, and the details can swamp the effort. As I've noted, the Communicator community is having trouble converging on a simple domain model for even one domain, and the semantics of the data are crucial at every step in the representation. The Communicator community would love to pull on standards promulgated by the W3C, but I'm not sure that the hard questions are being faced; that is, it would be possible for a set of components to meet all these requirements and still fail to work together in a given domain."

-----------------------------------------------------------
Architecture

The architecture is much in line with what we have in DARPA Communicator, but I thought that they overloaded the Dialog Manager a little too much. In particular, I believe that NL generation should be a separate module from the Dialog Manager.
Here are some of the tasks mentioned for which the DM is responsible:

  * Anaphora resolution
  * Implied expressions
  * Responding to the input statement
  * Resolving ambiguity
  * Issuing instructions and/or queries to the task manager
  * Collecting output from the task manager
  * Formation of a natural language expression (overload)
  * Visual presentation of the task manager output (overload)
  * Coordination of recognizer context

This architecture is described as a representative model, not a recommendation, so it is hard to know how much of it to treat in detail. It is unclear what is required in this very important section because there are no requirements of "must have", "should have", etc. I would strongly recommend that this section be revisited in light of the other sections having been written, with special focus on the requirements.

Figure 1 is not very clear about which components are part of the server and which are part of the client. I would like to see a "should have" or "must have" by each of the components. The section does not discuss security or privacy issues. How do the components in Figure 1 communicate with each other? Is TCP used? What are the protocols? What are the issues for different architectures and different languages? How would general errors be reported? Is there a central component? How would logging be done? How about replay?

I am confused about the placement of the "ASR, keyboard, pointing" box in Figure 1. Why does the telephone input eventually lead to the "ASR, keyboard, pointing" box? I would recommend that separate boxes be specified for each input device (as well as each output device). The multi-media integration should be broken into two components: one for input and one for output. These separate boxes would make some of the other components easier to follow in Figure 1. For example, the NL interpreter receives input from the multi-media integration component; I can follow this if we are dealing with multi-media input, but I have a hard time understanding how the NL interpreter works with multi-media output.

One clear omission is some sort of notion of session. The diagram has audio entering the input synthesis module via different protocols (telephone, IP). It also refers to keyboard and pointing, but fails to take into consideration the fact that these devices may be distinct from the audio channel. For instance, MIT's DARPA Communicator demo has a telephone + Web interface, which one can imagine starting up either by having the user log in to the Web interface and have the system call the user, or by having the user log in to the Web interface and be given a phone number to call, or by the user calling in on the phone and being given an ID to type into the Web browser, etc. The key observation is that there are multiple streams coming into the system, and they must be "bundled" in such a way that different users' streams remain distinct. The Communicator architecture has this idea built deep into the system, and I believe that this is correct. This issue arises in the dialogue requirements document, section 1. My experience with Communicator suggests that this capability needs to be planned for at the start, and that it is not easy or elegant to insert later. The designations for sections 1.3 ("should have") and 1.4 ("nice to have") in this latter document suggest that consideration of this feature will be postponed, which I think is unwise.
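To make the session point concrete, here is a minimal sketch of the kind of session registry we have in mind. This is purely illustrative and is not taken from any W3C or Communicator specification; all class and parameter names are hypothetical.

    # Illustrative only: a hypothetical registry that bundles multiple
    # input streams (telephone audio, Web keyboard/pointing) under one
    # session so that different users' streams remain distinct.

    class Session:
        def __init__(self, session_id):
            self.session_id = session_id
            self.streams = {}          # e.g. {"telephone": ..., "web": ...}
            self.dialogue_state = {}   # per-user dialogue context

        def attach_stream(self, kind, stream):
            self.streams[kind] = stream


    class SessionRegistry:
        def __init__(self):
            self._sessions = {}

        def get_or_create(self, session_id):
            if session_id not in self._sessions:
                self._sessions[session_id] = Session(session_id)
            return self._sessions[session_id]


    # Usage: the user logs in to the Web interface and is given an ID to
    # speak or type when calling in; both streams end up in one session.
    registry = SessionRegistry()
    web_session = registry.get_or_create("user-42")
    web_session.attach_stream("web", object())         # placeholder Web stream
    phone_session = registry.get_or_create("user-42")  # same ID from the phone call
    phone_session.attach_stream("telephone", object())
    assert web_session is phone_session

The point of the sketch is only that the bundling key (the session ID) has to exist at the architecture level, where the streams first enter the system, which is why we believe it must be planned for at the start.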
-----------------------------------------------------------
Dialog Requirements

1.0 The goals of this section are not clear because it is not clear how this group differs from the multi-modality group. Does the multi-modal group subsume the dialog requirements group? I would assume this to be the case, since dialogue requirements for W3C voice browsers are multi-modal by definition. It is not at all clear to me that the complexities of dialogue can be reduced successfully to the sort of finite-state model implied by the kind of encoding represented in this document and specifications like VoxML. It is not even clear to me that the human factors of simple tasks can be addressed. I do not have any concrete recommendations here, only an indication of skepticism.

1.2 I do not understand the meaning of the word "interpreted" in the first sentence (as well as in other sections of this document). This section is labeled as a "must have", yet the definition of "interpreted" is left unspecified. I would recommend that the group be very clear on the precise meaning of "interpreted". What exactly is the system to do? What are the requirements?

1.3 I think this should be upgraded to a "must have". I see no reason not to; the technology is well understood. I do not think the W3C should underestimate the necessity of multi-modal input, even in its first draft. Also, is "co-ordinated" a word?

1.4 Upgrade to "should have". There are systems that achieve this today (QuickSet from OGI).

1.6 Again, this should be upgraded to at least a "should have". It is extremely important to allow different modes of output.

2.1.2 "Note confirmation sub-dialogs take precedence over clarification sub-dialogs." I am not sure why the precedence of confirmation vs. clarification sub-dialogs needs to be specified in the requirements. They could each simply be specified, and the precedence strategy could be decided by the individual application (browser).

2.1.4 Over-informative Input (nice to have): We believe that this should be upgraded to a "should have" for two reasons. First, throwing away information provided by the user will lead to frustration for the user; keeping the information is certainly doable and will be expected in a good dialog system. Second, I consider additional information in user input to be a form of mixed initiative, and it should therefore be upgraded to at least a "should have".

2.3 Help Behavior (should have): I think this is a "must have". Every good user interface has one, and it is certainly doable.

2.6 This section is considered a "should have", but 2.9 is a "must have"; isn't that inconsistent? If the infrastructure is there for 2.9, why not just make 2.6 a "must have" as well?

2.7 Misspelling of "bargein"; it should be "barge in". The barge-in described in this section pertains only to system output, but it seems to me that barge-in capability is relevant to input as well as output. If barge-in were used for input, I would recommend that this section increase its priority from "should have" to "must have".

3.5/3.6 Do the requirements for this section indicate that we need to have some sort of compiler or grammar checker for the developer?

4.1 How does the event handling in this section differ from the event handling specified in 3.7 and 3.8? Sections 4.1 and 3.7 are a "must have", while section 3.8 is a "should have".
4.2 Logging (nice to have): Of our reviewers, one wants this upgraded to a "should have" and another wants it upgraded to a "must have". It is hard for us to understand why logging would not be included in any system built; the developers must have some form of logging in order to build the system.

-----------------------------------------------------------
Grammar Representation

General comments on the Grammar Representation requirements: A nice-to-have (but only a nice-to-have) would be some way to capture the F0 contour of the input. This is beyond the state of the art, but future end-to-end systems might make use of the intonation to glean additional meaning. This is only worth adding if the group wants to include "next-generation" requirements in the current document. (Please note that this mirrors the NL document, "Input Requirements" #5, which states "Accept prosodic notations on the ASR output" (nice to specify). Why isn't this in the grammar representation document?)

In general, on the Dialogue Requirements: that document talks about sub-dialogues, etc., when the application determines that more information is needed. But where is the logic to decide this? Where is the interface with the back-end? If it is assumed to be inside this "dialog" module (if it is a module), they should explicitly say so; or they should say that the application itself (the browser) takes care of it. (I am just looking for the functionality provided by the Turn Manager.)

1.0 The grammar representation should include the definition of statistical grammars, i.e., those not defined by finite state or context free grammars. (Not all speech recognizers use finite state or context free grammars; additionally, multi-modality-related grammars might not fall into these categories either.) If section 1.3 were to support statistical grammars, then I recommend that 1.3 be upgraded from "should specify" to "must specify".

The grammar representation should include a definition of the alphabet mapping. Furthermore, the developer should have a means to map the language alphabet to speech acoustic models. Without a mapping to acoustic models, it is difficult to imagine how words are correctly recognized and how out-of-vocabulary words are supported. The grammar definition should also provide support for languages and sub-language dialects. I assume that the grammar defined in this section is for voice input and output, and that the multi-modal section provides for the definition of non-speech grammars such as images/graphics.

1.4 Out-of-Vocabulary Words (must specify): How would out-of-vocabulary words be detected? Would a threshold or confidence score be used? Why wouldn't the recognizer just fail to recognize the word (with a presumably low confidence score)? Should we add some confidence weights? Or does this section reflect a requirement for a mechanism to introduce an OOV word? One idea is to use a meta-command such as "Accept new word" followed by "foo" or "eff-oh-oh". This technique would have to follow the phone definitions as well as the phonetic-to-acoustic mapping.

4.0 I would like to see more explanation and definition of the word "semantics". A lack of consistent meaning could lead to inconsistent interpretations by various developers.

4.2 Semantic tagging: One criterion possibly missing here is some distinguished way of indicating the ontological type of the interpretation. This may be accomplished by identifying a distinguished attribute in the AV representation returned.
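As a purely illustrative sketch of what we mean by a distinguished attribute (this is our own example, not taken from the requirements document or from the MITRE frame-exchange proposal; the attribute and frame names are hypothetical), an attribute-value interpretation might carry its ontological type in a reserved "type" slot:

    # Illustrative only: an attribute-value frame in which a distinguished
    # "type" attribute carries the ontological type of the interpretation.
    interpretation = {
        "type": "flight_query",          # distinguished ontological type
        "origin":      {"type": "city", "name": "Boston"},
        "destination": {"type": "city", "name": "Denver"},
        "depart_date": {"type": "date", "month": 1, "day": 16, "year": 2000},
    }

    def ontological_type(frame):
        """Return the distinguished type attribute of an AV frame."""
        return frame.get("type")

    assert ontological_type(interpretation) == "flight_query"

Whatever convention is chosen, the point is that consumers of the NL output need one agreed-upon place to look for the type, rather than inferring it from the other attributes.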
5.0 It is not clear what post-processing means for dialogue. I assume that post-processing means that the recognizer can retrieve the grammar rules and specifications that applied for a particular utterance. Is this the case?

6.1 I am not sure of the meaning of "native format". I would have assumed native format to mean a particular format from a vendor or from computer hardware, or it could be implying a binary format. Would a better title be "machine- or vendor-specific format"?

7.0 This section seems to imply that the dialogue must support XML. Would a specification such as "a developer must provide a translator that translates from/to the XML format" be appropriate?

8.6 Add a sentence saying that if weights are not specified, then the default is that all alternatives are equally likely.

-----------------------------------------------------------
Natural Language Processing

Some general comments. First, this section is open to many interpretations, and many developers could have different strategies. The specifications are vague on the exact contents of the messages. (I can understand why; the problem is hard.) However, by not specifying the exact content, it becomes extremely difficult to build this component. Is this group intertwined with the "applications" group?

Second, we in the Communicator project have a number of groups working on the same task, with very similar (intended to be common) data, and we cannot agree on the contents of a frame representation for that single domain. Agreeing on a "syntactic" encoding of the output of NL is a small first step; the real problem is the domain modeling, which you do not even have a group for.

Third, at the very least, you should consider defining the interpretations of common elements like names, money, dates, time expressions, etc. Communicator participants can find one such attempt at http://fofoca.mitre.org/members/CAC/frame_exchange_proposal_v1.html. If the W3C is interested, MITRE would be happy to make this document available, with the understanding that this is a MITRE proposal to the Communicator project which has not yet been accepted (again, an indication of how difficult it is to converge on even the simplest ontological issues).

Fourth, under "Task-specific information", sub-heading "Represent task information", requirements 3 and 4 (handle anaphora/ellipsis, and deixis) are hard to do and are not solved problems. They might be labeled "should have" instead of "must have".

Fifth, the NL requirements make no mention of context tracking (including dialogue history). Are we to assume that context-tracking functionality is buried under the above requirements to handle anaphora/ellipsis/deixis?

Sixth, many of the requirements throughout this document are intertwined with those of the dialog document (like asking for help, or repeating the last statement). They do not appear to have been cross-checked with the dialog document so that both are in sync (that is, so that both documents have the same requirements at the same level of desirability: must have, should have, etc.). There are many examples where the dialog requirements give one priority and the NLP requirements another (for example, "2. Represent a request for help" is "must specify" in NL and "should have" in Dialog).

Input Requirements:

1. This is considered a "must have"; however, there is no standard for acoustic scores. How would NLU units account for, say, different syntax/meaning of acoustic scores from different recognizers?
We should either restrict the "must have" to a list of sentences, with optional acoustic scores as a "should have", or remove acoustic scores from here altogether.

3. I see no reason why this should not be a "must have". It is simple enough and should be included in the first draft (e.g., keyboard input, or a really good recognizer).

4. Again, it worries me that a word lattice (which is not a standard) would be required as a "must have".

-----------------------------------------------------------
Speech Synthesis

General comment: Where is natural language generation, or any output generation? There is something like it in the Synthesis document (section 3.11), but that is a "future revision".

2.5 Do you REALLY want to require multi-modal synthesis? This is awfully ambitious.

3.8 Special text constructs: I would require that these markup elements be mirrored in the grammar specifications. Since they imply a simple ontology, it would be appropriate for this ontology to be consistently applied throughout the system.

4.3 Intonation control (should have): This should be a "must have", especially since the next requirement (Acoustic Prosodics) is a "must have". Intonation control is as important as acoustic prosodics, and it is just as doable: "Acoustic Prosodics" already specifies pitch and pitch range, which is closely related to what is needed for "Intonation Control".
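To illustrate why we see the two requirements as close in difficulty (this is our own sketch, with hypothetical parameter names, not anything from the Synthesis requirements document), an intonation request can be expressed as a trajectory over the very pitch parameters that Acoustic Prosodics already exposes:

    # Illustrative only: both "acoustic prosodics" and "intonation control"
    # expressed over the same hypothetical prosodic parameters.
    acoustic_prosodics = {
        "pitch_hz": 180,        # baseline pitch
        "pitch_range_hz": 60,   # allowed excursion around the baseline
        "rate_wpm": 160,
    }

    # An intonation request (e.g., a rising contour for a yes/no question)
    # is a sequence of targets over the same pitch parameter, indexed by
    # relative position within the utterance.
    rising_contour = [
        {"position": 0.0, "pitch_hz": 170},
        {"position": 0.8, "pitch_hz": 180},
        {"position": 1.0, "pitch_hz": 230},  # final rise
    ]

If a synthesizer can honor the first structure, honoring the second requires little additional machinery, which is why we recommend upgrading Intonation Control to a "must have".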