- From: Dennis de Champeaux <ddc@cutter.rexx.com>
- Date: Fri, 8 Oct 2004 12:53:44 -0700 (PDT)
- To: scott.mcglashan@hp.com (McGlashan, Scott)
- Cc: www-voice@w3.org, ddc@cutter.rexx.com (Dennis de Champeaux)
My final thoughts about VXML

Before rounding up with more comments, first some acknowledgements:

- Due to the hard work already done on VXML, it was relatively easy to provide a phone interface to the fairly complex self-diagnosis site HealthCheck4Me.info
- I must also thank those who have replied to my earlier reported experiences and comments.

It has been good to take some distance from my experiences so that I can volunteer some "deeper" thoughts about VXML. I DO realize now that my earlier suggestions were too naive given the challenges in this realm.

Below are some thoughts, mainly from a software system development and maintenance perspective. It is well known that the costs of system development are just a fraction of the total life cycle costs. I will argue below that VXML is quite problematic from this perspective. I add some linguistic thoughts as well, although I am very unqualified in that domain.

Let me first distinguish three distinct components of VXML:

- Speech generation
- Speech recognition
- Dialog control

Speech generation
-----------------

HTML as a text markup language has been the model for VXML as a speech markup language. Both of them are OK (apart from my earlier comments about strange restrictions in VXML) when the text/speech is static, i.e. NOT dynamically generated. Static text can easily be produced by any editor or by simple tools, and thus its development and maintenance is straightforward.

HTML for dynamic text and VXML for dynamic speech, generated by webservers, are quite problematic. They require writing correct code that, when executed at a later stage, generates correct code in a different language, to be executed or interpreted in a remote environment. This is ideal from a hacker's and job protection perspective, but is a solid disaster from the perspective of development and maintenance. Incrementality and lack of foresight are the "justification" for how dynamic HTML must now be generated.
It is quite unfortunate - and to be very frank it is mysterious to me - that VXML has replicated this historical disaster.

Here are two sketches of strawman alternatives, to be replaced by something better:

(1)
- Write a Java function SAY with as many arguments as needed that produces an annotated expression that can be fed into a speech synthesizer
- Compile this function at the server side
- Write Java (servlet) code that produces the actual arguments and subsequently ships the arguments as well as the byte code of the SAY function to the speech server
- The speech server obtains the byte code of the SAY function as well as the arguments, applies the SAY function to the arguments and feeds the result into the speech synthesizer

(2)
- Create a serializable instance of a Java class Say and assume that it has an attribute 'pronounce' whose value is a Vector. Assume as well an operation addToPronounce that takes two arguments: a string as well as a modality (male/female, loudness, speed, etc.)
- Add as many sentences (or sentence fragments) with the proper modality as needed
- Serialize the instance and ship it to the speech server
- The speech server recreates the instance of Say and executes an operation sayIt(), which does the proper things with the data in the attribute 'pronounce'

The key advantage of both alternatives is that only a single language needs to be used and that a compile-time check ascertains syntactic correctness.

Speech recognition
------------------

All problems described in the previous section apply here as well, and a similar alternative is recommended.

In addition, it seems to me (bear in mind I am NOT a {computational} linguist) that the VXML speech recognition machinery does not scale to more advanced processing.
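Going back for a moment to strawman alternative (2) in the speech generation section: a minimal Java sketch of it might look as follows. This is only an illustration under stated assumptions, not a proposal for an actual API. The names Say, pronounce, addToPronounce and sayIt come from the description above; the Modality class, its three fields, the SayDemo helper and the textual output format are hypothetical. An ArrayList stands in for the Vector, and sayIt() returns a string where a real speech server would drive a synthesizer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical modality of a fragment: voice, loudness and speed. */
class Modality implements Serializable {
    private static final long serialVersionUID = 1L;
    final String voice;    // e.g. "male" or "female"
    final double loudness; // 1.0 = normal (assumed scale)
    final double speed;    // 1.0 = normal (assumed scale)

    Modality(String voice, double loudness, double speed) {
        this.voice = voice;
        this.loudness = loudness;
        this.speed = speed;
    }
}

/** Utterance assembled on the web server and spoken on the speech server. */
class Say implements Serializable {
    private static final long serialVersionUID = 1L;

    /** One sentence (fragment) together with its modality. */
    private static final class Fragment implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        final Modality modality;

        Fragment(String text, Modality modality) {
            this.text = text;
            this.modality = modality;
        }
    }

    // The posting suggests a Vector; an ArrayList plays the same role here.
    private final List<Fragment> pronounce = new ArrayList<>();

    /** Appends a fragment; returns this so that calls can be chained. */
    Say addToPronounce(String text, Modality modality) {
        pronounce.add(new Fragment(text, modality));
        return this;
    }

    /** Stand-in for the synthesizer: returns a description of what would be spoken. */
    String sayIt() {
        StringBuilder out = new StringBuilder();
        for (Fragment f : pronounce) {
            out.append('[').append(f.modality.voice)
               .append(" loudness=").append(f.modality.loudness)
               .append(" speed=").append(f.modality.speed)
               .append("] ").append(f.text).append('\n');
        }
        return out.toString();
    }
}

/** Hypothetical helper that simulates shipping a Say to the speech server. */
class SayDemo {
    /** Web server side: turn the instance into bytes for the wire. */
    static byte[] serialize(Say say) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(say);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Speech server side: recreate the instance from the received bytes. */
    static Say deserialize(byte[] wire) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
            return (Say) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Say say = new Say()
            .addToPronounce("Your temperature is normal.", new Modality("female", 1.0, 0.9))
            .addToPronounce("Please call back tomorrow.", new Modality("female", 1.2, 1.0));
        Say received = deserialize(serialize(say));
        System.out.print(received.sayIt());
    }
}
```

The compiler checks every call to addToPronounce, which is the compile-time guarantee argued for above; no second language is generated at run time. One caveat for any real deployment: deserializing Java objects received from an untrusted party is a well-known security risk, so the speech server would have to restrict which classes it accepts.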
The Woods ATNs (1970), which I mentioned earlier, had a clear separation between:

- the lexicon, in which the entries can be marked with the standard linguistic categories as well as with arbitrary semantic features, and
- the network that captures sophisticated grammars, allowing standard linguistic syntax rules.

VXML, in contrast, appears to be currently quite primitive, and I did not see a migration path to the ATN functionality. [I do like the support for reprompting/ help/ etc.]

Dialog control
--------------

The two previous components allow one to write linear sequences consisting of speech generation and speech recognition. Linear sequences are good because they have minimal complexity, are easy to write and easy to maintain. It would be attractive to limit the allowed forms even more, to just:

-- speech generation only
-- speech generation for a prompt request followed by a speech recognition component

More complex dialogs would then be the responsibility of the components residing at the server side and NOT of code in the speech server! This would remove the very ugly, unwieldy, horrendous control primitives (if, else, goto) as well as the need for yet another language (JavaScript). It would also obviate the need for the generation of truly nightmarish code.

Epilogue
--------

I have been asked what I mean by "pragmatics". The philosopher J. L. Austin delivered his landmark lectures "How to Do Things with Words" in 1955 (published in 1962). He stressed the importance of a 3rd dimension of language beyond syntax and semantics: the different ways in which language expressions can be USED: pragmatics. He worked on identifying different so-called performatives (stating, requesting, informing, querying, commanding, declaring, etc.) that could lead (perhaps) to a "grammar" of utterances. (German) linguists have made progress in this area, but I am not aware of comparable work in computational linguistics.

Here are two lines of input to a "smart" assistant:

assertion: I am hungry.
question: Do you know where my purse is?
{resolving the anaphoric reference is easy here}

Suppose we now ask the assistant: Why is he looking for his purse?

A system capable of answering, even 'stupidly', "to eat it" is beyond the state of the art (after many decades of NL research - as far as I know). By removing dialog control from VXML (or the like) one can relegate pragmatic choices to units at the server side, which is, for now, the proper choice, I believe ...

Apologies if this posting is total nonsense/ irrelevant/ a waste of bandwidth for your group.

Cheers anyway,
--
Dennis de Champeaux
OntoOO Inc
email: ddc@ontooo.com & ddc@acm.org
Page: 408 581 2185
Mesg: 408 559 7264
Address: 14519 Bercaw Ln, San Jose, CA 95124, USA

I am OK. You are OK? If in doubt: www.HealthCheck4Me.info
Received on Friday, 8 October 2004 19:53:49 UTC