my last thoughts about VXML from Dennis de Champeaux on 2004-10-08 (www-voice@w3.org from October to December 2004)

From: Dennis de Champeaux <ddc@cutter.rexx.com>
Date: Fri, 8 Oct 2004 12:53:44 -0700 (PDT)
To: scott.mcglashan@hp.com (McGlashan, Scott)
Cc: www-voice@w3.org, ddc@cutter.rexx.com (Dennis de Champeaux)
Message-Id: <200410081953.i98JriP67214@cutter.rexx.com>
My final thoughts about VXML

Before wrapping up with more comments, first some acknowledgements:
- Due to the hard work already done on VXML it was relatively easy to
provide a phone interface to the fairly complex self-diagnosis site
HealthCheck4Me.info
- I must also thank those who have replied to my earlier reported
experiences and comments.

It has been good to take some distance from my experiences so that I
can volunteer some "deeper" thoughts about VXML.  I DO realize now
that my earlier suggestions were too naive given the challenges in
this realm.  Below are some thoughts, mainly from a software system
development and maintenance perspective.  It is well known that the
costs of system development are just a fraction of the total life
cycle costs.  I will argue below that VXML is quite problematic from
this perspective.  I add some linguistic thoughts as well, although I
am quite unqualified in that domain.

Let me first distinguish three components of VXML:
- Speech generation
- Speech recognition
- Dialog control


Speech generation
-----------------

HTML as a text markup language has been the model for VXML as a
speech markup language.  Both of them are OK (apart from my earlier
comments about strange restrictions in VXML) when the text/speech is
static; i.e. NOT dynamically generated.  Static text can be easily
produced by any editor or by simple tools and thus its development
and maintenance are straightforward.

Having webservers generate HTML for dynamic text and VXML for dynamic
speech is quite problematic.  It requires writing correct code
that, when executed at a later stage, generates correct code in a
different language to be executed or interpreted in a remote
environment.  This is ideal from a hacker's and job protection
perspective, but is a solid disaster from the perspective of
development and maintenance.
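
To make the concern concrete, here is a minimal sketch of server-side
Java that emits a VXML page as a string.  The class and method names
(VxmlPage, escape, greeting) are hypothetical and only serve as an
illustration.  The Java compiler checks the Java, while the VXML inside
the string literals is checked by nobody until the remote interpreter
loads it.

```java
// Hypothetical illustration: a server-side helper that emits VXML as text.
final class VxmlPage {
    // Escapes the characters that would otherwise break the generated XML.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Builds a one-prompt VXML document for the given caller name.
    static String greeting(String callerName) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<vxml version=\"2.0\" xmlns=\"http://www.w3.org/2001/vxml\">\n"
             + "  <form>\n"
             + "    <block>\n"
             + "      <prompt>Hello " + escape(callerName) + "</prompt>\n"
             + "    </block>\n"
             + "  </form>\n"
             + "</vxml>\n";
    }
}
```

A misspelled closing tag in one of these literals would compile without
complaint and would only surface when the voice browser fetches the page.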

Incrementality and lack of foresight is the "justification" for how
dynamic HTML must now be generated.  It is quite unfortunate - and to
be very frank it is mysterious to me - that VXML has replicated this
historical disaster.  Here are two sketches for strawman alternatives,
to be replaced by something better:

(1) 
- Write a Java function SAY with as many arguments as needed that
produces an annotated expression that can be fed into a speech
synthesizer 
- Compile this function at the server side 
- Write Java (servlet) code that produces the actual arguments and
subsequently ships the arguments as well as the byte code of the SAY
function to the speech server
- The speech server obtains the byte code of the SAY function as well
as the arguments, applies the SAY function to the arguments and feeds
the result into the speech synthesizer
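
A minimal sketch of what such a SAY function could look like follows.
Everything beyond the idea of a SAY function is an assumption: the
class name SayFunction, the Fragment type, and the "voice" annotation
are invented here.  Shipping the byte code to the speech server is not
shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of alternative (1): a SAY function whose result is an
// annotated expression destined for a speech synthesizer.
final class SayFunction {
    // One annotated piece of speech; the "voice" annotation is an assumption.
    static final class Fragment {
        final String voice;
        final String text;

        Fragment(String voice, String text) {
            this.voice = voice;
            this.text = text;
        }

        @Override
        public String toString() {
            return "[" + voice + "] " + text;
        }
    }

    // SAY: accepts as many text arguments as needed and annotates each one
    // with the requested voice.
    static List<Fragment> say(String voice, String... texts) {
        List<Fragment> expression = new ArrayList<>();
        for (String text : texts) {
            expression.add(new Fragment(voice, text));
        }
        return expression;
    }
}
```

Servlet code would compute the actual arguments, for example
SayFunction.say("female", "Your temperature is", "38.5"), and a type
error in that call is caught by the compiler rather than by the speech
server.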

(2) 
- Create a serializable instance of a Java class Say and assume that
it has an attribute 'pronounce' whose value is a Vector.  Assume as
well an operation addToPronounce that takes two arguments: a string as
well as a modality (male/female, loudness, speed, etc.)  
- Add as many sentence (fragments) with the proper modality as needed
- Serialize the instance and ship it to the speech server 
- The speech server recreates the instance of Say and executes
an operation sayIt(), which does the proper things with the data in
the attribute 'pronounce'
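
Alternative (2) can be sketched as follows.  Only the names Say,
pronounce, addToPronounce and sayIt come from the description above.
The Utterance type, the representation of a modality as a plain string,
and the toBytes/fromBytes helpers are assumptions, and sayIt() merely
returns the text it would hand to a synthesizer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Vector;

// Sketch of alternative (2): a serializable Say object shipped to the
// speech server.
final class Say implements Serializable {
    private static final long serialVersionUID = 1L;

    // One sentence (fragment) together with its modality (hypothetical type).
    static final class Utterance implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        final String modality; // e.g. "female, slow"; a plain string here

        Utterance(String text, String modality) {
            this.text = text;
            this.modality = modality;
        }
    }

    private final Vector<Utterance> pronounce = new Vector<>();

    void addToPronounce(String text, String modality) {
        pronounce.add(new Utterance(text, modality));
    }

    // Stand-in for driving a real synthesizer: returns what would be spoken.
    String sayIt() {
        StringBuilder spoken = new StringBuilder();
        for (Utterance u : pronounce) {
            spoken.append('(').append(u.modality).append(") ")
                  .append(u.text).append('\n');
        }
        return spoken.toString();
    }

    // Server side: serialize the instance for shipping.
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        }
        return buffer.toByteArray();
    }

    // Speech server side: recreate the instance from the received bytes.
    static Say fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Say) in.readObject();
        }
    }
}
```

The server would call addToPronounce as often as needed and ship the
result of toBytes(); the speech server would call fromBytes() followed
by sayIt().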

The key advantage of both alternatives is that only a single language
needs to be used and that a compile time check ascertains syntactic
correctness.



Speech recognition
------------------

All problems described in the previous section apply here as well
and a similar alternative is recommended.

In addition, it seems to me (bear in mind I am NOT a {computational}
linguist) that the VXML speech recognition machinery does not scale to
more advanced processing.  Woods's augmented transition networks
(ATNs, 1970), which I mentioned earlier, had a clear separation
between:
- the lexicon, in which the entries can be marked with the standard
linguistic categories as well as with arbitrary semantic features, and
- the network that captures sophisticated grammars, allowing standard
linguistic syntax rules.
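
The separation can be illustrated with a deliberately tiny sketch.  It
is a plain transition network over word categories, NOT a full ATN (no
registers, no recursion, no tests or actions on arcs), and all names
(TinyGrammar, Entry, addWord, addArc, accepts) are hypothetical.  The
point is only that the lexicon and the network are two independent
structures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: lexicon and network kept as two separate structures.
final class TinyGrammar {
    // Lexicon entry: a linguistic category plus free-form semantic features.
    static final class Entry {
        final String category;
        final Set<String> features;

        Entry(String category, Set<String> features) {
            this.category = category;
            this.features = features;
        }
    }

    private final Map<String, Entry> lexicon = new HashMap<>();
    // Network: state -> (word category -> next state).
    private final Map<String, Map<String, String>> network = new HashMap<>();
    private final String startState;
    private final String acceptState;

    TinyGrammar(String startState, String acceptState) {
        this.startState = startState;
        this.acceptState = acceptState;
    }

    void addWord(String word, String category, String... features) {
        lexicon.put(word.toLowerCase(),
                    new Entry(category, new HashSet<>(Arrays.asList(features))));
    }

    void addArc(String from, String category, String to) {
        network.computeIfAbsent(from, key -> new HashMap<>()).put(category, to);
    }

    // Follows one arc per word; succeeds only if the accept state is reached.
    boolean accepts(String sentence) {
        String state = startState;
        for (String word : sentence.trim().toLowerCase().split("\\s+")) {
            Entry entry = lexicon.get(word);
            Map<String, String> arcs = network.get(state);
            if (entry == null || arcs == null) {
                return false;
            }
            state = arcs.get(entry.category);
            if (state == null) {
                return false;
            }
        }
        return state.equals(acceptState);
    }
}
```

New words are added to the lexicon without touching the network, and
new sentence patterns are added to the network without touching the
lexicon.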

VXML, in contrast, appears to be currently quite primitive, and I did
not see a migration path to the ATN functionality.  [I do like the
support for reprompting/ help/ etc.]


Dialog control 
--------------

The two previous components allow one to write linear sequences
consisting of speech generation and speech recognition.  Linear
sequences are
good because they have minimal complexity, are easy to write and easy
to maintain.  It would be attractive to limit the allowed forms even
more to just:
-- speech generation only
-- speech generation for a prompt request followed by a speech
recognition component 
More complex dialogs would then be the responsibility of the
components residing at the server side and NOT of code in the speech
server!
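
A minimal sketch of this division of labor, under stated assumptions:
SpeechRequest models the only two forms the speech server would have to
understand, and TriageDialog is an invented example of server-side
dialog logic.  Neither name, nor the example questions, comes from VXML
or from HealthCheck4Me.info.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// The only two request forms the speech server would need to handle.
final class SpeechRequest {
    final String speak;          // text to be generated as speech
    final List<String> expected; // empty: speak only; otherwise: answers to recognize

    private SpeechRequest(String speak, List<String> expected) {
        this.speak = speak;
        this.expected = expected;
    }

    // Form 1: speech generation only.
    static SpeechRequest sayOnly(String text) {
        return new SpeechRequest(text, Collections.<String>emptyList());
    }

    // Form 2: a prompt followed by recognition of one of the expected answers.
    static SpeechRequest ask(String prompt, String... expected) {
        return new SpeechRequest(
            prompt, Collections.unmodifiableList(Arrays.asList(expected)));
    }

    boolean expectsAnswer() {
        return !expected.isEmpty();
    }
}

// Hypothetical server-side dialog: all branching lives in ordinary Java.
final class TriageDialog {
    // lastAnswer is null at the start of the call.
    static SpeechRequest next(String lastAnswer) {
        if (lastAnswer == null) {
            return SpeechRequest.ask("Do you have a fever?", "yes", "no");
        }
        if (lastAnswer.equals("yes")) {
            return SpeechRequest.sayOnly("Please measure your temperature.");
        }
        return SpeechRequest.sayOnly("Good. No further questions.");
    }
}
```

The speech server would execute one SpeechRequest at a time and report
the recognized answer back; it never sees a conditional or a jump.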

This would remove the very ugly, unwieldy, horrendous control
primitives (if, else, goto) as well as the need for yet another
language (JavaScript).  This would obviate the need for the
generation of truly nightmarish code.


Epilogue
--------

I have been asked what I mean by "pragmatics".  The philosopher
J. L. Austin gave a landmark series of lectures in 1955, published in
1962 as "How to Do Things with Words".  He stressed the importance of
a 3rd dimension of language beyond syntax and semantics: the different
ways in which language expressions can be USED: pragmatics.  He worked
on identifying different so-called performatives (stating, requesting,
informing, querying, commanding, declaring, etc.) that could lead
(perhaps) to a "grammar" of utterances.

(German) linguists have made progress in this area, but I am not aware
of comparable work in computational linguistics.  Here are two lines of
input to a "smart" assistant:
assertion: I am hungry.
question: Do you know where my purse is? {resolving the anaphoric
reference is easy here}
Suppose we now ask the assistant: Why is he looking for his purse?
A system capable of answering 'stupidly' "to eat it" is beyond the
state of the art (after many decades of NL research - as far as I
know).

By removing dialog control from VXML (or the like) one can relegate
pragmatic choices to units at the server side, which is, for now, the
proper choice I believe ...

Apologies if this posting is total nonsense/ irrelevant/ a waste of
bandwidth for your group.

Cheers anyway,

-- 
Dennis de Champeaux   OntoOO Inc  email: ddc@ontooo.com & ddc@acm.org
Page: 408 581 2185    Mesg: 408 559 7264
Address: 14519 Bercaw Ln, San Jose, CA 95124, USA

I am OK. You are OK?  If in doubt: www.HealthCheck4Me.info
Received on Friday, 8 October 2004 19:53:49 UTC