Comments on VoiceXML 2.0

All -

As part of the MITRE Corporation's participation in the DARPA
Communicator program, we have joined and actively followed the
activities of the W3C Voice Browsers working group. We have tremendous
respect for the effort and time invested by the participants in this
work. We have tracked the development of all of the working drafts
issued by the W3CVB, and provided feedback to the committee on the
details of these working drafts.

However, up to this point, we have offered these comments mostly as
observers. There have been a number of issues which we did not feel
comfortable addressing, mostly because we will not be producing a
product based on the W3CVB standards. We now recognize that this
position was far too narrow, and that as a corporation chartered in
the public interest, we have obligations both to the potential users
of such products and to our US Government sponsors.

In recognition of these responsibilities, we offer the following
comments about the VoiceXML 2.0 specification. Up to now we have been
quite impressed with the usefulness of the proposed standards and the
clarity of thought that lay behind them. The VoiceXML 2.0
specification, on the other hand, leaves us disappointed, for two
important reasons. First, we find the current patent encumbrances
profoundly inappropriate, and second, we believe that the combination
of markup and code exhibited in the specification represents a serious
design flaw. We elaborate on these objections below.

Patent encumbrances

While we recognize that the W3CVB has intentionally chosen to release
the VoiceXML 2.0 draft without resolving the issue of patent
encumbrances, we must emphasize that we find the so-called RAND
("reasonable and non-discriminatory") conditions incompatible with the
goals of an international standards body committed to open access. At
the very minimum, the W3C ought to insist that all patent claimants
pledge that the relevant IP be made available on a royalty-free
basis. We reach this conclusion as both a contractor for the US
Government and a representative private corporation which may, in the
future, choose to exploit the technology built on this proposed
standard.

Our primary interest here is that W3C Recommendations explicitly
guarantee the ability to write and distribute free, open source
software based on those Recommendations. RAND does not do this. In
particular, as Daniel Weitzner, the chairman of the W3C Patent Policy
Working Group, acknowledged in an interview on Slashdot, "so far [the
PPWG has] decided that we do not have a good mechanism within W3C for
assessing or defining reasonableness".

The consequence of this problem is that so-called "reasonable" royalty
requirements, as long as they actually require money to be paid, can
be used as a form of de facto discrimination, even if the distribution
conditions are "non-discriminatory". To drive the point home: RAND
doesn't guarantee that patent holders would make the appropriate
financial arrangements to allow the development of open-source
implementations of the W3C Recommendations. In point of fact, it would
arguably not be in the patent holder's best interest to do so.

The absence of a guarantee of the possibility of an open-source
implementation has two important consequences. First, if the W3C
cannot guarantee royalty-free access, there is a strong likelihood
that the acceptance of the standard would be compromised, and even the
possibility that an open-source "fork" might develop. Second, both as
a US Government contractor and a potential consumer of this
technology, we don't believe that MITRE can support a standard which
would fail to guarantee the possibility of free, open-source
alternatives to commercial tools, since the potential cost both to the
US Government and to the private sector could be considerable. It's
not that we believe that the world has a limitless right to
open-source software alternatives; but we do believe that it's
profoundly inappropriate for a standards organization like the W3C to 
render it impossible to produce a free conforming implementation. 

It's worth pointing out that all of the tools which have driven the
success of the WWW to date - HTML, HTTP, XML, the Apache Web server -
have been patent-free, and it's easy to imagine that if any of
these technologies had been patent-encumbered, and stood as the only
possible option for the functionality desired, the WWW (and the W3C)
might have been dead in the water from the beginning. From our point
of view, guarantee of royalty-free access is an absolute requirement
for this and all other W3C Recommendations.

Technical issues

In general, the working drafts issued by the W3CVB achieve an
admirable level of quality and thoughtfulness. However, we believe
that there is a fundamental problem with the VoiceXML 2.0
specification that must be addressed before the W3C approves it.

The central problem with the Voice Extensible Markup Language is 
that in some ways, it's trying to be markup, and in other ways, it's 
trying to be code. Because of the tradeoffs made in order to accomplish 
both these goals, it simply does not meet basic design criteria for
programming languages, such as consistency and readability.

That VoiceXML is partially code isn't hard to demonstrate. There are
tags for transfer of control (<throw>, <catch>, <goto>, although the
scope of throw/catch is local, not global), variable declaration and
value assignment (<var>, <assign>), conditionals (<if>, <elseif>,
<else>), and what amounts to function calls on a stack (<subdialog>,
<return>).

You can also declare chunks of actual code using the <script> tag, but
you can't pick your scripting language, as you can in HTML; ECMAScript
is explicitly required by the specification, and the specification
will not work without it. For instance, the <if> tag has a "cond"
attribute for the condition of the conditional, which is specified to
be ECMAScript code; similarly for the <assign> tag:

<if cond="flavor == 'vanilla'">
  <assign name="flavor_code" expr="'v'"/>
<elseif cond="flavor == 'chocolate'"/>
  <assign name="flavor_code" expr="'h'"/>
<elseif cond="flavor == 'strawberry'"/>
  <assign name="flavor_code" expr="'b'"/>
<else/>
  <assign name="flavor_code" expr="'?'"/>
</if>

Certainly, part of VoiceXML 2.0 deals with semantic data descriptions,
such as the prompts and other properties associated with voice-driven
menus. But at its heart, this specification reaches far beyond typical
applications of markup, and represents an attempt to use a
markup language as a programming language. But since the markup
language is inadequate, it needs to be augmented with an actual
programming language (in this case, ECMAScript), and careful
correspondence rules between the markup language and the programming
language need to be made, and the specification needs to be salted
with special flow-of-control rules about the interpretation of the
MARKUP (such as the local scope of <throw>) which are properly the
domain of a programming language.

To make matters worse, the reserved symbols of the programming
language are subordinate to the reserved symbols of the markup
language. Here's a quote from the VoiceXML 2.0 specification:

    The expression language used in cond and expr is precisely
    ECMAScript. Note that the cond operators "<", "<=", and "&&" must
    be escaped in XML (to "&lt;" and so on). For clarity, examples in
    this document do not use XML escapes. 

Note "for clarity"; the specification doesn't even use its actual
syntax in its examples. This readability issue with reserved symbols
is a symptom of the problem. XML was originally developed as an
extensible, textual semantic description of data which had the
flexibility of SGML without many of the expensive details. As such,
XML was designed to be easily parsed and processed, but not to be
READ. Unfortunately, visual programming techniques have not progressed
to the point where VoiceXML programs could be constructed and digested
by programmers without reading the actual text, and the actual text is
illegible: it's a mishmash of markup and code, where the code is
compromised by the reserved symbols of the markup language. Let's
compare a couple of examples:

VoiceXML:

<if cond="i &lt; 1">
  <assign name="i" expr="i+1"/>
</if>

ECMAScript alone:

if (i < 1) {
  i = i + 1
}

Exporting the flow of control into the markup language provides no
discernible benefit to the code being written, and makes the resulting
program (and it IS a program) significantly harder to read.
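
To illustrate the escaping burden described in the passage quoted from
the specification, here is a minimal sketch of what an author (or an
authoring tool) must do to a slightly larger condition before it can be
placed in a "cond" attribute. The helper name escapeForXmlAttribute and
the sample condition are our own inventions for illustration; they do
not appear in the specification.

```javascript
// Hypothetical helper (not part of VoiceXML or ECMAScript): escape an
// ECMAScript expression so it can sit inside a double-quoted XML attribute.
function escapeForXmlAttribute(expr) {
  return expr
    .replace(/&/g, "&amp;")   // must run first, or later entities get re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// An ordinary ECMAScript condition, as a programmer would write it.
var condition = "i < 10 && j <= limit";

// The same condition as it must appear in the markup.
var escaped = escapeForXmlAttribute(condition);
console.log('<if cond="' + escaped + '">');
// prints: <if cond="i &lt; 10 &amp;&amp; j &lt;= limit">
```

The longer the condition, the further the text in the document drifts
from the expression the programmer actually has in mind.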

Furthermore, the specification doesn't seem to have been consistently
designed. There are a variety of ways of handling variable binding and
assignment:

- The <assign> tag is illustrated above.

- The <param> tag is used to pass values to a <subdialog>. However,
  its "name" attribute specifies the name that the SUBDIALOGUE should
  know it by, and the value in the subdialogue is specified by either
  an "expr" or a "value" attribute on <param>. In other words, this
  process differs significantly from function calls in a typical
  programming language, because the caller, rather than the callee,
  determines the name of the local variable in the callee.

- The <subdialog> tag also contains a "name" attribute, which declares
  the name of the value returned from the subdialogue. The
  subdialogue, in turn, has a <return> tag with a "namelist"
  attribute; the namelist is a space-separated list of local names
  whose values should be returned. The way these match up (a single
  name in the caller, a list of names in the callee) is that the
  elements in the namelist end up being ECMAScript attributes of an
  ECMAScript object stored in the parent's name. So if the caller has
  'name="foo"', and the callee's <return> has 'namelist="bar baz"',
  the caller will know the two return values as foo.bar and
  foo.baz. This is the case even when there's a single value
  returned.

- The <throw> tag has two attributes, "event" and "message", which
  correspond to the error class and error description in nonlocal
  control constructs in programming languages. However, the <catch>
  tag doesn't bind these attributes explicitly; there are two special
  variables, "_event" and "_message", which are set up when a <catch>
  is invoked. 
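
The <param> and <return> conventions described in the list above can be
modeled in a few lines of ECMAScript. This is only a sketch of the
behavior as we read it, not an implementation of the specification; the
function callSubdialog and everything passed to it are hypothetical
names of our own.

```javascript
// Hypothetical model of a <subdialog> call.
//   params   - values the CALLER names on the callee's behalf (<param>)
//   body     - the callee; it reads and writes its local variables
//   namelist - the callee's <return namelist="...">, space-separated
function callSubdialog(params, body, namelist) {
  var locals = {};
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key)) {
      locals[key] = params[key];   // caller-chosen name becomes a callee local
    }
  }
  body(locals);

  // Returned names become properties of one object, which the caller
  // stores under the single name given on its <subdialog> tag.
  var result = {};
  var names = namelist.split(" ");
  for (var i = 0; i < names.length; i++) {
    result[names[i]] = locals[names[i]];
  }
  return result;
}

// Caller side: name="foo", with one <param name="flavor" .../>.
var foo = callSubdialog(
  { flavor: "vanilla" },
  function (locals) {
    locals.bar = locals.flavor.toUpperCase();
    locals.baz = locals.flavor.length;
  },
  "bar baz"
);
console.log(foo.bar, foo.baz);   // prints: VANILLA 7

// A single returned name is wrapped in an object in the same way.
var single = callSubdialog({}, function (locals) { locals.bar = 1; }, "bar");
console.log(single.bar);         // prints: 1
```

Contrast this with an ordinary function call, where the callee declares
its own parameter names and a single return value needs no wrapper.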

It's impossible to avoid the conclusions that (a) with all these
control structures, this is a programming language, and (b) as a
programming language, its design suffers from significant problems
of clarity and coherence.

Finally, we come to the issue of extensibility. One of the great
successes of many previous W3C specifications is that they have been
tremendously resilient in the face of extensions, and have done a good
job of anticipating the variety of uses they might be put to. In
general, the W3C specifications have achieved this by identifying and
concentrating on the fundamental building blocks and thus constructing
a "toolkit" out of which many applications can be built. XML is an
exceptional example of such a success. The VoiceXML 2.0 specification,
regrettably, is not. For example, it has anticipated some throw/catch
behavior with tags like <help> and <noinput>, but left the rest to be
defined as ECMAScript classes, some predefined, some
user-defined. This represents the beginnings of using ECMAScript as a
"grab bag" for whatever doesn't fit in the XML tags, and we predict
that this will be the typical path for evolution in this
specification. Under these circumstances, the XML tags themselves,
especially for flow of control, will look more and more like
impediments and design mistakes, rather than the manifestation of
forward-looking design.

The solution, we believe, is to acknowledge that the flow of control
issues that VoiceXML is trying to address are simply not the domain of
markup languages, but rather programming languages. What is needed is
a specification which clearly separates the data description issues
(such as the configuration of voice menus) from the flow of control
(which ought to be left to programming languages which have actually
been designed as such). Once we do this, large portions of the
specification become superfluous (such as those which specify the
relationship between the scoping rules of ECMAScript and of
VoiceXML).
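
As a rough sketch of the separation we have in mind, the flavor example
from earlier in this message might be divided as follows. The menu
structure, the property names and the function flavorCode are purely
illustrative assumptions; in practice the data description could just
as well be an XML document, and it is written as an object literal here
only to keep the example self-contained.

```javascript
// Data description only: a hypothetical voice menu with no flow of control.
var flavorMenu = {
  prompt: "Which flavor would you like?",
  choices: { vanilla: "v", chocolate: "h", strawberry: "b" }
};

// Flow of control stays in the programming language.
// `recognized` stands in for whatever a speech recognizer returned.
function flavorCode(menu, recognized) {
  if (Object.prototype.hasOwnProperty.call(menu.choices, recognized)) {
    return menu.choices[recognized];
  }
  return "?";   // same fallback code as the <else/> branch shown earlier
}

console.log(flavorCode(flavorMenu, "chocolate"));   // prints: h
console.log(flavorCode(flavorMenu, "pistachio"));   // prints: ?
```

Nothing in the data description needs conditionals, assignment or
escaping rules, and nothing in the code needs to be expressed as tags.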

Note that our recommendation implies that, since the W3C really isn't
in the business of designing programming languages, there's a large
chunk of dialogue processing which is simply outside the scope of the
W3C charter. We believe that this is a perfectly appropriate outcome;
there's no reason to believe that all of dialogue falls under the W3C
purview. In addition, this specification contrasts rather severely
with the other uses that the W3CVB has put XML to: static data
descriptions (e.g., the speech recognition grammar specification) and
dynamic data descriptions (e.g., the speech synthesis specification,
which describes markup for input to a speech synthesizer). As we said
above, we recognize that there are some data descriptions in the
specification; but the attempt in VoiceXML 2.0 to unify them with the
flow-of-control requirements of dialogue is, to our mind,
fundamentally flawed.

Conclusion

We have two primary objections to the VoiceXML 2.0 specification in
its current form: its questionable status with regard to free
implementations, and its flawed design as a programming language. We
do not believe that the W3C should endorse the VoiceXML 2.0
specification without significant revisions which address both these
objections.

Respectfully submitted,

Samuel Bayer
John Aberdeen
Bryan George
Alan Goldschen
Lynette Hirschman
Bede McCall

The MITRE Corporation

Received on Friday, 9 November 2001 04:51:03 UTC