RE: Comments on VoiceXML 2.0

Samuel,

Below is our official reply to your comments. We apologize for the
delay.

If you are not satisfied with the reply, or want more information, let
me know.

Scott

(Chairman, W3C VB Dialog Committee)

------------------------------------------------------------------------

Thank you for your email "Comments on VoiceXML 2.0"

http://lists.w3.org/Archives/Public/www-voice/2001OctDec/0034.html


In your email you indicate that you are disappointed by VoiceXML
2.0 for two reasons:

1. Current patent encumbrances are profoundly inappropriate.

2. The combination of markup and code exhibited in the specification
represents a serious design flaw.

You then elaborate on these reasons. Our response is given below.


Patent Response
---------------

You note that you find the current patent encumbrances profoundly
inappropriate, and that you would like W3C to insist that all patent
claimants pledge that the relevant IP be made available on a
royalty-free basis. You don't specify what should happen in the
event that such pledges aren't made.

On 1 March 2001, the W3C Advisory Board suggested that the W3C Team
begin to implement the draft patent policies of the Patent Policy
Working Group. Concerns raised by W3C Members led to W3C launching
a Voice Browser Patent Advisory Group in May 2001, following the
procedures outlined in the draft patent policies.  All organizations
participating in the Voice Browser working group were required to
provide patent disclosure statements, see:

  http://www.w3.org/2001/09/voice-disclosures.html

This revealed essential claims that were offered on a non-RF basis.
After considerable discussion, the PAG was unable to reach a consensus
on the formal recommendation it should offer to W3C.  A majority of
participants were in favor of W3C proceeding with a RAND specification
for VoiceXML, while a minority insisted that either a royalty free
specification be produced or W3C should cease work on VoiceXML.

The matter was referred to the W3C Management Team, which has not yet
made a decision on the PAG result; such a decision was difficult
because of the ongoing work in the W3C Patent Policy WG and on the
"Current Patent Practices" document. However, the pieces needed for a
decision now seem to be slowly coming into place. In response to
requests to clarify current W3C patent practice, a document was
prepared by the W3C Team and reviewed by the W3C Advisory Board. In
January, the Advisory Board recommended that this be published as a
W3C Note, see:

  http://www.w3.org/TR/2002/NOTE-patent-practice-20020124

The current practice described in the Note allows for specifications
to be produced on RAND terms, but makes it clear that there is neither
clear support amongst the Membership for producing RAND specifications
nor a process for doing so.  Therefore if a PAG makes a recommendation
to proceed on RAND terms, Advisory Committee review and Director's
decision will be required. It is also possible that a PAG could
recommend that the work be taken to another organization.


Technical Response
------------------

The mixing of declarative markup with procedural code features, such as
elements for control transfer (e.g. <goto>), variable declaration and
assignment (e.g. <assign>), conditionals (e.g. <if>, <else>) and
'function calls' (e.g. <subdialog>), was already part of VoiceXML 1.0,
which was submitted to W3C in May 2000. VoiceXML 2.0 is based on
VoiceXML 1.0.

VoiceXML represents considerable prior experience and expertise from
those developing and deploying dialog languages and systems. As was
pointed out in the public response by Nils Klarlund on Friday 9th
November 2001:

http://lists.w3.org/Archives/Public/www-voice/2001OctDec/0035.html

many of the issues you raise were discussed during the creation of
VoiceXML 1.0. As Nils points out, VoiceXML is a compromise "that
attempts to merge the declarative aspect of dialogues with the
need for computational meaning".

Like HTML for visual interaction, a markup language for dialog
interaction needs to address issues of execution as well as data
declaration. VoiceXML follows the XML Events model in terms of
how events are handled (analogous to HTML's use of the DOM Level 2
event model). Just as HTML allows scripting elements within the
language, so VoiceXML allows ECMAScript via its <script> element.
However, VoiceXML also has a number of elements, such as <var>,
<if>, <assign> and <goto>, as part of its 'executable content', and the
existence of these elements can, as you correctly point out, be seen as
a weakness in the language.  The same functions could be achieved by
allowing ECMAScript to have additional bindings which access elements
in a Document Object Model of VoiceXML. It is unclear whether this
approach would make the application developer's job easier, but it would
certainly require fundamental changes in a language which, we believe,
already has a judicious blend of the declarative and procedural suitable
for many dialog applications.

As was agreed during W3C Voice Browser meetings and teleconferences
(which participants from Mitre attended), the technical goals of
VoiceXML 2.0 were limited to ensuring interoperability, clarification 
and solidification of VoiceXML 1.0, an existing and deployed industry
specification. We also established a process by which members of the
group could submit requests for changes in the specification. Requests
are discussed and acted upon when they meet these goals. Your specific
points regarding inconsistencies in the procedural aspects of VoiceXML
can be addressed through this process. However, making fundamental
changes in the language, as would be required to address
declarative/procedural mixing, was not one of the goals for VoiceXML
2.0. We will certainly consider ways of addressing this issue, but in
later versions of the language.

In conclusion, we do not believe that the current design of VoiceXML
2.0 "suffers from significant problems of clarity and coherence" that
will impede its widespread use. There are already many
companies and individuals developing and commercially deploying VoiceXML
2.0 platforms and applications, and we believe their number will
increase with VoiceXML 2.0 as a W3C Recommendation.

-----Original Message-----
From: Samuel L. Bayer [mailto:sam@mitre.org]
Sent: 08 November 2001 21:33
To: www-voice@w3.org
Cc: djweitzner@w3.org
Subject: Comments on VoiceXML 2.0



All -

As part of the MITRE Corporation's participation in the DARPA
Communicator program, we have joined and actively followed the
activities of the W3C Voice Browsers working group. We have tremendous
respect for the effort and time invested by the participants in this
effort. We have tracked the development of all of the working drafts
issued by the W3CVB, and provided feedback to the committee on the
details of these working drafts.

However, up to this point, we have offered these comments mostly as
observers. There have been a number of issues which we did not feel
comfortable addressing, mostly because we will not be producing a
product based on the W3CVB standards. We now recognize that this
position was far too narrow, and that as a corporation chartered in
the public interest, we have obligations both to the potential users
of such products and to our US Government sponsors.

In recognition of these responsibilities, we offer the following
comments about the VoiceXML 2.0 specification. Up to now we have been
quite impressed with the usefulness of the proposed standards and the
clarity of thought that lay behind them. The VoiceXML 2.0
specification, on the other hand, leaves us disappointed, for two
important reasons. First, we find the current patent encumbrances
profoundly inappropriate, and second, we believe that the combination
of markup and code exhibited in the specification represents a serious
design flaw. We elaborate on these objections below.

Patent encumbrances
-------------------

While we recognize that the W3CVB has intentionally chosen to release
the VoiceXML 2.0 draft without resolving the issue of patent
encumbrances, we must emphasize that we find the so-called RAND
("reasonable and non-discriminatory") conditions incompatible with the
goals of an international standards body committed to open access. At
the very minimum, the W3C ought to insist that all patent claimants
pledge that the relevant IP be made available on a royalty-free
basis. We reach this conclusion as both a contractor for the US
Government and a representative private corporation which may, in the
future, choose to exploit the technology built on this proposed
standard.

Our primary interest here is that W3C Recommendations explicitly
guarantee the ability to write and distribute free, open source
software based on those Recommendations. RAND does not do this. In
particular, as Daniel Weitzner, the chairman of the W3C Patent Policy
Working Group, acknowledged in an interview on Slashdot, "so far [the
PPWG has] decided that we do not have a good mechanism within W3C for
assessing or defining reasonableness".

The consequence of this problem is that so-called "reasonable" royalty
requirements, as long as they actually require money to be paid, can
be used as a form of de facto discrimination, even if the distribution
conditions are "non-discriminatory". To drive the point home: RAND
doesn't guarantee that patent holders would make the appropriate
financial arrangements to allow the development of open-source
implementations of the W3C Recommendations. In point of fact, it would
arguably not be in the patent holder's best interest to do so.

The absence of a guarantee of the possibility of an open-source
implementation has two important consequences. First, if the W3C
cannot guarantee royalty-free access, there is a strong likelihood
that the acceptance of the standard would be compromised, and even the
possibility that an open-source "fork" might develop. Second, both as
a US Government contractor and a potential consumer of this
technology, we don't believe that MITRE can support a standard which
would fail to guarantee the possibility of free, open-source
alternatives to commercial tools, since the potential cost both to the
US Government and to the private sector could be considerable. It's
not that we believe that the world has a limitless right to
open-source software alternatives; but we do believe that it's
profoundly inappropriate for a standards organization like the W3C to 
render it impossible to produce a free conforming implementation. 

It's worth pointing out that all of the tools which have driven the
success of the WWW to date - HTML, HTTP, XML, the Apache Web server -
have been patent-free, and it's easy to imagine that if any of
these technologies had been patent-encumbered, and stood as the only
possible option for the functionality desired, the WWW (and the W3C)
might have been dead in the water from the beginning. From our point
of view, guarantee of royalty-free access is an absolute requirement
for this and all other W3C Recommendations.

Technical issues
----------------

In general, the working drafts issued by the W3CVB achieve an
admirable level of quality and thoughtfulness. However, we believe
that there is a fundamental problem with the VoiceXML 2.0
specification that must be addressed before the W3C approves it.

The central problem with the Voice Extensible Markup Language is 
that in some ways, it's trying to be markup, and in other ways, it's 
trying to be code. Because of the tradeoffs made in order to accomplish 
both these goals, it simply does not meet basic design criteria for
programming languages, such as consistency and readability.

That VoiceXML is partially code isn't hard to demonstrate. There are
tags for transfer of control (<throw>, <catch>, <goto>, although the
scope of throw/catch is local, not global), variable declaration and
value assignment (<var>, <assign>), conditionals (<if>, <elseif>,
<else>), and what amounts to function calls on a stack (<subdialog>,
<return>).

You can also declare chunks of actual code using the <script> tag, but
you can't pick your scripting language, as you can in HTML; ECMAScript
is explicitly required by the specification, and the specification
will not work without it. For instance, the <if> tag has a "cond"
attribute for the condition of the conditional, which is specified to
be ECMAScript code; similarly for the <assign> tag:

<if cond="flavor == 'vanilla'"> 
  <assign name="flavor_code" expr="'v'"/> 
<elseif cond="flavor == 'chocolate'"/> 
  <assign name="flavor_code" expr="'h'"/> 
<elseif cond="flavor == 'strawberry'"/> 
  <assign name="flavor_code" expr="'b'"/> 
<else/> 
  <assign name="flavor_code" expr="'?'"/> 
</if>

Certainly, part of VoiceXML 2.0 deals with semantic data descriptions,
such as the prompts and other properties associated with voice-driven
menus. But at its heart, this specification reaches far beyond typical
applications of markup, and represents an attempt to use a
markup language as a programming language. But since the markup
language is inadequate, it needs to be augmented with an actual
programming language (in this case, ECMAScript), and careful
correspondence rules between the markup language and the programming
language need to be made, and the specification needs to be salted
with special flow-of-control rules about the interpretation of the
MARKUP (such as the local scope of <throw>) which are properly the
domain of a programming language.

To make matters worse, the reserved symbols of the programming
language are subordinate to the reserved symbols of the markup
language. Here's a quote from the VoiceXML 2.0 specification:

    The expression language used in cond and expr is precisely
    ECMAScript. Note that the cond operators "<", "<=", and "&&" must
    be escaped in XML (to "&lt;" and so on). For clarity, examples in
    this document do not use XML escapes. 

Note "for clarity"; the specification doesn't even use its actual
syntax in its examples. This readability issue with reserved symbols
is a symptom of the problem. XML was originally developed as an
extensible, textual semantic description of data which had the
flexibility of SGML without many of the expensive details. As such,
XML was designed to be easily parsed and processed, but not to be
READ. Unfortunately, visual programming techniques have not progressed
to the point where VoiceXML programs could be constructed and digested
by programmers without reading the actual text, and the actual text is
illegible: it's a mishmash of markup and code, where the code is
compromised by the reserved symbols of the markup language. Let's
compare a couple of examples:

VoiceXML:

<if cond="i &lt; 1"> 
  <assign name="i" expr="i+1"/> 
</if>

ECMAScript alone:

if (i < 1) {
  i = i + 1
}

Exporting the flow of control into the markup language provides no
discernible benefit to the code being written, and makes the resulting
program (and it IS a program) significantly harder to read.
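
As a rough sketch of what the escaping rule quoted above asks of an
author, the following ECMAScript fragment escapes an expression so that
it can be placed in a "cond" attribute. The helper name
escapeXmlAttribute is invented here for illustration; it is not defined
by VoiceXML or by ECMAScript, and a real authoring tool would presumably
rely on an XML library instead.

```javascript
// Illustrative only: escapeXmlAttribute is a hypothetical helper, not part
// of the VoiceXML specification or of any ECMAScript library.
function escapeXmlAttribute(expr) {
  return expr
    .replace(/&/g, "&amp;")   // must come first, or later escapes get re-escaped
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// The expression as a programmer would naturally write it.
const cond = "i < 1 && j <= 2";

// The same expression as it has to appear inside the markup.
const markup = '<if cond="' + escapeXmlAttribute(cond) + '">';
console.log(markup);
// prints: <if cond="i &lt; 1 &amp;&amp; j &lt;= 2">
```

Every comparison or logical operator in the condition picks up an
entity reference, so the written form drifts further from the
ECMAScript it stands for as the condition grows.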

Furthermore, the specification doesn't seem to have been consistently
designed. There are a variety of ways of handling variable binding and
assignment:

- The <assign> tag is illustrated above.

- The <param> tag is used to pass values to a <subdialog>. However,
  its "name" attribute specifies the name that the SUBDIALOGUE should
  know it by, and the value in the subdialogue is specified by either
  an "expr" or a "value" attribute on <param>. In other words, this
  process differs significantly from function calls in a typical
  programming language, because the caller, rather than the callee,
  determines the name of the local variable in the callee.

- The <subdialog> tag also contains a "name" attribute, which declares
  the name of the value returned from the subdialogue. The
  subdialogue, in turn, has a <return> tag with a "namelist"
  attribute; the namelist is a space-separated list of local names
  whose values should be returned. The way these match up (a single
  name in the caller, a list of names in the callee) is that the
  elements in the namelist end up being ECMAScript attributes of an
  ECMAScript object stored in the parent's name. So if the caller has
  'name="foo"', and the callee's <return> has 'namelist="bar baz"',
  the caller will know the two return values as foo.bar and
  foo.baz. This is the case even when there's a single value
  returned.

- The <throw> tag has two attributes, "event" and "message", which
  correspond to the error class and error description in nonlocal
  control constructs in programming languages. However, the <catch>
  tag doesn't bind these attributes explicitly; there are two special
  variables, "_event" and "_message", which are set up when a <catch>
  is invoked. 
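
The naming conventions in the list above can be mimicked in plain
ECMAScript. This is a loose model under stated assumptions, not a
description of how a VoiceXML interpreter is implemented: callSubdialog
and returnNamelist are hypothetical names, and the subdialogue body is
represented as an ordinary function.

```javascript
// Hypothetical model of the <param> / <subdialog> / <return> naming rules.
// None of these function names come from the VoiceXML specification.

// <return namelist="bar baz">: copy the listed local names into one object.
function returnNamelist(locals, namelist) {
  const result = {};
  for (const name of namelist.split(/\s+/).filter(Boolean)) {
    result[name] = locals[name];
  }
  return result;
}

// <subdialog name="foo"> with <param> children: the CALLER chooses the
// names under which the callee sees its arguments.
function callSubdialog(params, body) {
  const locals = Object.assign({}, params); // each <param name> becomes a callee local
  return body(locals);
}

// Caller side, standing in for <param name="limit" expr="3"/>.
const foo = callSubdialog({ limit: 3 }, function (locals) {
  locals.bar = locals.limit + 1;
  locals.baz = "done";
  return returnNamelist(locals, "bar baz");
});

// Even a single return value would be reached as a property of foo.
console.log(foo.bar, foo.baz);
// prints: 4 done
```

In this model the caller fixes the callee's variable name ("limit"),
while the callee fixes the property names ("bar", "baz") that the caller
must then read through its own name ("foo").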

It's impossible to avoid the conclusions that (a) with all these
control structures, this is a programming language, and (b) as a
programming language, its design suffers from significant problems
of clarity and coherence.

Finally, we come to the issue of extensibility. One of the great
successes of many previous W3C specifications is that they have been
tremendously resilient in the face of extensions, and have done a good
job of anticipating the variety of uses they might be put to. In
general, the W3C specifications have achieved this by identifying and
concentrating on the fundamental building blocks and thus constructing
a "toolkit" out of which many applications can be built. XML is an
exceptional example of such a success. The VoiceXML 2.0 specification,
regrettably, is not. For example, it has anticipated some throw/catch
behavior with tags like <help> and <noinput>, but left the rest to be
defined as ECMAScript classes, some predefined, some
user-defined. This represents the beginnings of using ECMAScript as a
"grab bag" for whatever doesn't fit in the XML tags, and we predict
that this will be the typical path for evolution in this
specification. Under these circumstances, the XML tags themselves,
especially for flow of control, will look more and more like
impediments and design mistakes, rather than the manifestation of
forward-looking design.

The solution, we believe, is to acknowledge that the flow of control
issues that VoiceXML is trying to address are simply not the domain of
markup languages, but rather programming languages. What is needed is
a specification which clearly separates the data description issues
(such as the configuration of voice menus) from the flow of control
(which ought to be left to programming languages which have actually
been designed as such). Once we do this, large portions of the
specification become superfluous (such as those which specify the
relationship between the scoping rules of ECMAScript and of
VoiceXML). 

Note that our recommendation implies that, since the W3C really isn't
in the business of designing programming languages, there's a large
chunk of dialogue processing which is simply outside the scope of the
W3C charter. We believe that this is a perfectly appropriate outcome;
there's no reason to believe that all of dialogue falls under the W3C
purview. In addition, this specification contrasts rather severely
with the other uses that the W3CVB has put XML to: static data
descriptions (e.g., the speech recognition grammar specification) and
dynamic data descriptions (e.g., the speech synthesis specification,
which describes markup for input to a speech synthesizer). As we said
above, we recognize that there are some data descriptions in the
specification; but the attempt in VoiceXML 2.0 to unify them with the
flow-of-control requirements of dialogue is, to our mind,
fundamentally flawed.

Summary
-------

We have two primary objections to the VoiceXML 2.0 specification in
its current form: its questionable status with regard to free
implementations, and its flawed design as a programming language. We
do not believe that the W3C should endorse the VoiceXML 2.0
specification without significant revisions which address both these
issues. 

Respectfully submitted,

Samuel Bayer
John Aberdeen
Bryan George
Alan Goldschen
Lynette Hirschman
Bede McCall

The MITRE Corporation

Received on Saturday, 16 February 2002 18:53:25 UTC