- From: Samuel L. Bayer <sam@mitre.org>
- Date: Thu, 8 Nov 2001 15:32:42 -0500 (EST)
- To: www-voice@w3.org
- CC: djweitzner@w3.org
All -

As part of the MITRE Corporation's participation in the DARPA Communicator program, we have joined and actively followed the activities of the W3C Voice Browsers working group. We have tremendous respect for the effort and time invested by the participants in this effort. We have tracked the development of all of the working drafts issued by the W3CVB, and provided feedback to the committee on the details of these working drafts.

However, up to this point, we have offered these comments mostly as observers. There have been a number of issues which we did not feel comfortable addressing, mostly because we will not be producing a product based on the W3CVB standards. We now recognize that this position was far too narrow, and that as a corporation chartered in the public interest, we have obligations both to the potential users of such products and to our US Government sponsors. In recognition of these responsibilities, we offer the following comments about the VoiceXML 2.0 specification.

Up to now we have been quite impressed with the usefulness of the proposed standards and the clarity of thought that lay behind them. The VoiceXML 2.0 specification, on the other hand, leaves us disappointed, for two important reasons. First, we find the current patent encumbrances profoundly inappropriate, and second, we believe that the combination of markup and code exhibited in the specification represents a serious design flaw. We elaborate on these objections below.

Patent encumbrances
-------------------

While we recognize that the W3CVB has intentionally chosen to release the VoiceXML 2.0 draft without resolving the issue of patent encumbrances, we must emphasize that we find the so-called RAND ("reasonable and non-discriminatory") conditions incompatible with the goals of an international standards body committed to open access. At the very minimum, the W3C ought to insist that all patent claimants pledge that the relevant IP be made available on a royalty-free basis.
We reach this conclusion as both a contractor for the US Government and a representative private corporation which may, in the future, choose to exploit the technology built on this proposed standard. Our primary interest here is that W3C Recommendations explicitly guarantee the ability to write and distribute free, open source software based on those Recommendations. RAND does not do this. In particular, as Daniel Weitzner, the chairman of the W3C Patent Policy Working Group, acknowledged in an interview on Slashdot, "so far [the PPWG has] decided that we do not have a good mechanism within W3C for assessing or defining reasonableness". The consequence of this problem is that so-called "reasonable" royalty requirements, as long as they actually require money to be paid, can be used as a form of de facto discrimination, even if the distribution conditions are "non-discriminatory".

To drive the point home: RAND doesn't guarantee that patent holders would make the appropriate financial arrangements to allow the development of open-source implementations of the W3C Recommendations. In point of fact, it would arguably not be in the patent holder's best interest to do so.

The absence of a guarantee of the possibility of an open-source implementation has two important consequences. First, if the W3C cannot guarantee royalty-free access, there is a strong likelihood that the acceptance of the standard would be compromised, and even the possibility that an open-source "fork" might develop. Second, both as a US Government contractor and a potential consumer of this technology, we don't believe that MITRE can support a standard which would fail to guarantee the possibility of free, open-source alternatives to commercial tools, since the potential cost both to the US Government and to the private sector could be considerable.
It's not that we believe that the world has a limitless right to open-source software alternatives; but we do believe that it's profoundly inappropriate for a standards organization like the W3C to render it impossible to produce a free conforming implementation. It's worth pointing out that the tools which have driven the success of the WWW to date - HTML, HTTP, XML, the Apache Web server - have all been patent-free, and it's easy to imagine that if any of these technologies had been patent-encumbered, and stood as the only possible option for the functionality desired, the WWW (and the W3C) might have been dead in the water from the beginning. From our point of view, guarantee of royalty-free access is an absolute requirement for this and all other W3C Recommendations.

Technical issues
----------------

In general, the working drafts issued by the W3CVB achieve an admirable level of quality and thoughtfulness. However, we believe that there is a fundamental problem with the VoiceXML 2.0 specification that must be addressed before the W3C approves it.

The central problem with the Voice Extensible Markup Language is that in some ways, it's trying to be markup, and in other ways, it's trying to be code. Because of the tradeoffs made in order to accomplish both these goals, it simply does not meet basic design criteria for programming languages, such as consistency and readability.

That VoiceXML is partially code isn't hard to demonstrate. There are tags for transfer of control (<throw>, <catch>, <goto>, although the scope of throw/catch is local, not global), variable declaration and value assignment (<var>, <assign>), conditionals (<if>, <elseif>, <else>), and what amounts to function calls on a stack (<subdialog>, <return>). You can also declare chunks of actual code using the <script> tag, but you can't pick your scripting language, as you can in HTML; ECMAScript is explicitly required by the specification, and the specification will not work without it.
For instance, the <if> tag has a "cond" attribute for the condition of the conditional, which is specified to be ECMAScript code; similarly for the <assign> tag:

    <if cond="flavor == 'vanilla'">
      <assign name="flavor_code" expr="'v'"/>
    <elseif cond="flavor == 'chocolate'"/>
      <assign name="flavor_code" expr="'h'"/>
    <elseif cond="flavor == 'strawberry'"/>
      <assign name="flavor_code" expr="'b'"/>
    <else/>
      <assign name="flavor_code" expr="'?'"/>
    </if>

Certainly, part of VoiceXML 2.0 deals with semantic data descriptions, such as the prompts and other properties associated with voice-driven menus. But at its heart, this specification reaches far beyond typical applications of markup, and represents an attempt to use a markup language as a programming language. But since the markup language is inadequate, it needs to be augmented with an actual programming language (in this case, ECMAScript), and careful correspondence rules between the markup language and the programming language need to be made, and the specification needs to be salted with special flow-of-control rules about the interpretation of the MARKUP (such as the local scope of <throw>) which are properly the domain of a programming language.

To make matters worse, the reserved symbols of the programming language are subordinate to the reserved symbols of the markup language. Here's a quote from the VoiceXML 2.0 specification:

    The expression language used in cond and expr is precisely ECMAScript. Note that the cond operators "<", "<=", and "&&" must be escaped in XML (to "&lt;" and so on). For clarity, examples in this document do not use XML escapes.

Note "for clarity"; the specification doesn't even use its actual syntax in its examples.

This readability issue with reserved symbols is a symptom of the problem. XML was originally developed as an extensible, textual semantic description of data which had the flexibility of SGML without many of the expensive details.
As such, XML was designed to be easily parsed and processed, but not to be READ. Unfortunately, visual programming techniques have not progressed to the point where VoiceXML programs could be constructed and digested by programmers without reading the actual text, and the actual text is illegible: it's a mishmash of markup and code, where the code is compromised by the reserved symbols of the markup language. Let's compare a couple of examples:

VoiceXML:

    <if cond="i &lt; 1">
      <assign name="i" expr="i+1"/>
    </if>

ECMAScript alone:

    if (i < 1) { i = i + 1 }

Exporting the flow of control into the markup language provides no discernible benefit to the code being written, and makes the resulting program (and it IS a program) significantly harder to read.

Furthermore, the specification doesn't seem to have been consistently designed. There are a variety of ways of handling variable binding and assignment:

- The <assign> tag is illustrated above.

- The <param> tag is used to pass values to a <subdialog>. However, its "name" attribute specifies the name that the SUBDIALOGUE should know it by, and the value in the subdialogue is specified by either an "expr" or a "value" attribute on <param>. In other words, this process differs significantly from function calls in a typical programming language, because the caller, rather than the callee, determines the name of the local variable in the callee.

- The <subdialog> tag also contains a "name" attribute, which declares the name of the value returned from the subdialogue. The subdialogue, in turn, has a <return> tag with a "namelist" attribute; the namelist is a space-separated list of local names whose values should be returned. The way these match up (a single name in the caller, a list of names in the callee) is that the elements in the namelist end up being ECMAScript properties of an ECMAScript object stored in the parent's name.
  So if the caller has 'name="foo"', and the callee's <return> has 'namelist="bar baz"', the caller will know the two return values as foo.bar and foo.baz. This is the case even when there's a single value returned.

- The <throw> tag has two attributes, "event" and "message", which correspond to the error class and error description in nonlocal control constructs in programming languages. However, the <catch> tag doesn't bind these attributes explicitly; there are two special variables, "_event" and "_message", which are set up when a <catch> is invoked.

It's impossible to avoid the conclusions that (a) with all these control structures, this is a programming language, and (b) as a programming language, its design suffers from significant problems of clarity and coherence.

Finally, we come to the issue of extensibility. One of the great successes of many previous W3C specifications is that they have been tremendously resilient in the face of extensions, and have done a good job of anticipating the variety of uses they might be put to. In general, the W3C specifications have achieved this by identifying and concentrating on the fundamental building blocks and thus constructing a "toolkit" out of which many applications can be built. XML is an exceptional example of such a success. The VoiceXML 2.0 specification, regrettably, is not.

For example, it has anticipated some throw/catch behavior with tags like <help> and <noinput>, but left the rest to be defined as ECMAScript classes, some predefined, some user-defined. This represents the beginnings of using ECMAScript as a "grab bag" for whatever doesn't fit in the XML tags, and we predict that this will be the typical path for evolution in this specification. Under these circumstances, the XML tags themselves, especially for flow of control, will look more and more like impediments and design mistakes, rather than the manifestation of forward-looking design.
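
To make the <subdialog>/<return> convention discussed earlier concrete, here is a minimal ECMAScript sketch of the binding as we read it in the draft. It is an illustration only: the function name bindSubdialogResult and the plain objects standing in for caller and callee scopes are hypothetical, and nothing here is taken from the specification's text or from any interpreter.

```javascript
// Hypothetical illustration: none of these names come from the VoiceXML draft.
// callerScope / calleeScope are plain objects standing in for variable scopes.
function bindSubdialogResult(callerScope, name, calleeScope, namelist) {
  const result = {};
  // The namelist is a space-separated list of the callee's local names.
  for (const key of namelist.trim().split(/\s+/)) {
    result[key] = calleeScope[key];
  }
  // The caller sees a single object, even when only one value is returned.
  callerScope[name] = result;
  return callerScope;
}

// Caller has name="foo"; the callee's <return> has namelist="bar baz".
const caller = bindSubdialogResult(
  {},
  "foo",
  { bar: 1, baz: "two", hidden: true },
  "bar baz"
);
console.log(caller.foo.bar, caller.foo.baz);
```

In an ordinary function call the callee would simply return a value and the caller would name it; here both sides must agree on names in advance, which is the inconsistency we describe.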
The solution, we believe, is to acknowledge that the flow of control issues that VoiceXML is trying to address are simply not the domain of markup languages, but rather programming languages. What is needed is a specification which clearly separates the data description issues (such as the configuration of voice menus) from the flow of control (which ought to be left to programming languages which have actually been designed as such). Once we do this, large portions of the specification become superfluous (such as those which specify the relationship between the scoping rules of ECMAScript and of VoiceXML).

Note that our recommendation implies that, since the W3C really isn't in the business of designing programming languages, there's a large chunk of dialogue processing which is simply outside the scope of the W3C charter. We believe that this is a perfectly appropriate outcome; there's no reason to believe that all of dialogue falls under the W3C purview.

In addition, this specification contrasts rather severely with the other uses that the W3CVB has put XML to: static data descriptions (e.g., the speech recognition grammar specification) and dynamic data descriptions (e.g., the speech synthesis specification, which describes markup for input to a speech synthesizer). As we said above, we recognize that there are some data descriptions in the specification; but the attempt in VoiceXML 2.0 to unify them with the flow-of-control requirements of dialogue is, to our mind, fundamentally flawed.

Summary
-------

We have two primary objections to the VoiceXML 2.0 specification in its current form: its questionable status with regard to free implementations, and its flawed design as a programming language. We do not believe that the W3C should endorse the VoiceXML 2.0 specification without significant revisions which address both these issues.

Respectfully submitted,

Samuel Bayer
John Aberdeen
Bryan George
Alan Goldschen
Lynette Hirschman
Bede McCall

The MITRE Corporation
Received on Friday, 9 November 2001 04:51:03 UTC