[convapps] Summary of the Workshop on Conversational Applications

Summary of the Workshop on Conversational Applications ― Use Cases
and Requirements for New Models of Human Language to Support Mobile
Conversational Systems ―

On June 18th and 19th, 2010, W3C (the World Wide Web Consortium) held
a Workshop on "Conversational Applications -- Use Cases and
Requirements for New Models of Human Language to Support Mobile
Conversational Systems".

The minutes of the workshop are available on the W3C Web server:
 http://www.w3.org/2010/02/convapps/minutes.html

The HTML version of this summary, which includes links to detailed
discussions of each use case, is also available at:
 http://www.w3.org/2010/02/convapps/summary.html

The goal of the workshop was to understand the limitations of the
current W3C language model in order to develop a more comprehensive
model. The plan for the workshop was to collect and analyze use cases
and prioritize requirements that ultimately would be used to identify
improvements to the model of human language currently supported by W3C
standards.

Openstream graciously hosted the workshop in Somerset, New Jersey,
providing us with fabulous facilities conveniently located in the
hotel, with an incredible amount and array of food, excellent Internet
access, plentiful power, etc.  In short, Openstream provided the
perfect arrangements for this workshop.

The workshop had attendees from Openstream, Conversational
Technologies, Voxeo, IBM, Cambridge Mobile, Redstart Systems,
Loquendo, Nuance, NICT, AT&T, Verizon Wireless, and W3C.

The first day was spent on brief presentations of the attendees'
position papers, along with discussion.  The presentation topic
sessions were:
- Lexical and Morphological Standards
- Grammars, Semantics, and Natural Language Processing
- Architectures and Dialog System Integration

At the end of the first day the presenters were asked to write up
answers to the following:

1) Describe a situation that demonstrates the issue.
2) Describe your implementation.
3) Why were you not able to use only existing standards to accomplish this?
4) What might you suggest could be standardized?

During the second day, we broke into smaller groups to extract
detailed use cases based on the answers to questions 1 and 3.  After
combining similar use cases, we took a rough straw poll to determine
the approximate level of group interest in each use case.  The use
cases, roughly in order from greatest group interest to least, were:

- Dynamic, on-the-fly activation, deactivation, or combination of
 constrained and unconstrained recognition (SRGS (Speech Recognition
 Grammar Specification) or SLM (Statistical Language Model) grammars)
 or other recognition constraints.  Intelligent conversational
 applications combine open-ended and restricted language, so we need
 a mechanism to specify how any set of recognition constraints is
 combined.  Moreover, we need to be able to weight those constraints
 dynamically based on context.  (An SRGS sketch follows this list.)

- Applications need to be sensitive to certain (arbitrary, dynamically
 extracted) features, e.g. gender, age, etc.  Example: adjust voice,
 phrasing, etc. based on those features.  Current limitation: the
 current VXML (VoiceXML) infrastructure only passes words,
 interpretations, and confidence scores.  We need a place to put this
 information so that it is transmitted to the application.  (An EMMA
 sketch follows this list.)

- Syntactic Formalism.  Today an author cannot create a syntactic
 grammar for comprehensive NL (natural language) because the
 formalism lacks feature inheritance, POS (part of speech)
 terminals, concord, inversion, etc.  A new formalism should be
 created.

- Semantic representation of dialogue state that can include any kind
 of data (e.g. history, slot conditions, user models, expectations of
 the next system actions).  The problem is that the current VXML 2.0
 specification does not support a container of dialogue states that
 holds multiple hypotheses about the dialogue state.

- Shared Syntactic Grammars (for simultaneously running applications):
 combine recognition constraints when multiple applications are active
 simultaneously and transfer focus between them.

- Some dialogue systems contain discourse and WSD (word-sense
 disambiguation) information that could be used to improve spoken
 rendering.  Example: "record" (noun vs. verb).  We need a mechanism
 to convey this information between those components without having
 to modify the categorization of either the dialogue or the synthesis
 system.  (An SSML sketch follows this list.)

- R&D Agility: As we do research we develop new algorithms that need
 new information, and we would like to experiment with them before
 standardization.  We need a reliable mechanism in VXML to carry this
 information.  Examples: adding location information or new DSR
 (distributed speech recognition) signal features.  We would like a
 standard way to attach vendor-specific recognition result information
 that is guaranteed to be passed to the application.

- Need a way for users to resolve conflicting commands, and a way to
 organize, share, remember, and prioritize commands.  One solution
 might be user configuration for conflict resolution.  Today users
 can't find commands, adjust them, organize them, or share them.

- EMMA Extension: Multi-source input and corresponding confidence.  In
 multimodal and more advanced applications, input might come from a
 variety of simultaneous sources such as text, speech, GPS (global
 positioning system), world knowledge, a user profile, etc.  For
 instance, I might say "I want to go to Denver" and the application
 can know from GPS where I am.  Perhaps each concept or slot could
 have multiple input sources with corresponding values and confidence
 scores.  (An EMMA sketch follows this list.)

- Focus change - users need a way to tell the device how to control
 focus.  When using a mouse the focus is clear, but not necessarily
 so when using speech commands. There's no standard way to do this.

- Users are afraid to make mistakes using speech: users need a way to
 undo both actions and text events.

- Interactions between lexicons and grammars don't include additional
 information such as POS (part of speech), grammatical features, or
 other annotations.  Example: it would be nice to annotate a name with
 the region (location) to influence pronunciation.  (A PLS sketch
 follows this list.)

- EMMA Extension for Richer Semantic Representation.  We want to be
 able to represent the semantics of complex NL.  Examples: "give me
 all the toppings except onions", or "I want to leave this afternoon
 or tomorrow morning to arrive before noon".  In current standards we
 can represent attribute-value pairs, and there is hierarchy in EMMA,
 but there is no way to specify modifiers and quantifiers between
 slots.  (An EMMA sketch follows this list.)

- Phoneme sets: an author can't create a component (application,
 lexicon, ASR engine, TTS engine, etc.) that is assured to be
 interoperable with other components in terms of the phoneme set.
 The author should be able to use and specify a pre-defined standard
 phoneme set.  (A sketch of the existing alphabet hooks follows this
 list.)

- Problem Solving: we want to be able to build applications that solve
 complex problems, like help desk problem solving.  The call control
 logic of such an application cannot today be efficiently described
 as a state machine; therefore, available standards (VXML, SCXML) are
 insufficient to implement these applications, which may instead
 require, e.g., a probabilistic rule engine or a task agent system.
 (An SCXML sketch follows this list.)

- Morphology engine: today, there is no engine component or formalism
 for morphology.  This is required to create appropriate replies and
 provide a higher level of abstraction for developers and systems.
 Therefore a new formalism and engine component for morphology should
 be created.
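
To make the first use case above concrete, here is a minimal SRGS 1.0
sketch of what an author can approximate today: a root rule combining
a constrained command rule with the special GARBAGE rule standing in
for open-ended input.  The phrases and weights are illustrative only;
SRGS offers no standard way to reference an SLM or to re-weight these
alternatives dynamically at runtime, which is the gap the use case
identifies.

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative only: static weights, with GARBAGE as a stand-in
       for unconstrained input -->
  <grammar xmlns="http://www.w3.org/2001/06/grammar"
           version="1.0" xml:lang="en-US" mode="voice" root="top">
    <rule id="top" scope="public">
      <one-of>
        <item weight="2.0"><ruleref uri="#command"/></item>
        <item weight="0.5"><ruleref special="GARBAGE"/></item>
      </one-of>
    </rule>
    <rule id="command">
      <one-of>
        <item>check my balance</item>
        <item>transfer funds</item>
      </one-of>
    </rule>
  </grammar>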
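
For the use case on additional recognition features (gender, age,
etc.), EMMA 1.0 already provides emma:info as a container for
application- and vendor-specific metadata; the missing piece noted
above is a standard way to carry such annotations through to the
VoiceXML application.  A sketch, with a made-up "ex" namespace and
feature names:

  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma"
             xmlns:ex="http://example.org/features">
    <emma:interpretation id="r1" emma:medium="acoustic"
                         emma:mode="voice" emma:confidence="0.87"
                         emma:tokens="book a table for two">
      <command>book-table</command>
      <emma:info>
        <!-- Hypothetical, non-standard annotations -->
        <ex:gender>female</ex:gender>
        <ex:age-range>25-40</ex:age-range>
      </emma:info>
    </emma:interpretation>
  </emma:emma>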
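
For the word-sense/rendering use case, what an author can do today is
hand-annotate the synthesis markup, for example with the SSML phoneme
element; the point of the use case is that the dialogue system's own
POS or WSD knowledge cannot drive this choice automatically.  The IPA
strings below are approximate.

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    Press one to <phoneme alphabet="ipa" ph="rɪˈkɔrd">record</phoneme>
    a new message, or two to review your last
    <phoneme alphabet="ipa" ph="ˈrɛkɚd">record</phoneme>.
  </speak>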
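
For the multi-source input use case, EMMA 1.0 can already group
results from different sources and annotate each with a confidence;
what is missing is a standard way to tie several sources and their
confidences to a single concept or slot.  A sketch using emma:group;
the "gps" mode value and the instance element names are illustrative:

  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:group id="trip">
      <emma:interpretation id="speech1" emma:medium="acoustic"
                           emma:mode="voice" emma:confidence="0.80"
                           emma:tokens="i want to go to denver">
        <destination>Denver</destination>
      </emma:interpretation>
      <emma:interpretation id="gps1" emma:mode="gps"
                           emma:confidence="0.95">
        <origin>Somerset, New Jersey</origin>
      </emma:interpretation>
    </emma:group>
  </emma:emma>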
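
For the lexicon/grammar interaction use case, a minimal PLS 1.0 entry
shows what can be expressed today: a grapheme with one or more
pronunciations.  There is no standard attribute for the kind of region
or grammatical annotation described above; the name and IPA string are
illustrative and approximate.

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0"
           xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
           alphabet="ipa" xml:lang="en-US">
    <lexeme>
      <grapheme>Worcester</grapheme>
      <!-- Approximate; regional variants cannot be labeled in a
           standard way -->
      <phoneme>ˈwʊstɚ</phoneme>
    </lexeme>
  </lexicon>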
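
For the richer semantic representation use case, the payload inside an
EMMA interpretation is application-defined XML, so an author can
invent a structure for quantifiers and exceptions; the point raised
above is that there is no standard vocabulary for such modifiers.  The
element and attribute names below are made up for illustration:

  <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation id="order1" emma:medium="acoustic"
        emma:mode="voice"
        emma:tokens="give me all the toppings except onions">
      <!-- Ad hoc, application-specific structure -->
      <toppings quantifier="all">
        <except>onions</except>
      </toppings>
    </emma:interpretation>
  </emma:emma>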
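
For the phoneme set use case, the existing hook is the alphabet
attribute in SSML and PLS: "ipa" is the only value defined for
cross-vendor use, and vendor-specific alphabets begin with "x-", which
is exactly where portability breaks down.  The vendor alphabet name
below is made up.

  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
    <phoneme alphabet="ipa" ph="təˈmɑːtoʊ">tomato</phoneme>
    <!-- Hypothetical vendor alphabet; not portable across engines -->
    <phoneme alphabet="x-acme-phones" ph="t ax m aa t ow">tomato</phoneme>
  </speak>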
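
For the problem-solving use case, an SCXML sketch shows the kind of
control logic that is straightforward to express as a state chart
today; the point made above is that probabilistic or agent-style
reasoning does not reduce cleanly to such a machine.  State and event
names are invented for illustration.

  <scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0"
         initial="collect_symptom">
    <state id="collect_symptom">
      <transition event="symptom.reported" target="diagnose"/>
    </state>
    <state id="diagnose">
      <!-- A real help desk needs ranking over many possible causes,
           which is where a fixed state machine becomes unwieldy -->
      <transition event="diagnosis.found" target="propose_fix"/>
      <transition event="diagnosis.unknown" target="escalate"/>
    </state>
    <state id="propose_fix">
      <transition event="fix.confirmed" target="done"/>
      <transition event="fix.rejected" target="diagnose"/>
    </state>
    <state id="escalate"/>
    <final id="done"/>
  </scxml>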


As a group, we then brainstormed, for the top five use cases, possible
new standards or extensions to existing ones that could address each
use case.

The use cases presented above will next be sent to the W3C Voice
Browser and Multimodal Interaction Working Groups, where they will be
reviewed and the groups will make recommendations for changes to
existing specifications and/or suggestions for new specifications.

Daniel C. Burnett, Deborah Dahl, Kazuyuki Ashimura and
James A. Larson, Workshop Organizing Committee 

-- 
Kazuyuki Ashimura / W3C Multimodal & Voice Activity Lead
mailto: ashimura@w3.org
voice: +81.466.49.1170 / fax: +81.466.49.1171
