RE: R21. Any public interface for creating extensions should be speakable

Eric, thanks for fleshing this out.  But I'm still having trouble seeing the line between API requirements, examples of user agent and/or application capabilities that could be built with the API, and design recommendations (feedback UX, text-editing UX, overlays, SAPI5 grammars, etc.).  This still sounds to me like some good examples of richer applications that could be built using the API.


-----Original Message-----
From: Eric S. Johansson [mailto:esj@harvee.org] 
Sent: Thursday, November 04, 2010 10:15 PM
To: Satish Sampath
Cc: Robert Brown; Bjorn Bringert; Dan Burnett; public-xg-htmlspeech@w3.org
Subject: Re: R21. Any public interface for creating extensions should be speakable

On 11/4/2010 6:02 PM, Satish Sampath wrote:
> A set of use cases which clearly describe what 'end user extensions'
> mean would help in deciding whether they should be in scope or out of 
> scope of our work.
>
> I believe Eric Johansson suggested these requirements initially. Eric, 
> can you help us with a few use cases for these requirements?

Sorry about my lack of participation. I was eaten alive by a bunch of things, unfortunately none of them involving speech user interfaces.

As a general rule, I argue that user extensions that affect or control part or all of a speech user interface are part of the speech recognition environment. With that outlook, it's an "obvious" conclusion that they are part of our scope of work.

The best use case I can think of is the overlay of a speech user interface on top of a cloud application. In addition to classic command-and-control grammars and actions, there should be other windows displaying or accepting results from the recognition process and translating them to/from what is in the application itself. For example, an e-mail client with a three-pane layout would have an overlay or extension adding markers internal to the windows that can be spoken for direct control. In addition, the extension should be capable of disabling single-key commands to make the environment more robust when a misrecognition turns a command into simple text injection. A classic example of this kind of failure is vi: if your command fails and becomes injected text, heaven help you figuring out what happened. A third feature of extensions would be additional speech-recognition-only windows popping up to inform the user or to convert data into something that will work with speech recognition.
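
To give a feel for it, here is a rough sketch (TypeScript, purely hypothetical: the element ids, markSpeakable(), and the way an extension hooks in are all made up) of what such an overlay extension might do:

// Rough sketch of an overlay extension for a three-pane mail client.
// Nothing here is from an existing spec; it only illustrates the idea.

// Attach spoken markers to each pane so "go to message list" etc. can work.
function markSpeakable(pane: HTMLElement, spokenName: string): void {
  pane.setAttribute("data-spoken-marker", spokenName);
}

markSpeakable(document.querySelector<HTMLElement>("#folders")!, "folder list");
markSpeakable(document.querySelector<HTMLElement>("#messages")!, "message list");
markSpeakable(document.querySelector<HTMLElement>("#preview")!, "message preview");

// Suppress single-key shortcuts so that a misrecognized command that falls
// through as plain text does not trigger destructive shortcuts (the vi problem).
document.addEventListener("keydown", e => {
  if (e.key.length === 1 && !e.ctrlKey && !e.metaKey) {
    e.stopPropagation();
  }
}, true);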

Some of the features that should be available through this interface are SAPI 5-compliant grammars, microphone control, "heard word" feedback, and speech-driven editing of buffer text, to name a few.
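
To make that concrete, here is roughly the shape of the extension-facing surface I have in mind (TypeScript-ish; every name here is hypothetical and not taken from any existing API):

// Hypothetical extension-facing surface, for illustration only.
interface SpeechExtensionHost {
  loadGrammar(srgsOrSapi5Xml: string): void;             // SAPI 5 / SRGS grammar
  setMicrophone(state: "on" | "off" | "asleep"): void;   // microphone control
  onHeardWord(cb: (utterance: string) => void): void;    // "heard word" feedback
  editBuffer(target: HTMLTextAreaElement): SpeechEditSession;  // speech-driven text editing
}

interface SpeechEditSession {
  select(spokenPhrase: string): void;   // e.g. "select last sentence"
  replace(withText: string): void;
  done(): void;
}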

Ideally, if all of the local interfaces are identical between different recognition engines, then my command UI should just work. I'm sure it will be just as portable as CSS. I hope that your reaction, like mine, is to think about ways of specifying alternate cases for user extensions to accommodate potential IE 6 hell scenarios.

I need to think about this a little more, but I think all of the use cases reduce to:

1. "The vendor failed and I need to fix their bugs in user interface"
2. "The vendor failed and created a user interface that breaks horribly with recognition failure'
3. Vendor outfitting or retrofitting application with speech user interface 4. Disabled user or user community retrofitting application with speech user interface.

Do any others come to mind?

R19 lets us handle the extreme use cases of the user (ant) keeping all important executables and data local (on a laptop) versus the user (grasshopper) keeping nothing in their own control, instead trusting third parties for everything (extensions, recognition, etc.). The user can store their extensions any way they want.

R21 is important, and it's missing a lot. The original goal of R21 was to make sure that all of the features of the speech recognition environment were always available to a disabled user or developer. This means that everything with a graphical user interface also has a speech user interface. Any data files or programming languages must also be completely speakable and editable by voice.

One important implication is that data files like the XML-based data sets described in our conversations need some sort of editing tool/framework that lets you edit XML by speech. This is not horribly hard, but it looks nothing like the "speak the keyboard" solutions proposed by people who don't use speech recognition. The solution is some form of shadow buffer which translates the XML into a speakable form; you edit within that shadow buffer, and it changes the XML when you say you're done. No, I don't have a working model yet. It's a goal, but it's taking more resources than I have available at the moment.
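
As a rough illustration of the shadow-buffer idea (TypeScript against browser DOM APIs; toShadowBuffer/applyShadowBuffer are made-up names, and a real tool would handle far more than flat elements):

// Translate flat XML into one speakable line per element, e.g. "name is Ann".
function toShadowBuffer(xml: string): string[] {
  const doc = new DOMParser().parseFromString(xml, "application/xml");
  const lines: string[] = [];
  doc.documentElement.childNodes.forEach(node => {
    if (node.nodeType === Node.ELEMENT_NODE) {
      const el = node as Element;
      lines.push(`${el.tagName} is ${el.textContent}`);
    }
  });
  return lines;
}

// When the user says "done", write the edited speakable lines back into the XML.
function applyShadowBuffer(xml: string, lines: string[]): string {
  const doc = new DOMParser().parseFromString(xml, "application/xml");
  for (const line of lines) {
    const [tag, ...rest] = line.split(" is ");
    const el = doc.documentElement.querySelector(tag);
    if (el) el.textContent = rest.join(" is ");
  }
  return new XMLSerializer().serializeToString(doc);
}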

Therefore, I agree with others that unfortunately it is necessary to declare R21 out of scope, even though it means excluding disabled developers from working with the tools they need. The accessibility requirement can be satisfied at a later time through dedicated applications, extensions, and user interface models similar to the ones I've developed as part of my programming-by-speech effort.
Unfortunately, if you know history, you know this will take forever or longer to get done.

Again I apologize for not getting back to people sooner. Let me know what additional info you need and I'll try to fill in the blanks this weekend.

Received on Friday, 5 November 2010 06:34:20 UTC