Re: An early draft of a speech API

Note: bracketed text means [misrecognition | corrected]

On 3/16/2011 9:37 AM, Olli Pettay wrote:
> On 03/16/2011 03:16 AM, Eric S. Johansson wrote:
>> On 3/15/2011 5:11 PM, Olli Pettay wrote:
>>> On 03/15/2011 09:57 PM, Young, Milan wrote:
>>>> I agree with Robert that the Mozilla proposal doesn't feel very
>>>> "open". I'd further suggest that the Google speech proposal has
>>>> similar properties.
>>>>
>>>> In both cases, there is a tight coupling between the browser and
>>>> speech service that is outside of W3C and IETF turf. This closed
>>>> model has all of the usual implications such as:
>>>> * A cross-product of integrations across UA and SS
>>> If Nuance has a public web based speech service and it exposes
>>> the API for it, browsers
>>> could use it as a default speech engine when the device is online.
>>> Or browsers could use some other engine.
>>
>> We need the same API for both local and remote speech recognition
>> engines.
> I was talking about the API between browser and some speech engine,
> not the API which web developers would use.
Sorry, maybe it would be better to think of this as a response to the Google 
proposal, which looks like a web-developer API.

>
>
>> If you want to see the kind of things people are doing today with
>> speech recognition APIs, take a look at Vocola and Dragonfly:
>>
>> http://vocola.net/
>> http://code.google.com/p/dragonfly/
>>
>> These are two toolkits in very heavy use within the technically capable
>> speech recognition community.
>
> The web platform already provides a lot of what, for example, Dragonfly
> seems to handle, like the Action and Window packages.
> The forthcoming API will handle the Grammar (this is largely just W3C SRGS
> and SISR) and Engine packages. It is then up to the web application
> to do whatever it wants with the recognition result.
>
> Vocola is a language. It seems to be largely an
> alternative to W3C SRGS and SISR, plus it has some functionality
> which the web platform already provides.
> It also has things like ShellExecute, which of course won't be
> provided in an API that can be used in web pages.

Thank you for the pointers to the other working group information. I have some 
more reading ahead of me.

> It looks like the target audience of Dragonfly and Vocola
> is quite different than what we have for HTML Speech.
> "Vocola is a voice command language—a language for creating commands to 
> control a computer by voice."
> We're not trying to create an API to control the computer by voice,
> only web apps, and only if the web app itself uses the API.
> And the web platform already has a rich API which
> can and should be utilized.

Vocola simplifies new user interface creation. It exposes the capabilities of 
the underlying API used by Natlink through a simplified interface. We don't tell 
the users this because it would only confuse them, but we've had some fairly good 
success with non-geek people adding UI elements to their applications. It's sort 
of like the way spreadsheets let people program without knowing that they are.

The distinction between controlling the computer by voice and controlling only 
web apps by voice (and only web apps that use the API) is pretty silly. A gateway 
or proxy to a speech recognition engine would let a browser or the host add 
speech recognition capability by bridging the gateway interface to the internal 
interface. This proxy model would handle both local and remote speech recognition 
engines.

For example, if I am totally in love with Nuance and don't trust any other 
speech recognition vendor, I could throw the switch on my proxy to use the 
Nuance engine for every application. On the other hand, let's say there's a bug 
in NaturallySpeaking that replaces all of my "can't"s with "can"s (this may 
actually be a real bug) and I have access to the Google speech recognition 
engine; then I could throw the proxy switch to use that engine from that time 
forward.

If a web application is running in a speech-enabled browser as described 
above, I should be able to add my own commands even if the application has no 
knowledge of speech recognition. Exactly how that would happen, I'm not sure. 
Maybe something along the lines of tweaking various elements of the DOM and then 
"pushing buttons". Maybe there's another alternative. But the point is, I should 
have the ability to change the user interface. The underlying API would have no 
clue that the changes had been made, because all speech recognition engines 
behave identically at the API level (you can stop laughing now).
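
Very roughly, and purely as an illustration (the recognizer object and its 
result shape here are made up, not taken from any proposal), a user-added 
command might look something like this:

    var recognizer = {};   // stand-in for whatever object the real API gives us

    // "send message" clicks the page's existing Send button, even though
    // the page itself knows nothing about speech recognition.
    recognizer.onresult = function (event) {
      var phrase = event.transcript.trim().toLowerCase();
      if (phrase === "send message") {
        var button = document.querySelector("button.send");
        if (button) { button.click(); }   // "pushing buttons" through the DOM
      }
    };

    // simulate a recognition result for testing:
    recognizer.onresult({ transcript: "Send message" });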

As for the web API being "rich", that probably means the API is too complex for 
end users, and we'll need to put a layer on top of it to simplify things and make 
it easier for end users to build their own accessibility interfaces.
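
What I would like to see on top of it is something closer to what Vocola gives 
people today: a flat table of phrases and actions that an end user can maintain 
without understanding the full API. A toy version, with all names invented:

    // The part the end user writes: spoken phrase -> element to activate.
    var myCommands = {
      "next message": "#next",
      "archive this": "#archive",
      "reply to all": "#reply-all"
    };

    // The part the simplifying layer provides, hiding the full API.
    function runCommand(phrase) {
      var selector = myCommands[phrase.trim().toLowerCase()];
      var element  = selector && document.querySelector(selector);
      if (element) { element.click(); }
    }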

>> Whatever you do for API, we have a demonstrated need for to support
>> projects of a level of complexity comparable to voice code.
> What do you mean with this?

It looks like it's probably a speech recognition error [but | that] I didn't catch.

What I was trying to say is: whatever API is present, it must support projects 
that are as complex as VoiceCode. There's a small group of us who have done 
things with speech recognition that you rarely see in commercial projects. We end 
up filling in the accessibility gap that speech recognition vendors can't and, I 
would argue, shouldn't even try to fill. It doesn't make sense economically, 
because it's a lot of work to cover little ground and there's no money there.

Unfortunately, we are fighting a losing battle, because vendors either break the 
API we count on (very few of the technical users are migrating from 
NaturallySpeaking 10.1 to 11), or we don't get the level of functionality we need. 
Obviously, with the information you've given me, I need to sit down and go over 
the web API, see if I can make it do what I need for some speech user interface 
models I've developed, and see what's possible.

On a side note, I really hate giving up Python for JavaScript, because I can 
dictate a fair amount of Python with ordinary speech recognition but I can't get 
off the mark with JavaScript. It is a horrible language, and we need much better 
editing tools and interaction models before we can even start to use JavaScript. 
"We", of course, meaning disabled developers.

On to user customization of the speech user interface.

> IMO, it is up to the UA to allow or disallow
> user specific local customization for web app user interface (whether
> it is graphical or speech user interface). None of the APIs
> disallows or disables that. One could use Greasemonkey scripts or
> similar to customize the UI of some particular page.

Okay, that's a good point. I would argue that, in the interest of not screwing up 
accessibility this time the way we did with desktop applications, local 
customization should always be enabled. Maybe we need to click a checkbox to 
turn it on, but the capability should always be there. Of course, this 
starts [an] entirely separate [hand | and] probably lengthy conversation on 
security issues. For the moment, let's wave our hands and say there are much 
smarter people who can solve this problem for us. My initial thought would be 
that there is either a sandbox within the user agent or a sandbox process on the 
host that holds all of the local customization code. Needs more thought.
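
Just to make the hand-waving slightly more concrete, one shape this could take 
with what browsers already have is a Worker (or similar sandboxed context) 
holding the user's customization code, with a thin glue layer deciding which of 
its requests to honour. The file name and message format below are made up:

    // The customization code lives in its own Worker and never touches the
    // page directly; it can only post messages.
    var customization = new Worker("my-customizations.js");   // hypothetical file

    // Glue layer: honour a small, known set of requests, drop everything else.
    customization.onmessage = function (event) {
      var request = event.data;
      if (request && request.type === "activate") {
        var element = document.querySelector(request.selector);
        if (element) { element.click(); }
      }
    };
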
>
>
>> I've raised this in the
>> context of accessibility but it's also a valid concern for third-party
>> vendors who come up with a better way to implement or expand an
>> interface for application.
> What case do you have in mind? Some other vendor to provide speech UI
> for Gmail for example? That would be possible with browser addons or
> Greasemonkey scripts. But IMO, not really something this group
> should focus on.

I would never, ever inflict Greasemonkey scripts on an end user or commercial 
customer as a way of doing any sort of add-on. Way too fragile. It may not be 
the responsibility of this group, but let's build a sane, rational, 
deterministic way of modifying a user interface that will survive updates and 
revisions.

In medical records, oftentimes there are specific procedures that a hospital 
wants to follow, but the "cast in stone" user interface forces them to change 
their procedures to meet the needs of the user interface. The addition of speech 
recognition only helps with the transcription or dictation of text into various 
fields. If one could get in and modify the user interface, one could enable a 
less vocally intensive user interface built on the hospital's preferred workflow.

An interface I could talk more about would be an e-mail-type interface. The 
basic process of sending a mail message would be something like this: start by 
dictating the body of the e-mail, then give a list of people to send it to 
(complete with visual feedback for resolving ambiguous names), maybe add a 
subject, and then give the final command to send, with positive confirmation.
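
Purely as a sketch of that ordering (the phrases, stage names, and functions 
here are all invented), the flow I have in mind is a little state machine where 
"send" only works at the very end, after explicit confirmation:

    // Dictate body -> recipients -> subject -> confirmed send.
    var stages = ["body", "recipients", "subject", "confirm"];
    var current = 0;

    function handleCommand(phrase) {
      if (phrase === "next step" && current < stages.length - 1) {
        current += 1;                 // advance only when the user says so
      } else if (phrase === "send message") {
        if (stages[current] === "confirm" &&
            window.confirm("Really send this message?")) {
          deliverMessage();           // hand the message to the mail app
        }
        // anywhere else, "send message" is ignored -- no accidental sends
      }
    }

    function deliverMessage() {
      // placeholder for whatever the mail application actually does
    }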

That's upside down from the way the current user model works. Far too often I 
have been working on an e-mail message only to have NaturallySpeaking hear the 
command to deliver the message, and off it goes, with me trying hard not to 
scream so loudly that I terrify my office mates. Yes, that's a user interface I 
would love to rework.

Sit me down with any application [up to | and I will tell] you how you need to 
rework it for speech recognition use. There is almost no overlap between a tall, 
narrow GUI and a wide, shallow speech UI.

--- eric
