- From: Eric S. Johansson <esj@harvee.org>
- Date: Wed, 16 Mar 2011 11:51:00 -0400
- To: public-xg-htmlspeech@w3.org
Note: bracketed text means [misrecognition | corrected]

On 3/16/2011 9:37 AM, Olli Pettay wrote:
> On 03/16/2011 03:16 AM, Eric S. Johansson wrote:
>> On 3/15/2011 5:11 PM, Olli Pettay wrote:
>>> On 03/15/2011 09:57 PM, Young, Milan wrote:
>>>> I agree with Robert that the Mozilla proposal doesn't feel very
>>>> "open". I'd further suggest that the Google speech proposal has
>>>> similar properties.
>>>>
>>>> In both cases, there is a tight coupling between the browser and
>>>> speech service that is outside of W3C and IETF turf. This closed
>>>> model has all of the usual implications such as:
>>>> * A cross-product of integrations across UA and SS
>>> If Nuance has a public web based speech service and it exposes
>>> the API for it, browsers could use it as a default speech engine
>>> when the device is online. Or browsers could use some other engine.
>>
>> We need the same API for both local and remote speech recognition
>> engines.
> I was talking about the API between browser and some speech engine,
> not the API which web developers would use.

Sorry; maybe it would be better to think of this as a response to the Google proposal, which looks like a web developer API.

>> If you want to see the kind of things people are doing today
>> speech recognition APIs take a look at vocola, and dragonfly
>>
>> http://vocola.net/
>> http://code.google.com/p/dragonfly/
>>
>> These are two toolkits in very heavy use within the technically capable
>> speech recognition community.
>
> Web platform provides already a lot of what
> for example Dragonfly seem to handle, like Action and Window packages.
> The becoming API will handle Grammar (this is largely just W3C SRGS
> and SISR) and Engine packages. It is then up to the web application
> to do whatever it wants with the recognition result.
>
> Vocola is a language. It seems to be largely an
> alternative to W3C SRGS and SISR + it has some functionality
> which web platform already provides.
> It has also things like ShellExecute, which of course won't be
> provided in an API which can be used in web pages.

Thank you for the pointers to the other working group information. I have some more reading ahead of me.

> It looks like the target audience of Dragonfly and Vocola
> is quite different than what we have for HTML Speech.
> "Vocola is a voice command language—a language for creating commands to
> control a computer by voice."
> We're not trying to create an API to control computer by voice,
> only web apps, and only if web app itself uses the API.
> And Web platform has already rich API which
> can and should be utilized.

Vocola simplifies new user interface creation. It exposes the capabilities of the underlying API used by NatLink through a simplified interface. We don't tell users this because it would only confuse them, but we have had some fairly good success with non-geek people adding UI elements to their applications. It's sort of like the way spreadsheets let people program without knowing that they are.

The distinction between controlling the whole computer by voice and controlling only web apps, and only web apps that themselves use the API, is pretty silly. A gateway or proxy to the speech recognition engine would let a browser or the host add speech recognition capability by bridging the gateway interface to the internal interface. This proxy model would handle both local and remote speech recognition engines.
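To make that proxy model a bit more concrete, here is a minimal sketch in TypeScript. Everything in it is hypothetical: the interface names, and the assumption that a remote engine is reachable by POSTing audio to some vendor endpoint. It is not taken from any of the existing proposals; it only shows one recognition interface being routed to whichever engine the user has switched on, local or remote.

    // Hypothetical sketch of the proxy/gateway idea; names are made up.
    interface RecognitionResult {
      transcript: string;
      confidence: number;
    }

    // One interface, regardless of where the engine actually runs.
    interface SpeechEngine {
      recognize(audio: ArrayBuffer, grammar?: string): Promise<RecognitionResult>;
    }

    // A local engine bridged through some host-internal interface.
    class LocalEngine implements SpeechEngine {
      async recognize(audio: ArrayBuffer, grammar?: string): Promise<RecognitionResult> {
        // The bridge into the host's native recognizer would go here; stubbed out.
        return { transcript: "", confidence: 0 };
      }
    }

    // A remote engine reached over HTTP; vendor and endpoint are the user's choice.
    class RemoteEngine implements SpeechEngine {
      constructor(private endpoint: string) {}
      async recognize(audio: ArrayBuffer, grammar?: string): Promise<RecognitionResult> {
        const response = await fetch(this.endpoint, { method: "POST", body: audio });
        return (await response.json()) as RecognitionResult;
      }
    }

    // The proxy: the browser talks only to this, and the user throws the switch.
    class SpeechProxy implements SpeechEngine {
      constructor(private engine: SpeechEngine) {}
      use(engine: SpeechEngine): void { this.engine = engine; }
      recognize(audio: ArrayBuffer, grammar?: string): Promise<RecognitionResult> {
        return this.engine.recognize(audio, grammar);
      }
    }

    // Throwing the switch: start on the local engine, move to a remote one later.
    const proxy = new SpeechProxy(new LocalEngine());
    proxy.use(new RemoteEngine("https://speech.example.com/recognize"));

The point of the sketch is only that the switch lives in the proxy, so neither the web application nor the browser-facing API has to care which vendor sits behind it.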
For example, if I am totally in love with Nuance and don't trust any other speech recognition vendor, I could throw the switch on my proxy to use the Nuance engine for every application. On the other hand, let's say there's a bug in NaturallySpeaking that replaces all of my "can't"s with "can"s (this may actually be a real bug) and I have access to the Google speech recognition engine; then I could throw the proxy switch to use that engine from that time forward.

If a web application is running in the speech-enabled browser as described above, I should be able to add my own commands even if the application has no knowledge of speech recognition. Exactly how that would happen, I'm not sure. Maybe something along the lines of tweaking various elements of the DOM and then "pushing buttons". Maybe there's another alternative. But the point is, I should have the ability to change the user interface. The underlying API would have no clue that the changes had been made, because all speech recognition engines behave identically at the API level (you can stop laughing now).

As for the Web API being "rich", that probably means the API is too complex for end users, and we'll need to put an API on top of it to simplify things and make it easier for end users to build their own accessibility interfaces.

>> Whatever you do for API, we have a demonstrated need for to support
>> projects of a level of complexity comparable to voice code.
> What do you mean with this?

It looks like it's probably a speech recognition error [but | that] I didn't catch. What I was trying to say is: whatever API is present, it must support projects that are as complex as VoiceCode. There's a small bunch of us who have done things with speech recognition that you rarely see in commercial projects. We end up filling in the accessibility gap that speech recognition vendors can't and, I would argue, shouldn't even try to fill. It doesn't make sense economically, because it's a lot of work to cover little ground and there's no money there. Unfortunately, we are fighting a losing battle because vendors either break the API we count on (very few of the technical users are migrating from NaturallySpeaking 10.1 to 11), or we don't get the level of functionality we need.

Obviously, with the information you've given me, I need to sit down and go over the web API to see if I can make it do what I need for some speech user interface models I've developed, and see what's possible. On a side note, I really hate giving up Python for JavaScript, because I can dictate a fair amount of Python with ordinary speech recognition but I can't get off the mark with JavaScript. It is a horrible language, and we need much better editing tools and interaction models before we can even start to use JavaScript. "We", of course, meaning disabled developers.

On user customization of the speech user interface:

> IMO, it is up to the UA to allow or disallow
> user specific local customization for web app user interface (whether
> it is graphical or speech user interface). None of the APIs
> disallows or disables that. One could use Greasemonkey scripts or
> similar to customize the UI of some particular page.

Okay, that's a good point. I would argue that, in the interest of not screwing up accessibility this time like we did with desktop applications, local customization should always be enabled. Maybe we need to click a checkbox to turn it on, but the capability should always be there.
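Going back to the point about tweaking the DOM and "pushing buttons": purely as a hypothetical sketch, assuming the user agent hands locally installed customization code the final recognition text, the kind of thing I have in mind looks roughly like this. The command table, the selectors, and the dispatchCommand entry point are all made up for illustration.

    // Hypothetical local-customization layer: map spoken phrases to DOM actions
    // in a page that knows nothing about speech. Names and selectors are invented.
    type CommandAction = (doc: Document) => void;

    const userCommands: Record<string, CommandAction> = {
      // "send it" fills nothing in; it just pushes the page's own button.
      "send it": (doc) => {
        (doc.querySelector("#send-button") as HTMLButtonElement | null)?.click();
      },
      // "next field" moves focus roughly the way a Tab key press would.
      "next field": (doc) => {
        const fields = Array.from(doc.querySelectorAll<HTMLElement>("input, textarea"));
        const i = fields.indexOf(doc.activeElement as HTMLElement);
        fields[(i + 1) % fields.length]?.focus();
      },
    };

    // Called by the sandboxed customization host with each final recognition result.
    function dispatchCommand(utterance: string, doc: Document): boolean {
      const action = userCommands[utterance.trim().toLowerCase()];
      if (!action) return false;   // not a user command; let normal dictation proceed
      action(doc);
      return true;
    }

Whether that dispatch hook lives in the user agent, in a separate sandboxed process, or somewhere else entirely is exactly the part I'm not sure about; the sketch is only meant to show how little the page itself would need to know.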
Of course, this starts [an] entirely separate [hand | and] probably lengthy conversation on security issues. For the moment, let's wave our hands and say there are much smarter people who can solve this problem for us. My initial thought would be that there is either a sandbox within the user agent or a sandbox process on the host that holds all of the local customization code. Needs more thought.

>> I've raised this in the
>> context of accessibility but it's also a valid concern for third-party
>> vendors who come up with a better way to implement or expand an
>> interface for application.
> What case do you have in mind? Some other vendor to provide speech UI
> for Gmail for example? That would be possible with browser addons or
> Greasemonkey scripts. But IMO, not really something this group
> should focus on.

I would never, ever inflict Greasemonkey scripts on an end user or commercial customer as a way of doing any sort of add-on. Way too fragile. It may not be the responsibility of this group, but let's build a sane, rational, deterministic way of modifying a user interface that will survive updates and revisions.

In medical records, oftentimes there are specific procedures that a hospital wants to follow, but the "cast in stone" user interface forces them to change their procedures to meet the needs of the user interface. The addition of speech recognition only helps the transcription or dictation of text into various fields. If one could get in and modify the user interface, one could enable a less vocally intensive user interface built on the hospital's preferred workflow.

An interface I could talk more about would be an e-mail type interface. The basic process of sending a mail message would be something like starting out by dictating the body of the e-mail, then giving a list of people to send to (complete with visual feedback for resolution of ambiguous names), maybe adding a subject, and then the final command to send, with positive confirmation. That's upside down from the way the current user model works. Far too often I have been working on an e-mail message only to have NaturallySpeaking hear the command to deliver the message, and off it goes, with me trying hard not to scream so loudly that I terrify my office mates. Yes, that's a user interface I would love to rework. Sit me down with any application [up to | and I will tell] you how you need to rework it for speech recognition use. There is almost no overlap between a tall-and-narrow GUI and a wide-and-shallow speech UI.

--- eric
Received on Wednesday, 16 March 2011 15:52:42 UTC