Re: Agreed recognition API? from Bjorn Bringert on 2011-05-19 (public-xg-htmlspeech@w3.org from May 2011)

From: Bjorn Bringert <bringert@google.com>
Date: Thu, 19 May 2011 16:51:30 +0100
To: Olli@pettay.fi
Cc: public-xg-htmlspeech@w3.org
Message-ID: <BANLkTimAAHguC922-oXv6p1aUamveZ1S2A@mail.gmail.com>
On Thu, May 19, 2011 at 4:39 PM, Olli Pettay <Olli.Pettay@helsinki.fi> wrote:
> On 05/19/2011 05:59 PM, Bjorn Bringert wrote:
>>
>> By now the draft final report
>> (http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)
>> contains a number of design agreements for the JavaScript API for
>> speech recognition. I thought it would be a useful exercise to
>> translate those agreements into a concrete API.
>>
>> The below IDL describes my interpretation of the parts of the API that
>> we have agreed on so far. Many of the interface/function/attribute
>> names are not yet agreed, so I mixed and matched from the Microsoft,
>> Mozilla and Google proposals.
>>
>> interface SpeechInputRequest {
>>    // URL (http: or data:) for an SRGS XML document, with or without SISR
>>    // tags, or a URI for one of the predefined grammars
>>    attribute DOMString grammar;
>
> I think we need to support either multiple simultaneous grammars or
> SIRs. MS has GrammarCollection, so it supports multiple grammars; the
> SpeechRequest API supports multiple active recognition objects.

Yeah, this is a known area for discussion. I only put in the single
field, since we all agree that we need to support at least one grammar
:-)


>>    // Recognition language. Language declared in grammar overrides this.
>>    attribute DOMString lang;
>
> I wonder still how to handle language in a don't-leak-privacy-data way.
> There are very good use cases for lang, but the privacy problem should be
> solved.
>
>
>>    // URL for speech recognition engine, http: must be supported
>>    attribute DOMString engine;
>>
>>    // Not yet discussed I think, but Google and Microsoft proposals have it
>>    attribute long maxresults;
>
> Very reasonable.
>
>>
>>    // Some timeout parameters will likely be agreed, not yet discussed
>
> ditto
>
>>
>>     // Starts capturing audio and recognizing speech
>>     void startSpeechInput();
>>     // Stops capturing audio and lets speech recognition complete
>>     void stopSpeechInput();
>>     // Stops capturing audio and aborts speech recognition
>>     void cancelSpeechInput();
>>
>>     attribute Function onaudiostart;
>>     attribute Function onsoundstart;
>>     attribute Function onspeechstart;
>>     attribute Function onspeechend;
>>     attribute Function onsoundend;
>>     attribute Function onaudioend;
>>     attribute Function onresult;
>>     attribute Function onerror;
>> };
>> SpeechInputRequest implements EventTarget;
>>
>> Events:
>>
>> audiostart, interface: Event: Audio capture has started
>> soundstart, interface: Event: Some sound, possibly speech, has been
>> detected (low latency)
>> speechstart, interface: Event: Speech start has been detected
>> speechend, interface: Event: Speech end has been detected (hmm, can we
>> really guarantee that this comes before soundend if the latter is a
>> client endpointer?)
>> soundend, interface: Event: Sound end has been detected
>> audioend, interface: Event: Audio capture has finished
>> result, interface: SpeechResultEvent: Speech recognizer has returned a
>> final result with at least one recognition hypothesis
>> error, interface: Event (?): A speech recognition error has occurred
>>
>> // The event passed to the 'result' event handlers
>> interface SpeechResultEvent : Event {
>>     readonly attribute SpeechInputResult result;
>> };
>>
>> // Recognition result as EMMA + simple N-best list
>> interface SpeechInputResult {
>>     readonly attribute Document resultEMMAXML;
>>     readonly attribute DOMString resultEMMAText;
>>     readonly attribute unsigned long length;
>>     getter SpeechInputAlternative item(in unsigned long index);
>> };
>>
>> // Item in N-best list
>> interface SpeechInputAlternative {
>>     readonly attribute DOMString utterance;
>>     readonly attribute float confidence;
>>     readonly attribute any interpretation;
>> };
>>
>>
>> The HTML interface has not been agreed yet. The Mozilla proposal has
>> none. The Microsoft proposal has a <reco> element as a child of
>> <input> and other elements, or associated with elements using @for.
>> Google has @speech attribute for <input> elements. If we agree on
>> speech recognition element(s), the SpeechInputRequest interface should
>> either be able to serve as the DOM interface for such elements, or the
>> elements could have an attribute which contains a SpeechInputRequest.
>
> My proposal has the boundElement, which has some similarity to MS' @for, but
> sure, it doesn't have any element for the asr/reco, only a JS object.
>
>
>
> -Olli
>
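
For illustration, here is roughly how a page might drive the interface
quoted above. This is only a TypeScript sketch under stated assumptions,
not part of any proposal: MockSpeechInputRequest and its canned N-best
list are hypothetical stand-ins, no real recognizer or audio capture is
involved, and the event order simply follows the list in the draft
(including the speechend-before-soundend order that is still in question).

```typescript
// Hypothetical stand-ins for the drafted interfaces; no real recognizer is used.
interface SpeechInputAlternative {
  readonly utterance: string;
  readonly confidence: number;
  readonly interpretation: unknown;
}

interface SpeechInputResult {
  readonly length: number;
  item(index: number): SpeechInputAlternative;
}

interface SpeechEvent { readonly type: string; }
interface SpeechResultEvent extends SpeechEvent { readonly result: SpeechInputResult; }

type Handler<E> = ((event: E) => void) | null;

class MockSpeechInputRequest {
  maxresults = 1;
  onaudiostart: Handler<SpeechEvent> = null;
  onsoundstart: Handler<SpeechEvent> = null;
  onspeechstart: Handler<SpeechEvent> = null;
  onspeechend: Handler<SpeechEvent> = null;
  onsoundend: Handler<SpeechEvent> = null;
  onaudioend: Handler<SpeechEvent> = null;
  onresult: Handler<SpeechResultEvent> = null;

  private capturing = false;
  // Fixed N-best list this mock pretends to have recognized (assumption).
  private readonly canned: SpeechInputAlternative[];

  constructor(canned: SpeechInputAlternative[]) {
    this.canned = canned;
  }

  startSpeechInput(): void {
    this.capturing = true;
    this.onaudiostart?.({ type: "audiostart" });
    this.onsoundstart?.({ type: "soundstart" });
    this.onspeechstart?.({ type: "speechstart" });
  }

  stopSpeechInput(): void {
    if (!this.capturing) return;
    this.capturing = false;
    // Order as listed in the draft; not guaranteed by anything agreed so far.
    this.onspeechend?.({ type: "speechend" });
    this.onsoundend?.({ type: "soundend" });
    this.onaudioend?.({ type: "audioend" });
    // Deliver at most maxresults alternatives as the final result.
    const alternatives = this.canned.slice(0, this.maxresults);
    const result: SpeechInputResult = {
      length: alternatives.length,
      item: (index: number) => {
        const alternative = alternatives[index];
        if (alternative === undefined) throw new RangeError("index out of range");
        return alternative;
      },
    };
    this.onresult?.({ type: "result", result });
  }

  cancelSpeechInput(): void {
    if (!this.capturing) return;
    this.capturing = false;
    // Aborted: audio capture ends, but no result event is delivered.
    this.onaudioend?.({ type: "audioend" });
  }
}

// Usage: record which events arrive and read the N-best list.
const request = new MockSpeechInputRequest([
  { utterance: "pizza", confidence: 0.9, interpretation: null },
  { utterance: "pisa", confidence: 0.4, interpretation: null },
  { utterance: "visa", confidence: 0.1, interpretation: null },
]);
request.maxresults = 2;

const log: string[] = [];
const utterances: string[] = [];
const record = (event: SpeechEvent): void => { log.push(event.type); };

request.onaudiostart = record;
request.onspeechstart = record;
request.onspeechend = record;
request.onaudioend = record;
request.onresult = (event) => {
  log.push(event.type);
  for (let i = 0; i < event.result.length; i++) {
    utterances.push(event.result.item(i).utterance);
  }
};

request.startSpeechInput();
request.stopSpeechInput();
console.log(log.join(" "), utterances);
```

Handlers that are left unset (onsoundstart and onsoundend here) are
simply skipped, and maxresults caps how many alternatives the result
exposes through length and item().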



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Thursday, 19 May 2011 15:51:56 UTC