Agreed recognition API?

By now the draft final report
(http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech.html)
contains a number of design agreements for the JavaScript API for
speech recognition. I thought it would be a useful exercise to
translate those agreements into a concrete API.

The IDL below describes my interpretation of the parts of the API that
we have agreed on so far. Many of the interface/function/attribute
names are not yet agreed, so I mixed and matched from the Microsoft,
Mozilla and Google proposals.

interface SpeechInputRequest {
    // URL (http: or data:) for an SRGS XML document, with or without SISR tags,
    // or a URI for one of the predefined grammars
    attribute DOMString grammar;
    // Recognition language. Language declared in grammar overrides this.
    attribute DOMString lang;
    // URL for speech recognition engine, http: must be supported
    attribute DOMString engine;

    // Not yet discussed I think, but Google and Microsoft proposals have it
    attribute long maxresults;

    // Some timeout parameters will likely be agreed, not yet discussed

    // Starts capturing audio and recognizing speech
    void startSpeechInput();
    // Stops capturing audio and lets speech recognition complete
    void stopSpeechInput();
    // Stops capturing audio and aborts speech recognition
    void cancelSpeechInput();

    attribute Function onaudiostart;
    attribute Function onsoundstart;
    attribute Function onspeechstart;
    attribute Function onspeechend;
    attribute Function onsoundend;
    attribute Function onaudioend;
    attribute Function onresult;
    attribute Function onerror;
};
SpeechInputRequest implements EventTarget;
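
For concreteness, here is a rough sketch of how a page might use this
interface (it assumes a no-argument constructor, which we have not
actually discussed, and the grammar URL is made up):

  var request = new SpeechInputRequest();
  request.grammar = "http://example.com/pizza.grxml"; // SRGS XML, made-up URL
  request.lang = "en-US";
  request.maxresults = 5;
  request.onerror = function(event) {
    // Recognition failed or was aborted.
  };
  // Start capturing audio and recognizing speech.
  request.startSpeechInput();
  // Later, e.g. when the user releases a push-to-talk button:
  request.stopSpeechInput();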

Events:

audiostart, interface: Event: Audio capture has started
soundstart, interface: Event: Some sound, possibly speech, has been
detected (low latency)
speechstart, interface: Event: Speech start has been detected
speechend, interface: Event: Speech end has been detected (hmm, can we
really guarantee that this comes before soundend if the latter is a
client endpointer?)
soundend, interface: Event: Sound end has been detected
audioend, interface: Event: Audio capture has finished
result, interface: SpeechResultEvent: Speech recognizer has returned a
final result with at least one recognition hypothesis
error, interface: Event (?): A speech recognition error has occurred
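
Since SpeechInputRequest implements EventTarget, a page could also
watch the whole lifecycle with addEventListener, e.g. to check the
ordering question above (sketch only, reusing the request object from
the example above):

  ["audiostart", "soundstart", "speechstart", "speechend",
   "soundend", "audioend", "result", "error"].forEach(function(type) {
    request.addEventListener(type, function(event) {
      console.log("speech event: " + type);
    }, false);
  });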

// The event passed to the 'result' event handlers
interface SpeechResultEvent : Event {
    readonly attribute SpeechInputResult result;
};

// Recognition result as EMMA + simple N-best list
interface SpeechInputResult {
    readonly attribute Document resultEMMAXML;
    readonly attribute DOMString resultEMMAText;
    readonly attribute unsigned long length;
    getter SpeechInputAlternative item(in unsigned long index);
};

// Item in N-best list
interface SpeechInputAlternative {
    readonly attribute DOMString utterance;
    readonly attribute float confidence;
    readonly attribute any interpretation;
};
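
An onresult handler would then walk the N-best list, roughly like this
(sketch only, using the attribute names from the IDL above):

  request.onresult = function(event) {
    var result = event.result;
    for (var i = 0; i < result.length; i++) {
      var alt = result.item(i);
      console.log(alt.utterance + " (confidence " + alt.confidence + ")");
      // alt.interpretation holds the SISR semantic interpretation, if any.
    }
    // The same result is also available as EMMA:
    var emma = result.resultEMMAXML; // a DOM Document
  };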


The HTML interface has not been agreed yet. The Mozilla proposal has
none. The Microsoft proposal has a <reco> element as a child of
<input> and other elements, or associated with elements using @for.
The Google proposal has a @speech attribute on <input> elements. If we
agree on speech recognition element(s), the SpeechInputRequest
interface should either serve as the DOM interface for such elements,
or the elements should have an attribute that holds a SpeechInputRequest.
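
If we end up with the attribute-holding variant, usage might look
something like this (entirely hypothetical, since neither the element
nor the attribute name has been agreed):

  // Hypothetical: a speech-enabled <input> exposes a SpeechInputRequest.
  var input = document.getElementById("query");
  var request = input.speechInputRequest; // attribute name not agreed
  request.grammar = "http://example.com/search.grxml"; // made-up URL
  request.onresult = function(event) {
    // Fill the text field with the top recognition hypothesis.
    input.value = event.result.item(0).utterance;
  };
  request.startSpeechInput();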

-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
