SpeechRequest 1.0

Editor:
Olli Pettay <Olli@pettay.fi>

Abstract

....

Status of this Document

This document is an editors' copy that has no official standing.

Editor's Draft

Table of Contents

1 SpeechRequest
    1.1 Example 1, simple field level grammar + recognition handling
    1.2 Example 2, page level grammar + field level grammar
2 Text to speech
3 Requirements


1 SpeechRequest

        interface RecognitionResult
        {
          readonly attribute DOMString utterance;
          readonly attribute float confidence;
          readonly attribute any interpretation; // semantic interpretation
          readonly attribute DOMDocument fullResult; // EMMA document
        }

        interface RecognitionEvent : Event
        {
          readonly attribute RecognitionResult[] results;
        }

        // grammar is a URL to an SRGS grammar, a special-case grammar:<grammar_hint>
        // value (for example grammar:search or grammar:dictation), or null.
        //
        // Recognizer may use the language of aBoundElement as a hint.
        //
        // Note, recognizerURI and recognizerParams are for V2, in which case the
        // protocol between browser and external recognizer should be defined.
        //
        // The constructor throws an exception if any parameter is invalid or not understood by the browser.
        [Constructor(in optional DOMNode aBoundElement, 
                     in optional DOMString grammar,
                     in optional DOMString recognizerURI,
                     in optional DOMString recognizerParams)]
        interface SpeechRequest : EventTarget
        {
          // Calling activate() may bring up UI so that the user can allow or deny
          // speech recognition.
          void activate(in optional boolean allowMultiRecognition = false);
          void deactivate();
        
          // Starts the recognition. All the active requests are started.
          // Calling start() implicitly calls activate(allowMultiRecognition).
          // Returns true if calling start() succeeded. It may fail if the browser
          // doesn't allow starting recognition.
          boolean start(in optional boolean allowMultiRecognition = false);

          // Stops the recognition. All the active requests are stopped.
          void stop(in optional boolean keepActive = false);

          attribute EventListener onstart; // Recognition started
          // Recognition stopped, either by stop() or because the user stopped interacting with the web page.
          attribute EventListener onstop;

          attribute EventListener onabort; // User aborts the recognition
        
          attribute EventListener onerror;  // An error which fires if the request is somehow invalid.
                                            // This could happen if loading the grammar doesn't work.
        
          attribute EventListener ondeactivated; // Some other request overrides this one
          attribute EventListener onactivated; // The overriding request is deactivated.

          attribute EventListener onpermissiondenied;
          attribute EventListener onpermissionaccepted;

          attribute EventListener onspeechstart;
          attribute EventListener onspeechstop;
          attribute EventListener onrecognition;

          attribute EventListener onnomatch;
          attribute EventListener onnospeech;
        }
    

By default only one request is active at a time. If another request is activated while one is already active, a deactivated event is dispatched to the previously active request. When the overriding request is deactivated, the first one is activated again (unless deactivate() was explicitly called on it). However, if true is passed to the activate() method, the old active request stays active, and it is up to the web app to handle several simultaneous speech recognitions.
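
The activation rules above can be illustrated with a small stand-alone model. This is only a sketch: ActivationModel and its fields are hypothetical names that are not part of the proposed API, requests are represented by plain strings, and the model assumes that only the most recently overridden request is restored, which is one possible reading of the text.

```javascript
// Hypothetical model of the activation rules; not part of the proposed API.
class ActivationModel {
  constructor() {
    this.active = [];    // names of the requests that are currently active
    this.overridden = []; // requests displaced by a later activate(), most recent last
    this.log = [];       // [requestName, eventName] pairs, in dispatch order
  }

  activate(name, allowMultiRecognition = false) {
    if (this.active.includes(name)) return;
    this.overridden = this.overridden.filter((n) => n !== name);
    if (!allowMultiRecognition) {
      // Default case: every currently active request is overridden
      // and receives a deactivated event.
      for (const old of this.active) {
        this.overridden.push(old);
        this.log.push([old, "deactivated"]);
      }
      this.active = [];
    }
    this.active.push(name);
  }

  deactivate(name) {
    // An explicit deactivate() also removes the request from the
    // overridden list, so it is never restored automatically.
    const wasActive = this.active.includes(name);
    this.active = this.active.filter((n) => n !== name);
    this.overridden = this.overridden.filter((n) => n !== name);
    if (wasActive && this.active.length === 0 && this.overridden.length > 0) {
      // Assumption: only the most recently overridden request comes back.
      const restored = this.overridden.pop();
      this.active.push(restored);
      this.log.push([restored, "activated"]);
    }
  }
}

const model = new ActivationModel();
model.activate("pageLevel");
model.activate("fieldLevel");   // overrides pageLevel
model.deactivate("fieldLevel"); // pageLevel becomes active again
console.log(model.log);
```

Passing true as the second argument to activate() in this model leaves the earlier request in the active list and records no events.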

In particular, if a non-same-origin iframe tries to activate a speech request, the UA should show a doorhanger or similar UI, and the user should have to grant the page permission before recognition starts. The UA may use the bound element as a hint for where to show the doorhanger. If aBoundElement is not given, the root element or the document should be used. Implementations need to be careful when handling SpeechRequest objects from hidden iframes.
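
Because the UA may refuse recognition or the user may deny permission, a page will usually want a fallback path. The following is a minimal sketch of one way to do that. startWithFallback and the reason strings are hypothetical and not part of the proposed API; only start(), onpermissiondenied and onerror come from the interface above, and whether start() returns false while a permission prompt is still pending is not specified by this draft.

```javascript
// Hypothetical helper; "request" is any object exposing the SpeechRequest
// members start(), onpermissiondenied and onerror.
function startWithFallback(request, fallback) {
  // The user, or the UA on the user's behalf, denied recognition.
  request.onpermissiondenied = function () {
    fallback("permissiondenied");
  };
  // The request was invalid, for example the grammar failed to load.
  request.onerror = function () {
    fallback("error");
  };
  // start() returns false if the browser does not allow starting recognition.
  var started = request.start();
  if (!started) {
    fallback("start-refused");
  }
  return started;
}
```

A page could call startWithFallback(s, showKeyboardHint) from an onfocus handler, where showKeyboardHint is whatever non-speech input path the page already offers.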

Speech recognition is activated, for example, using a button in the browser chrome, or when the user actively interacts with the web page by clicking the mouse or pressing a key on the keyboard.

1.1 Example 1, simple field level grammar + recognition handling

<input id="res" onfocus="s.start();" onblur="s.stop()">
<script>
  // The input is declared before the script so that getElementById finds it.
  var s = new SpeechRequest(document.getElementById('res'), "grammar.xml");
  s.onrecognition =
    function(e) {
      document.getElementById("res").value = e.results[0].utterance;
    }
</script>
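
Example 1 always takes the first entry of results. Since each RecognitionResult also carries a confidence value, a page may prefer to pick the most confident entry and ignore weak ones. The sketch below shows one way to do that. bestResult and the threshold parameter are hypothetical, and the draft does not say whether results is already ordered by confidence, so the helper does not assume any order.

```javascript
// Hypothetical helper; not part of the proposed API.
// Returns the result with the highest confidence that is at least
// minConfidence, or null if no result qualifies.
function bestResult(results, minConfidence) {
  var best = null;
  for (var i = 0; i < results.length; i++) {
    var r = results[i];
    if (r.confidence >= minConfidence &&
        (best === null || r.confidence > best.confidence)) {
      best = r;
    }
  }
  return best;
}
```

Inside an onrecognition handler this could be used as var r = bestResult(e.results, 0.5); followed by a check for null before assigning r.utterance to the field. The value 0.5 is only a placeholder.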

1.2 Example 2, page level grammar + field level grammar

<input id="res" onfocus="fieldLevel.start(true);" onblur="fieldLevel.stop()">
<script>
  var pageLevel = new SpeechRequest(null, "globalgrammar.xml");
  pageLevel.onrecognition = function(e) {
    // do something page level, like zoom a map or give some help
  }
  // Page level grammar is active all the time.
  // Browser UI may start the recognition even if
  // start() isn't explicitly called.
  pageLevel.activate();

  // The input is declared before the script so that getElementById finds it.
  var fieldLevel = new SpeechRequest(document.getElementById('res'), "fieldgrammar.xml");
  fieldLevel.onrecognition = function(e) {
    // RecognitionEvent carries results[]; the utterance lives on each result.
    document.getElementById("res").value = e.results[0].utterance;
  }
</script>
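
The page level handler in Example 2 is left empty. One way to fill it in is a simple lookup from the recognized utterance to an action, sketched below. dispatchCommand and the command table are hypothetical and not part of the proposed API. A real page would more likely dispatch on the interpretation attribute, whose shape depends on the grammar and is therefore not assumed here.

```javascript
// Hypothetical dispatcher; not part of the proposed API.
// Looks up the first result's utterance in a table of command handlers.
// Returns true if a handler was found and called, false otherwise.
function dispatchCommand(results, commands) {
  if (!results || results.length === 0) {
    return false;
  }
  var key = results[0].utterance.trim().toLowerCase();
  // hasOwnProperty keeps inherited names such as "toString" from matching.
  if (Object.prototype.hasOwnProperty.call(commands, key)) {
    commands[key](results[0]);
    return true;
  }
  return false;
}
```

A page level handler could then read pageLevel.onrecognition = function(e) { dispatchCommand(e.results, commands); }, where commands maps phrases from the page's own grammar to functions.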

2 Text to speech

For text to speech, an approach similar to that of the Speech Input API can be used: the HTML audio element could be extended to support TTS. This draft does not specify how the audio element should be extended.

3 Requirements

Some of the requirements marked as PASS depend on the UA showing reasonable UI to the user.
    3.1 Web Authoring Feature Requirements
        3.1.1 Web Authoring Feature Speech System Requirements
       [PASS] FPR8. User agent (browser) can refuse to use requested speech service. 
 [Only in V2] FPR11. If the web apps specify speech services, it should be possible to specify parameters.
 [Only in V2] FPR12. Speech services that can be specified by web apps must include network speech services.
       [FAIL] FPR31. User agents and speech services may agree to use alternate protocols for communication.
       [PASS] FPR32. Speech services that can be specified by web apps must include local speech services.
       [FAIL] FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity and low bandwidth requirements.
       [PASS] FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking).
 [Only in V2] FPR58. Web application and speech services must have a means of binding session information to communications.
 
        3.1.2 Web Authoring Feature Recognition Requirements
       [PASS] FPR2. Implementations must support the XML format of SRGS and must support SISR.
       [PASS] FPR4. It should be possible for the web application to get the recognition results in a standard format such as EMMA.
       [PASS] FPR19. User-initiated speech input should be possible.
       [PASS] FPR21. The web app should be notified that capture starts.
       [PASS] FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition.
       [PASS] FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition.
       [PASS] FPR24. The web app should be notified when recognition results are available.
       [PASS] FPR25. Implementations should be allowed to start processing captured audio before the capture completes.
       [PASS] FPR26. The API to do recognition should not introduce unneeded latency.
       [PASS] FPR27. Speech recognition implementations should be allowed to add implementation specific information to speech recognition results.
       [FAIL] FPR28. Speech recognition implementations should be allowed to fire implementation specific events.
       [PASS] FPR34. Web application must be able to specify domain specific custom grammars.
       [PASS] FPR35. Web application must be notified when speech recognition errors or non-matches occur.
       [PASS] FPR42. It should be possible for user agents to allow hands-free speech input.
       [PASS] FPR43. User agents should not be required to allow hands-free speech input.
       [PASS] FPR47. When speech input is used to provide input to a web app, it should be possible for the user to select alternative input methods.
       [FAIL] FPR48. Web application author must be able to specify a domain specific statistical language model.
       [PASS] FPR50. Web applications must not be prevented from integrating input from multiple modalities.
       [PASS] FPR54. Web apps should be able to customize all aspects of the user interface for speech recognition, except where such customizations conflict with security and privacy requirements in this document, or where they cause other security or privacy problems.
       [FAIL] FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent).
       [FAIL] FPR57. Web applications must be able to request recognition based on previously sent audio.
       [PASS] FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process.
       
        3.1.3 Web Authoring Feature Synthesis Requirements
       [FAIL] FPR3. Implementation must support SSML.
       [FAIL] FPR29. Speech synthesis implementations should be allowed to fire implementation specific events.
       [PASS] FPR41. It should be easy to extend the standard without affecting existing speech applications.
       [FAIL] FPR46. Web apps should be able to specify which voice is used for TTS.
       [FAIL] FPR51. The web app should be notified when TTS playback starts.
       [FAIL] FPR52. The web app should be notified when TTS playback finishes.
       [FAIL] FPR53. The web app should be notified when the audio corresponding to a TTS mark element is played back.
       [FAIL] FPR60. Web application must be able to programmatically abort TTS output.
    3.2 Web Authoring Convenience Requirements
        3.2.1 Web Authoring Convenience Speech System Requirements
 [Only in V2] FPR7. Web apps should be able to request speech service different from default.
       [PASS] FPR9. If browser refuses, it must inform the web app.
 [Only in V2] FPR10. If browser uses speech services other than the default one, it must inform the user which one(s) it is using.
 [Only in V2] FPR30. Web applications must be allowed at least one form of communication with a particular speech service that is supported in all UAs.
        3.2.2 Web Authoring Convenience Recognition Requirements
       [PASS] FPR5. It should be easy for the web apps to get access to the most common pieces of recognition results such as utterance, confidence, and nbests.
       [PASS] FPR6. Browser must provide default speech resource.
       [PASS] FPR36. User agents must provide a default interface to control speech recognition.
       [PASS] FPR38. Web application must be able to specify language of recognition.
       [FAIL] FPR39. Web application must be able to be notified when the selected language is not available.
       [PASS] FPR44. Recognition without specifying a grammar should be possible.
       [PASS] FPR45. Applications should be able to specify the grammars (or lack thereof) separately for each recognition.
        3.2.3 Web Authoring Convenience Synthesis Requirements
       [PASS] FPR13. It should be easy to assign recognition results to a single input field.
       [PASS] FPR14. It should not be required to fill an input field every time there is a recognition result.
       [PASS] FPR15. It should be possible to use recognition results to multiple input fields.
       [FAIL] FPR61. Aborting the TTS output should be efficient.
    3.3 Security and Privacy Requirements
        3.3.1 Security and Privacy Speech System Requirements
       [PASS] FPR16. User consent should be informed consent.
       [PASS] FPR20. The spec should not unnecessarily restrict the UA's choice in privacy policy.
[Only in V2?] FPR55. Web application must be able to encrypt communications to remote speech service.
        3.3.2 Security and Privacy Recognition Requirements
       [PASS] FPR1. Web applications must not capture audio without the user's consent.
       [PASS] FPR17. While capture is happening, there must be an obvious way for the user to abort the capture and recognition process.
       [PASS] FPR18. It must be possible for the user to revoke consent.
       [FAIL] FPR37. Web application should be given captured audio access only after explicit consent from the user.
       [PASS] FPR49. End users need a clear indication whenever microphone is listening to the user.