Proposal for SpeechInputResult

Here is an IDL proposal for SpeechInputResult (currently left empty in the
draft doc) and an example showing how it may be used.

Specifically, this tries to address the following:

   1. Should work with both intermediate/preliminary and final/stable
   results
   2. Should allow the speech service to give alternatives, so the user can
   tap on portions of the recognized text and select a different alternative
   to fill in.

IDL:

interface SpeechInputResult {
  readonly attribute Hypothesis[] prelim;
  readonly attribute Hypothesis[] stable;
  readonly attribute Alternative[] alternatives;
}


interface Hypothesis {
  readonly attribute DOMString utterance;
  readonly attribute float confidence;  // Range 0.0 - 1.0
}

In the case of preliminary results, only .prelim is valid, and it is
expected to be non-empty.
- Preliminary results give the recognition hypotheses for the speech after
the last stable result (i.e. they are not relative to the last preliminary
result).

In the case of final (or stable, as I call them here) results, all 3
attributes may be valid.
- .stable is expected to be non-empty here, because if it were empty the
'nomatch' event would be fired instead.
- If .prelim is non-empty, these preliminary results are for the next run
(i.e. the stable results were given for one part of the speech stream and
the prelim results are for the speech after that).
- If .alternatives is non-empty, it applies to the top stable result. We
could technically design the API to support alternatives for every single
stable hypothesis, but the user is most likely to either change the whole
recognized phrase to a different one, or correct parts of the top result and
continue to speak.
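
As a rough sketch of how a web app might consume these results, the
TypeScript below keeps a running transcript: the top stable hypothesis is
committed, and the top prelim hypothesis replaces whatever was pending. The
Transcript class and topUtterance helper are hypothetical names, not part of
the proposal, and picking the "top" hypothesis by highest confidence is an
assumption.

```typescript
// Hypothetical TypeScript mirrors of the proposed interfaces.
interface Hypothesis {
  utterance: string;
  confidence: number; // Range 0.0 - 1.0
}

interface SpeechInputResult {
  prelim?: Hypothesis[]; // may be absent or empty
  stable?: Hypothesis[]; // absent or empty for preliminary results
}

// Assumption: the "top" hypothesis is the one with the highest confidence;
// on a tie the earlier entry wins.
function topUtterance(hyps?: Hypothesis[]): string {
  if (!hyps || hyps.length === 0) return "";
  return hyps.reduce((best, h) => (h.confidence > best.confidence ? h : best))
    .utterance;
}

class Transcript {
  private committed: string[] = []; // one entry per stable result
  private pending = ""; // top hypothesis of the current run

  // Feed one result from the onresult handler; returns the text to display.
  update(result: SpeechInputResult): string {
    const stable = topUtterance(result.stable);
    if (stable !== "") this.committed.push(stable);
    // Prelim results are relative to the last stable result, so they
    // replace (never extend) the previous pending text.
    this.pending = topUtterance(result.prelim);
    return this.committed
      .concat(this.pending)
      .filter((s) => s !== "")
      .join(" ");
  }
}
```

Because each prelim result replaces the pending text, the app never has to
diff one preliminary result against the previous one.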

Every alternative item points to one segment in the top stable result and
gives the alternative hypotheses for that segment.


interface Alternative {
  // Index in the stable hypothesis' utterance from where the below spans
  // start
  readonly attribute int start;
  readonly attribute AlternativeSpan[] spans;
}

interface AlternativeSpan {
  // Length of the span in the original utterance which is replaced by the
  // below array
  readonly attribute int length;
  // Confidence value of the span in the original utterance, range 0.0 - 1.0
  readonly attribute float confidence;
  // Other hypotheses for this span in the original utterance
  readonly attribute Hypothesis[] hypotheses;
}
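
To illustrate how start and length are meant to combine, here is a minimal
TypeScript sketch of applying a user-selected alternative to the top stable
utterance. The applyAlternative function is a hypothetical helper, not part
of the proposal, and it assumes that start and length count JavaScript
string indices (UTF-16 code units), which the IDL above does not specify.

```typescript
// Hypothetical TypeScript mirrors of the proposed interfaces.
interface Hypothesis {
  utterance: string;
  confidence: number;
}

interface AlternativeSpan {
  length: number;
  confidence: number;
  hypotheses: Hypothesis[];
}

interface Alternative {
  start: number;
  spans: AlternativeSpan[];
}

// Replace the segment [start, start + length) of the top stable utterance
// with the chosen hypothesis for that span.
function applyAlternative(
  utterance: string,
  alt: Alternative,
  spanIndex: number,
  hypothesisIndex: number
): string {
  const span = alt.spans[spanIndex];
  const end = alt.start + span.length;
  if (alt.start < 0 || end > utterance.length) {
    throw new RangeError("span lies outside the utterance");
  }
  const replacement = span.hypotheses[hypothesisIndex].utterance;
  return utterance.slice(0, alt.start) + replacement + utterance.slice(end);
}

// Usage: the user taps "this" in "testing this" and picks "these".
const tapped: Alternative = {
  start: 8,
  spans: [
    { length: 4, confidence: 0.8, hypotheses: [{ utterance: "these", confidence: 0.6 }] },
  ],
};
console.log(applyAlternative("testing this", tapped, 0, 0)); // "testing these"
```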


Example:

When the user speaks "testing this example", the web app may receive the
following sequence of SpeechInputResult objects in the onresult event
handler. Each hypothesis is abbreviated as { utterance, confidence }.


   1. {
     "prelim": [{ "text", 0.01 }]
   }
   2. {
     "prelim": [{ "test", 0.99 }, { "sting", 0.01 }]
   }
   3. {
     "prelim": [{ "testing", 0.99 }, { "this", 0.01 }]
   }
   4. {
     "prelim": [{ "testing", 0.99 }, { "this", 0.99 }]
   }
   5. {
     "stable": [{ "testing this", 1.0 }, { "testing", 0.1 }],
     // Alternatives are based on the "testing this" top scored stable
     // result.
     "alternatives": [
       {
         "start": 0,
         "spans": [
           {
             "length": 4,
             "confidence": 0.9,
             "hypotheses": [{ "text", 0.2 }, { "tent", 0.1 }]
           },
           {
             "length": 12,
             "confidence": 0.6,
             "hypotheses": [{ "exit sis", 0.02 }]
           }
         ]
       },
       {
         "start": 8,
         "spans": [
           {
             "length": 4,
             "confidence": 0.8,
             "hypotheses": [{ "these", 0.6 }]
           }
         ]
       }
     ],
     // Speech after "testing this" belongs to a new independent
     // recognition run.
     "prelim": [{ "ex", 0.01 }, { "apple", 0.01 }]
   }
   6. {
     "prelim": [{ "example", 0.99 }]
   }
   7. {
     "stable": [{ "example", 1.0 }]
   }
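
For reference, result 5 can be written out with named fields, which also
makes it easy to check which part of "testing this" each span covers. This
is only an illustrative TypeScript sketch: the h and coveredSegments helpers
are hypothetical, and start and length are assumed to be JavaScript string
indices.

```typescript
// Hypothetical TypeScript mirrors of the proposed interfaces.
interface Hypothesis {
  utterance: string;
  confidence: number;
}

interface AlternativeSpan {
  length: number;
  confidence: number;
  hypotheses: Hypothesis[];
}

interface Alternative {
  start: number;
  spans: AlternativeSpan[];
}

// Expands the shorthand used in the example: h("text", 0.2) stands for
// { "text", 0.2 }.
const h = (utterance: string, confidence: number): Hypothesis => ({
  utterance,
  confidence,
});

// Result 5 from the sequence above.
const result5 = {
  stable: [h("testing this", 1.0), h("testing", 0.1)],
  alternatives: [
    {
      start: 0,
      spans: [
        { length: 4, confidence: 0.9, hypotheses: [h("text", 0.2), h("tent", 0.1)] },
        { length: 12, confidence: 0.6, hypotheses: [h("exit sis", 0.02)] },
      ],
    },
    {
      start: 8,
      spans: [{ length: 4, confidence: 0.8, hypotheses: [h("these", 0.6)] }],
    },
  ] as Alternative[],
  prelim: [h("ex", 0.01), h("apple", 0.01)],
};

// List the original text that each span would replace, in order.
function coveredSegments(utterance: string, alternatives: Alternative[]): string[] {
  const segments: string[] = [];
  for (const alt of alternatives) {
    for (const span of alt.spans) {
      segments.push(utterance.slice(alt.start, alt.start + span.length));
    }
  }
  return segments;
}

console.log(coveredSegments(result5.stable[0].utterance, result5.alternatives));
// ["test", "testing this", "this"]
```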

--
Cheers
Satish

Received on Wednesday, 21 September 2011 09:27:28 UTC