RE: Requirement for UA / SS protocol from Young, Milan on 2010-11-19 (public-xg-htmlspeech@w3.org from November 2010)

From: Young, Milan <Milan.Young@nuance.com>
Date: Fri, 19 Nov 2010 08:20:05 -0800
To: "Bjorn Bringert" <bringert@google.com>, "Robert Brown" <Robert.Brown@microsoft.com>
Cc: <public-xg-htmlspeech@w3.org>
Message-ID: <1AA381D92997964F898DF2A3AA4FF9AD09631111@SUN-EXCH01.nuance.com>

First, the use cases:

Web-app to SS events - The user wants to place a call, but can't quite remember the name of the contact.  The visual UI scrolls through the list of contacts and sends an event to the speech service each time a new contact is displayed.  The recognizer uses this information to weight the recognition result, because the user is likely to speak a name they have just seen.
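
To make the weighting idea concrete, here is a minimal sketch in Python.  It assumes the recognizer produces an n-best list of (name, score) pairs and that the web app has reported which contacts it displayed.  The function name rescore and the boost value are purely illustrative and not part of any proposal.

```python
def rescore(hypotheses, recently_shown, boost=0.2):
    """Re-rank (name, score) hypotheses, favouring names the UI just displayed.

    `boost` is an arbitrary illustrative constant, not a recommended value.
    """
    shown = {name.lower() for name in recently_shown}
    rescored = [
        (name, score + boost if name.lower() in shown else score)
        for name, score in hypotheses
    ]
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list and display events.
n_best = [("John Smith", 0.50), ("Joan Smith", 0.45)]
ranked = rescore(n_best, recently_shown=["Joan Smith"])
```

Here the lower-scoring hypothesis moves to the top only because the UI reported showing that contact; with no display events the original order is kept.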

Re-recognition using previous audio - The user has requested directory assistance to find a residential phone number.  They thought the contact lived in City-A, but no relevant results were found.  They want to try again in City-B, and shouldn't have to repeat the utterance.
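
A rough sketch of what this could look like, assuming the speech service (or UA) retains the last utterance for the session.  RecognitionSession, fake_recognizer and the dictionary "grammars" are hypothetical stand-ins invented for illustration, not a proposed API.

```python
class RecognitionSession:
    """Keeps the last utterance so it can be recognized again."""

    def __init__(self, recognizer):
        # `recognizer` is a hypothetical callable: (audio_bytes, grammar) -> result
        self._recognizer = recognizer
        self._last_audio = None

    def recognize(self, audio, grammar):
        self._last_audio = audio
        return self._recognizer(audio, grammar)

    def rerecognize(self, grammar):
        if self._last_audio is None:
            raise RuntimeError("no previous audio to re-recognize")
        return self._recognizer(self._last_audio, grammar)


def fake_recognizer(audio, grammar):
    # Stand-in for a real speech service: looks the "audio" up in the grammar.
    return grammar.get(audio)


city_a = {}                              # no listing in City-A
city_b = {b"<utterance>": "555-0100"}    # listing found in City-B
session = RecognitionSession(fake_recognizer)
first = session.recognize(b"<utterance>", city_a)
second = session.rerecognize(city_b)
```

The second attempt reuses the stored audio with a different grammar, so the user is not prompted to speak again.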

Interpretation over text - A common approach in speech processing is to use a large statistical or speaker-dependent model to identify the lexical nature of the tokens.  A second pass over the data then extracts meaning from the tokens.  At present, this second pass seems to be the more difficult task, and sometimes several attempts need to be made, each with a different base context.  For example, the word "bill" might refer to a financial transaction, a duck's beak, the brim of a hat, or a person.
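
As a sketch of the "several attempts, each with a different base context" idea, assuming the first pass has already produced text.  The two context functions and their rules are made up for illustration; a real second pass would more likely use grammars or statistical models.

```python
def interpret(text, contexts):
    """Try each (name, interpreter) in order; return the first successful reading."""
    for name, interpreter in contexts:
        meaning = interpreter(text)
        if meaning is not None:
            return name, meaning
    return None


def contacts_context(text):
    # Hypothetical rule: only understands "call <name>".
    words = text.lower().split()
    if len(words) == 2 and words[0] == "call":
        return {"action": "call", "person": words[1]}
    return None


def billing_context(text):
    # Hypothetical rule: treats "bill" as an invoice when paired with "pay".
    words = text.lower().split()
    if "pay" in words and "bill" in words:
        return {"action": "pay", "object": "invoice"}
    return None


contexts = [("contacts", contacts_context), ("billing", billing_context)]
result_pay = interpret("pay my bill", contexts)
result_call = interpret("call bill", contexts)
```

The same token "bill" ends up as an invoice in one reading and as a person in the other, and no audio is needed for either attempt.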



Other responses:

FPR11 - I'd like to add a second sentence to this statement (either in summary or text).  "This includes both TBD standard and extension parameters."

FPR28 and 29 - Perhaps these statements could be slightly adjusted with "... fire implementation-specific events to the web app".

FPR4 - EMMA seems like a sufficiently flexible standard to handle all foreseeable needs.  Perhaps now is not the time, but I'd like to see if we can agree to make this a required part of the protocol.


Thanks



-----Original Message-----
From: Bjorn Bringert [mailto:bringert@google.com] 
Sent: Friday, November 19, 2010 3:17 AM
To: Robert Brown
Cc: Young, Milan; public-xg-htmlspeech@w3.org
Subject: Re: Requirement for UA / SS protocol

We already had a requirement that there must be a standard protocol.
If I understand this list correctly, it adds a number of requirements
on what features this standard protocol must support. I propose that
we consider each of the bullet points a separate requirement, so that
they can be discussed independently.

I think that most of them look fine. The only two that I'm not sure about are:

- web-app -> speech service events, with the same objection that Robert raised.

- Re-recognition using previous audio streams. What's the use case for this?


Also, I think that the following are already covered by existing requirements:

- "Both standard and extension parameters passed from the web app to
the speech service at the start of the interaction.  List of standard
parameters TBD."
  Covered by "FPR11. If the web apps specify speech services, it
should be possible to specify parameters."

- The speech service -> web app part of the bidirectional events
requirement is covered by:

FPR21. The web app should be notified that capture starts.
FPR22. The web app should be notified that speech is considered to
have started for the purposes of recognition.
FPR23. The web app should be notified that speech is considered to
have ended for the purposes of recognition.
FPR24. The web app should be notified when recognition results are available.
FPR28. Speech recognition implementations should be allowed to fire
implementation specific events.
FPR29. Speech synthesis implementations should be allowed to fire
implementation specific events.

- "EMMA results passed from the SS to the web app.  The syntax of this
result is TBD (e.g. XML and/or JSON)."
Covered by:
FPR4. It should be possible for the web application to get the
recognition results in a standard format such as EMMA.

- "Interpretation over text."
Covered by (if I understand it correctly):
FPR2. Implementations must support the XML format of SRGS and must support SISR.


So, the remaining requirements from Milan's list that I support adding are:

* At least one standard audio codec.  UAs are permitted to advertise
alternate codecs at the start of the interaction and SSs are allowed
to select any such alternate (e.g. HTTP Accept).

* Transport layer security (e.g. HTTPS) if requested by the web app.

* Session identifier that could be used to form continuity across
multiple interactions (e.g. HTTP cookies).
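
To illustrate the codec point only: a minimal sketch of the selection step, assuming the UA advertises an ordered list of alternate codecs and the SS falls back to the standard codec when it supports none of them.  The codec names are placeholders, since the standard codec has not been chosen.

```python
STANDARD_CODEC = "audio/x-standard"  # placeholder; the actual codec is still TBD


def select_codec(advertised, service_supported):
    """Pick the first UA-advertised codec the service supports, else the standard one."""
    for codec in advertised:
        if codec in service_supported:
            return codec
    return STANDARD_CODEC


# Hypothetical alternates advertised by the UA, in order of preference.
chosen = select_codec(["audio/x-alt-a", "audio/x-alt-b"], {"audio/x-alt-b"})
fallback = select_codec(["audio/x-alt-a"], set())
```

Because every UA and SS must implement the standard codec, the fallback always succeeds even when no alternate is shared.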

/Bjorn

On Fri, Nov 19, 2010 at 1:49 AM, Robert Brown
<Robert.Brown@microsoft.com> wrote:
> I mostly agree.  But do we need bidirectional events?  I suspect all the
> interesting ones originate at the server: start-of-speech; hypothesis;
> partial result; warnings of noise, crosstalk, etc.  I'm trying to think why
> the server would care about events from the client, other than when the
> client is done sending audio (which it could do in response to a click or
> end-point detection).
>
>
>
> From: public-xg-htmlspeech-request@w3.org
> [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Young, Milan
> Sent: Thursday, November 18, 2010 5:34 PM
> To: public-xg-htmlspeech@w3.org
> Subject: Requirement for UA / SS protocol
>
>
>
> Hello,
>
>
>
> On the Nov 18th conference, I volunteered to send out proposed wording for a
> new requirement:
>
> Summary - User agents and speech services are required to support at least
> one common protocol.
>
> Description - A common protocol will be defined as part of the final
> recommendation.  It will be built upon some TBD existing application layer
> protocol and include support for the following:
>
>
>
>   * Streaming audio data (e.g. HTTP 1.1 chunking).  This includes both audio
> streamed from UA to SS during recognition and audio streamed from SS to UA
> during synthesis.
>
>
>
>   * Bidirectional events which can occur anytime during the interaction.
> These events could originate either within the web app (e.g. click) or the
> SS (e.g. start-of-speech or mark) and must be transmitted through the UA in
> a timely fashion.  The set of events includes both standard events defined by
> the final recommendation and extension events.
>
>
>
>   * Both standard and extension parameters passed from the web app to the
> speech service at the start of the interaction.  List of standard parameters
> TBD.
>
>
>
>   * EMMA results passed from the SS to the web app.  The syntax of this
> result is TBD (e.g. XML and/or JSON).
>
>
>
>   * At least one standard audio codec.  UAs are permitted to advertise
> alternate codecs at the start of the interaction and SSs are allowed to
> select any such alternate (e.g. HTTP Accept).
>
>
>
>   * Transport layer security (e.g. HTTPS) if requested by the web app.
>
>
>
>   * Session identifier that could be used to form continuity across multiple
> interactions (e.g. HTTP cookies).
>
>
>
>   * Interpretation over text.
>
>
>
>   * Re-recognition using previous audio streams.
>
> Thank you
>
>



-- 
Bjorn Bringert
Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
Palace Road, London, SW1W 9TQ
Registered in England Number: 3977902
Received on Friday, 19 November 2010 16:20:45 UTC