[minutes] 22 September 2011 from Dan Burnett on 2011-09-28 (public-xg-htmlspeech@w3.org from September 2011)

From: Dan Burnett <dburnett@voxeo.com>
Date: Wed, 28 Sep 2011 18:38:40 -0400
To: public-xg-htmlspeech@w3.org
Message-Id: <A6B0DE30-FA00-449D-9678-E4F81077DFE3@voxeo.com>
Group,

The minutes from last week's call are available at http://www.w3.org/2011/09/22-htmlspeech-minutes.html.

For convenience, a text version is embedded below.

Thanks to Satish Sampath for taking the minutes.

-- dan

**********************************************************************************

              HTML Speech Incubator Group Teleconference

22 Sep 2011

   [2]Agenda

      [2] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0036.html

   See also: [3]IRC log

      [3] http://www.w3.org/2011/09/22-htmlspeech-irc

Attendees

   Present
          Dan_Burnett, Olli_Pettay, Debbie_Dahl, Robert_Brown,
          Dan_Druta, Bjorn_Bringert, Satish_Sampath, Michael_Bodell,
          Glen_Shires, Patrick_Ehlen, Milan_Young, Charles_Hemphill,
          Michael_Johnston

   Regrets
   Chair
          Dan_Burnett, Michael_Bodell

   Scribe
          Satish_Sampath

Contents

     * [4]Topics
         1. [5]Continuation of the Web API discussion
         2. [6]IDL for SpeechInputRequest sent earlier
         3. [7]continuous reco attribute
         4. [8]filtering offensive words attribute
     * [9]Summary of Action Items
     _________________________________________________________


   burn: first topic is TPAC and we will likely have work to be done in
   the webapi and protocol in a face to face and some work on the
   document. It is highly likely we'll have significant discussions and
   we'll have 2 full days.
   ... number of people who register determines the place and number of
   power outlets, so please register

   <glen> Meetings at TPAC Nov 3-4 Santa Clara, CA
   [11]http://www.w3.org/2011/11/TPAC/Overview.html

     [11] http://www.w3.org/2011/11/TPAC/Overview.html

   <glen> Register by Oct 14 for lower fee

   <glen> Best hotel rates / rooms by Oct 10

   burn: the two days that matter for us are thursday/friday

Continuation of the Web API discussion

   <mbodell>
   [12]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0033.html

     [12] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0033.html

IDL for SpeechInputRequest sent earlier

   <burn> satish: reviews his IDL proposal (see link above)

   bringert: could start with the saveWaveformURI and inputWaveformURI
   questions
   ... could be part of the MediaStream 'input' attribute, may not need
   a separate URI

   robert: doesn't address where the waveform is at a remote server

   bringert: what is the use case?

   robert: re-reco is one use case where audio was saved
   ... may have a 3 second utterance and don't want to upload it again

   <mbodell> This comes from "FPR57. Web applications must be able to
   request recognition based on previously sent audio."

   bringert: seems like a corner case and adds complexity to
   implementation
   ... could be a random unique token instead of a uri
   ... shouldn't say in the api that the uri should be downloadable and
   fetch the full audio as a file
   ... perhaps replace it with a rerecognize method

   Debbie: what if user wants to listen to what they said?

   bringert: could be a UA feature instead of an API requirement

   Glen: use cases: 1. listen to yourself, 2. re-reco with same
   service, 3. re-reco with a different service

   robert: 3 is important because one request could be setup with one
   grammar, another could be with another grammar, app could use output
   of step 1 to figure out the correct set of grammars for the second
   step
   ... could add a rereco method which takes in a set of parameters for
   the second reco
   ... would be doing all this in one thread with event handlers and
   won't have time to do async stuff
   ... e.g. a local search app with a coarse grammar identifying
   states, cities and based on the result decide which granular grammar
   to use for the neighbourhood

   Milan: the issue is whether the second reco takes place in the same
   service. if it does then that service can perform a rereco. only if
   using a different service it is a problem

   burn: another use case - compliance. may be a need for the client to
   say i want to save these recos to get to them later. the only module
   which can identify is the service

   bringert: could be proprietary extensions

   mbodell: all use cases are solvable if we keep the uri as is and
   mention it is only for identifying the audio and not download audio
   content

   burn: not talking about rereco, only for client to identify sessions

   bringert: is it realistic to have all implementors to keep this
   stored all the time?

   robert: why need to get the recording back?

   burn: client doesn't do all of the endpointing, only recognizer
   knows what it got. For compliance you may need an entire recording
   and sometimes need to know specifically what was heard.

   bringert: e.g. calling stock broker and say sell, then i sue them
   for selling and they prove i actually said it?

   burn: yes

   robert: way to solve is that this is a specialized app and have the
   service provider record all audio anyway and provide a session id to
   client.
   ... really hard to solve all such use cases
   ... we can just provide a way to tag the session

   bringert: could solve session id and rereco by returning an opaque
   session id in the reco result, which can be passed up as a parameter

   burn: happy with that if we also have an api to get the audio in the
   client for the session

   robert: don't understand why end user needs to listen to what
   recognizer heard
   ... speech service could provide an orthogonal api for fetching all
   data for a given session id

   bringert: this is quite common and we do it for debugging, not for
   end users

   burn: not sure that end user will need it, ui/mic tuning can be done
   offline

   mbodell: helpful if audio can be obtained easily without doing
   something complicated. another use case - smart answering machine
   which transcribes and falls back to the recorded audio if dictation
   wasn't successful

   robert: what is the logic for such a webapp?

   bringert: capture audio, send to server and cache locally, if
   response is fine send as email and otherwise send captured audio
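
   A minimal sketch of the fallback logic bringert outlines, assuming
   the web app has already cached the captured audio and received a
   (possibly empty) transcript. The type names, the confidence field and
   the 0.5 threshold are hypothetical and not part of any proposal in
   these minutes.

```typescript
// Hypothetical types for illustration; none of these names come from the
// IDL under discussion.
interface Transcript {
  text: string;
  confidence: number; // assumed 0..1 score reported by the speech service
}

type Outgoing =
  | { kind: "text"; body: string }
  | { kind: "audio"; body: Uint8Array };

// Decide what to send: the transcript if dictation looks successful,
// otherwise the locally cached recording.
function chooseOutgoing(
  cachedAudio: Uint8Array,
  transcript: Transcript | null,
  minConfidence: number = 0.5 // arbitrary threshold, an assumption
): Outgoing {
  if (
    transcript !== null &&
    transcript.text.length > 0 &&
    transcript.confidence >= minConfidence
  ) {
    return { kind: "text", body: transcript.text };
  }
  return { kind: "audio", body: cachedAudio };
}
```

   Keeping the decision separate from capture and upload means the same
   check works whether the audio came from a media capture API or from
   the speech API itself.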

   mbodell: may want to listen to your audio before sending
   ... so should be easy to play back sent audio

   bringert: all of this can be done with media capture api

   robert: this is like a mic api and we decided earlier to avoid that

   bringert: so I propose we remove save/inputWaveformURI and instead
   add a sessionId in the response. Also add a way to pass this for
   rereco

   mbodell: makes sense for saveWaveformURI, inputWaveformURI is a
   different use case
   ... rereco is not the only use case. e.g. recognize something
   recorded a long time ago
   ... or audio stored elsewhere

   smaug: mediastream will allow that

   bringert: agree

   burn: requires the client fetch and process the file contents
   itself, turn into a stream and pass to the server

   <mbodell> s/robert: mediastream/Olli: mediastream/

   mbodell: has an issue with bandwidth usage

   bringert: having specific apis to tell one service to talk to
   another service/uri adds complexity and security

   mbodell: i don't buy both those reasons

   robert: there are security problems as we have 3 entities now and
   all have to share security context. it is possible to do out of band

   mbodell: if audio is in a private intranet could use mediastream api

   <burn> mbodell: but there is much audio that is publicly available
   and could be fetched directly

   bringert: is the use case like transcribing a youtube audio/video?
   why write that as a webapp instead of a service which fetches and
   transcribes once?
   ... doesn't seem like a web application, not efficient

   mbodell: similar to specifying a grammar, this may not be different
   than that

   bringert: yes they are similar, just that use case is a lot weaker
   and there are other ways to accomplish the same thing
   ... since more than one person would be interested in transcribing
   publicly available audio.

   mbodell: don't agree with that, easy to do if you own the service
   ... other protocols like MRCP already require such functionality.
   agree that there are other ways but that is the wrong optimisation.

   bringert: probably not a big concern, use case feels pointless and
   it's another feature but not hard to implement
   ... but there is the codec issue

   mbodell: could be figured out in protocol handshake

   robert: in protocol group it came to uLaw and PCM as required codecs

   mbodell: same discussion will happen in synthesis api so not unique
   to this context

   bringert: could use the same uri mechanism for rereco

   robert: what would be the header when fetching the uri, that'll
   specify the codec used?

   bringert: assume standard http response headers would have the mime
   type or audio contains magic bytes to tell what codec is used
   ... session id idea still stands and will be returned in the
   recognition result and request will take this id as an optional
   field. inputWaveformURI refers to a normal uri on the web
   ... though rereco can fail if the id goes stale or service doesn't
   support storing audio
   ... related boolean field present is 'saveForRereco' so webapp
   specifies in advance if it wants storing and rereco
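
   A minimal sketch of the two detection routes mentioned above (HTTP
   media type first, then magic bytes), assuming the audio arrives in a
   RIFF/WAVE container whose first chunk is "fmt ". The function names
   and the returned "wav/..." labels are hypothetical; format tags 1
   (PCM) and 7 (mu-law) are the standard WAVE format codes.

```typescript
// Reads bytes [from, to) as ASCII text.
function asciiAt(bytes: Uint8Array, from: number, to: number): string {
  let s = "";
  for (let i = from; i < to && i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// Returns the media type from the Content-Type header when it names an
// audio type; otherwise sniffs a WAVE header. Returns null if neither
// source is conclusive. The "wav/..." labels are made up for this sketch.
function guessAudioFormat(
  contentType: string | null,
  head: Uint8Array
): string | null {
  if (contentType !== null && contentType.toLowerCase().indexOf("audio/") === 0) {
    // Drop parameters such as "; rate=8000".
    return contentType.split(";")[0].trim().toLowerCase();
  }
  if (
    head.length >= 22 &&
    asciiAt(head, 0, 4) === "RIFF" &&
    asciiAt(head, 8, 12) === "WAVE" &&
    asciiAt(head, 12, 16) === "fmt " // assumes "fmt " is the first chunk
  ) {
    const tag = head[20] | (head[21] << 8); // little-endian format tag
    if (tag === 1) return "wav/pcm";
    if (tag === 7) return "wav/mulaw";
    return "wav/unknown";
  }
  return null;
}
```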

   <mbodell> Summary: remove saveWaveformURI; keep inputWaveformURI with
   normal URI/http semantics; add a session id (format unknown - URI
   that isn't necessarily a URL?) to the result; Add ability to rereco
   from session id

   robert: a counter proposal is to let service not send sessionId if
   it doesn't support saving audio
   ... and rereco could be done by saving audio locally with
   mediastream
   ... leave the flag as an optional optimisation.

   bringert: good point, the result could always return a sessionId and
   a separate flag 'savedForRereco' will be set to true if server
   supported that feature
   ... so sessionId is always present and can be used for logging etc
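
   A minimal sketch of the result and request shapes as proposed at this
   point in the discussion: sessionId always present, savedForRereco set
   only when the service retained the audio, and the request taking the
   id as an optional field. The interface names and the helper function
   are hypothetical; the field names follow the minutes.

```typescript
// Hypothetical shapes; no IDL in the minutes defines them this way.
interface RecoResult {
  sessionId: string;       // always present, usable for logging
  savedForRereco: boolean; // true only if the service retained the audio
}

interface RecoRequest {
  saveForRereco?: boolean; // web app asks in advance for audio to be kept
  sessionId?: string;      // optional: re-recognize audio from this session
}

// Returns a follow-up request that re-recognizes the saved audio, or null
// when the service did not retain it. On null the caller has to fall back,
// for example to audio it saved locally.
function buildRerecoRequest(prev: RecoResult): RecoRequest | null {
  if (!prev.savedForRereco) {
    return null;
  }
  return { sessionId: prev.sessionId };
}
```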

   mbodell: should the separate variable/flag be a boolean or some
   other token?

   bringert: could just use sessionId for referring to saved audio

   mbodell: useful to differentiate audio chunks in continuous reco,
   whereas sessionId could refer to the whole session

   robert: rereco should allow specifying a time range

   bringert: what if i get 2 results and i want to rereco the whole
   audio covering both results?

   robert: could specify time range in the rereco method
   ... between starting and finishing a recognition there is continuous
   recording of audio and you have an audio token. that might be
   different each time you cycle that request.

   bringert: what audio does it refer to? from start to stop?

   robert: yes

   bringert: for rereco could pass in audioId, start and stop
   ... rereco should be a separate method

   <smaug> terrible echo

   robert: doesn't think so, instead of using mic input should use
   saved audio
   ... same as starting reco in the normal case otherwise

   bringert: what do stop and abort mean if you start rereco

   robert: could call abort if result didn't come soon enough and you
   want to cancel

   bringert: this will need 3 new attributes, rerecognizeFromId,
   rerecognizeFromStart, rerecognizeFromEnd or could be an object with
   3 attributes

   michael: could also reuse inputWaveformURI

   Milan: are we saying 3 attributes are better than 1 new method?

   robert: better than having 2 ways to do reco, better way is to say
   where to get the audio from (local or saved)
   ... similar to what we have specified in the protocol api work

   satish: should we talk about the 2 new attributes added to the IDL?

   <robert>
   [13]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/att-0012/speech-protocol-draft-05.htm#reco-headers

     [13] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/att-0012/speech-protocol-draft-05.htm#reco-headers

   mbodell: sounds fine to me, need a way to specify continuous reco

   <robert>
   [14]http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86

     [14] http://example.com/retainedaudio/fe429ac870a?interval=0.3

   <robert> this is an example of a wave uri with time intervals:
   [15]http://example.com/temp44235.wav?interval=0.65,end

     [15] http://example.com/temp44235.wav?interval=0.65

   <mbodell> A different example might be:
   sessionid:foobar?interval=0.3,2.5

   <robert> and here's another:
   [16]http://example.com/retainedaudio/fe429ac870a?interval=0.3,2.86

     [16] http://example.com/retainedaudio/fe429ac870a?interval=0.3

   bringert: I'll go back on my earlier concern, seems fine to use the
   inputWaveformURI for rereco from an earlier session and recognizing
   from publicly accessible audio
   ... even for public URI should allow passing media fragments/time
   range

   burn: the URI should just be something that the service can access
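
   A minimal sketch of building and parsing the "interval=start,end"
   form used in the example URIs pasted above, assuming times are in
   seconds and that the literal "end" means the end of the audio. The
   type and function names are hypothetical, and the parameter syntax is
   only what the examples show, not a defined standard.

```typescript
interface AudioInterval {
  start: number;
  end: number | "end";
}

// Appends an interval to an audio or session URI.
function withInterval(baseUri: string, iv: AudioInterval): string {
  const sep = baseUri.indexOf("?") === -1 ? "?" : "&";
  return baseUri + sep + "interval=" + iv.start + "," + iv.end;
}

// Extracts a trailing interval parameter, or returns null if the URI has
// none or the numbers are malformed.
function parseInterval(uri: string): AudioInterval | null {
  const m = /[?&]interval=([0-9.]+),([0-9.]+|end)$/.exec(uri);
  if (m === null) {
    return null;
  }
  const start = Number(m[1]);
  const end: number | "end" = m[2] === "end" ? "end" : Number(m[2]);
  if (isNaN(start) || (end !== "end" && isNaN(end))) {
    return null;
  }
  return { start: start, end: end };
}
```

   The same helpers work for an http URI and for a session-style
   identifier such as sessionid:foobar, since only the trailing
   parameter is inspected.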

   bringert: for continuous reco, have we talked about how results
   would be received?

continuous reco attribute

   mbodell: we have a simple proposal and satish sent one for complex
   scenario, should discuss both

   robert: which is the simple proposal?

   bringert: probably the last one I sent to the mailing list
   ... sent on Aug 25, subject 'web api discussion in today's call'

   <mbodell>
   [17]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html

     [17] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html

   <bringert> satish's proposal for results API for continuous reco:
   [18]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html

     [18] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html

filtering offensive words attribute

   bringert: 2 situations - filtering from language model so '***' gets
   recognized as 'duck' and the second one could send results back as
   'f***'
   ... for first could just choose a different grammar

   robert: why can't we use grammar for both?
   ... could even be a 'builtin:dictation?noOffensiveWords'

   Glen: this feels like a user selection
   ... than a website selectable setting

   mbodell: this is the mechanism to communicate this setting to the
   service

   bringert: the problem is about misrecognizing something as offensive
   words - even random noise gets recognized as an offensive word

   glen: agree that grammar could be the mechanism but should the web
   app specify it or should the UA?

   burn: agree with glen, happens to me all the time with autocorrect
   and if it annoys me I turn it off
   ... this is something the browser should provide as a setting and
   not the web app

   mbodell: if i'm in an adult site it is not useful to send a flag to
   speech service saying don't send me back naughty words

   bringert: as an example, we have a global flag on android to not
   return offensive words. there seem to be users who don't mind
   offensive words and those who don't want them

   burn: users may be willing to input offensive words in some sites
   and not in some

   satish: e.g. you never want to send offensive words in an office
   email web app

   glen: we may need both, as a user setting and a web app setting

   robert: grammar should be enough

   glen: if using a custom grammar you are defining your own words

   bringert: UA could do it like how it does spell check and only pass
   sanitized results to the web app if it wants

   mbodell: so conclusion is to leave it out of the IDL
   ... and allow a way to pass a hint via the grammar
   ... something like 'builtin:dictation?noOffensiveWords'
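
   A minimal sketch of the two mechanisms discussed: a grammar URI that
   carries the hint to the service, and UA-side masking of results in
   the style of the 'f***' example. The function names and the block
   list are hypothetical, and plain 'builtin:dictation' as the
   unfiltered form is an assumption.

```typescript
// Builds the grammar URI used as a hint to the service. The query flag is
// the spelling floated in the discussion, not a defined standard.
function dictationGrammar(noOffensiveWords: boolean): string {
  return noOffensiveWords
    ? "builtin:dictation?noOffensiveWords"
    : "builtin:dictation";
}

// UA-side masking: keeps the first letter and stars out the rest of any
// word found in `blocked` (compared case-insensitively).
function maskOffensive(text: string, blocked: string[]): string {
  return text
    .split(" ")
    .map(function (word) {
      if (blocked.indexOf(word.toLowerCase()) === -1) {
        return word;
      }
      // new Array(n).join("*") yields n - 1 stars.
      return word.charAt(0) + new Array(word.length).join("*");
    })
    .join(" ");
}
```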
Received on Wednesday, 28 September 2011 22:39:20 UTC