[minutes] 2 June 2011 telecon

Group,

The minutes from last week's call are available at http://www.w3.org/2011/06/02-htmlspeech-minutes.html 
.

For convenience, a text version is embedded below.

Thanks to Michael Johnston for taking the minutes!

-- dan

****************************************************************************************************

Attendees

    Present
           Dan_Burnett, Milan_Young, Marc_Schroeder, Robert_Brown,
           Patrick_Ehlen, Charles_Hemphill, Satish_Sampath, Glen_Shires,
           Michael_Johnston, Olli_Pettay, Michael_Bodell,
           Bjorn_Bringert, Dan_Druta, Debbie_Dahl, Raj_Tumuluri

    Regrets
    Chair
           Dan Burnett

    Scribe
           Michael_Johnston

Contents

      * [4]Topics
          1. [5]updated final report document
          2. [6]agreed upon design decisions
          3. [7]additional issues to add to list of issues
          4. [8]markup binding
          5. [9]crucial decisions partially discussed
          6. [10]do we support audio streaming and how?
          7. [11]What is meant by "start of speech", "end of speech",
             and endpointing in general? How do transmission delays
             affect the definitions and what we want in terms of APIs?
      * [12]Summary of Action Items
      _________________________________________________________


    <burn_> Agenda:
    [13]http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun
    /0006.html

      [13] http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Jun/0006.html

    burn: start with review of face to face minutes, will review again
    next week
    ... comments on minutes

updated final report document

    burn: comments on draft at this point?

    all: silence

agreed upon design decisions

additional issues to add to list of issues

    michael: does the move to have the emma document in the dom remove
    the impetus for a json variant of emma?

    bjorn: have a simple javascript api for accessing the most common
    elements, so we don't need a json variant of emma; for details can
    access the emma object

    milan: need to do xml parsing?

    bodell: will be much the same as other http requests that return
    xml; don't need to parse it by hand

    milan: are mobile devices a problem, given the verbosity of xml?

    bodell: no

    dan: is there pressure from this group to build a json version of
    emma?

    all: agreement: no push for json version of emma
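
    The conclusion above, that the XML form is easy enough to consume
    directly, can be sketched with stdlib parsing. The sample result
    below is hypothetical, but the namespace and the emma:confidence /
    emma:tokens annotations are from the EMMA 1.0 recommendation:

```python
# Hypothetical sketch: pulling the most common fields out of an EMMA
# result with stdlib XML parsing (no JSON variant of EMMA needed).
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# A minimal EMMA 1.0 document of the kind a speech service might return.
emma_xml = """<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation id="int1" emma:confidence="0.9"
      emma:tokens="flights to boston">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>"""

root = ET.fromstring(emma_xml)
# namespaced tags/attributes expand to {uri}local form in ElementTree
interp = root.find(f"{{{EMMA_NS}}}interpretation")
confidence = float(interp.get(f"{{{EMMA_NS}}}confidence"))
tokens = interp.get(f"{{{EMMA_NS}}}tokens")
destination = interp.find("destination").text
```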

    burn: any other issues to add to list for discussion

markup binding

    bjorn: no feedback from chrome team yet

    bodell: keep the html binding lightweight: a js constructor and a
    simple "for" mechanism; small work to define, and if we don't want
    it then remove the element
    ... should not mess up the js api

    olli: problem with the for attribute is what it can point to, which
    elements can be used as targets; doesn't quite work with content
    editable, an important use case
    ... clarifies issue: need to make clear which elements can be
    targets and what the semantics are
    ... also content editable areas

    michael: have to define semantics when target is e.g. a drop down or
    radio button

    olli: may be new kinds of elements also

    bodell: assumption would be to bind to any element, but they would
    not all have to work

    bodell: some browsers would want to handle more input types

    olli: reco would be an element in the dom; what is the benefit of
    the reco element
    ... if for is not used

    bodell: google desires an element with a microphone click api

    bjorn: have proposed several things along the way,
    ... most important aspect is to have an element you can click to
    start speaking without the pop up or info bar

    olli: not clear what the element gives

    robert: follow up with chrome folks

    bjorn: still waiting on that
    ... do agree that html element discussion does not block the js api
    discussion

    olli: issue may get solved along the way

    burn: need to see concrete proposal to make decision

crucial decisions partially discussed

    <marc>
    [14]http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes20
    1105.html

      [14] http://www.w3.org/2005/Incubator/htmlspeech/2011/05/f2fminutes201105.html

    burn: will go through each ...

    bjorn: audio capture topic is dealt with; should be a default way,
    and if there is an audio capture api we will deal with it then

    burn: audio codecs mandatory

    robert: even the IP status around speex is unclear
    ... are the only reasonable answers pcm and mulaw, despite their
    flaws?

    bjorn: flac, high bandwidth

    <bringert> FLAC

    <bringert>
    [15]http://en.wikipedia.org/wiki/Free_Lossless_Audio_Codec

      [15] http://en.wikipedia.org/wiki/Free_Lossless_Audio_Codec

    milan: speex is in an ietf draft on how to package it in rtp

    <Milan> [16]http://tools.ietf.org/html/draft-ietf-avt-rtp-speex-07

      [16] http://tools.ietf.org/html/draft-ietf-avt-rtp-speex-07

    burn: having an rtp way to send it does not mean there are no ip
    issues

    bjorn: need to require some codecs or cant be interoperable

    milan: the problems sound similar

    bodell: rfc does not have a patent policy

    burn: if something is necessary to implement the spec, and it is
    encumbered with IP, need to make that clear

    bjorn: need protocol for interoperability

    milan: protocol for RTC

    burn: opus: codecs from two organizations trying to blend; not clear
    if the IP issues are being resolved; making a container
    ... can use either one if you have permission
    ... don't have an answer yet, really need one; industry-wide
    problem, may not be ours to solve; return to this

    <mbodell> See
    [17]http://lists.xiph.org/pipermail/speex-dev/2003-November/000753.h
    tml for some similar discussion on patent of speex

      [17] http://lists.xiph.org/pipermail/speex-dev/2003-November/000753.html

    robert: will follow up re: speex again

    milan: impact on protocol team if need to negotiate codec

    marc: speex is not good enough for tts

    <marc> ogg vorbis

    <marc> and FLAC

    burn: few names as candidates flac, ogg vorbis, speex, pcm

    bjorn: already use flac in launched clients

    olli: ogg vorbis is core html audio

    <smaug> not core HTML audio. Some browsers just happen to support it

    burn: candidates to consider flac, ogg vorbis, speex, pcm

do we support audio streaming and how?

    burn: think we expect streaming, less clarity on how

    milan: sending audio on regular time intervals as it is collected or
    generated

    bjorn: discussed how to get events while capturing
    ... how it is done is a protocol question

    burn: asr may begin before the user is
    ... finished speaking; a result may come from the engine before the
    user is done

    milan: without regularly timed packets, won't get events at regular
    intervals

    bjorn: latency is what is observable to the app

    bodell: having multiple events is not a big problem
    ... data in events can deal with timing

    milan: if the app is realtime, getting an event five seconds after
    it happened is a problem

    bjorn: agree, what we need is low latency; not sure what we can
    require, it is part of being a good implementation

    burn: market takes care of product requirements

    robert: fair to say that standard should not have inherent
    limitations
    ... 50 ms or so is the threshold

    bjorn: protocol design should not make it impossible to achieve low
    latency event delivery

    marc: audio streaming in the tts case?
    ... send audio while still rendering the rest of a long utterance

    bodell: tts is generally fast enough that this is not a problem

    marc: if tts has to process all the text before returning audio,
    that could be a problem
    ... wants to make sure that what we create here does not prevent an
    implementation from doing this

    bjorn: up to engine whether it starts to synthesize

    marc: in the wav format the header has the filesize, which makes
    proper streaming difficult

    bjorn: protocol should make it possible for the tts to be streamed
    and start playing before
    ... synthesis is complete
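
    Marc's point about the wav header is concrete: a canonical RIFF/WAV
    header carries two length fields, so a writer must either know the
    total audio length up front or patch the header afterwards, which is
    awkward when streaming tts output as it is rendered. A minimal
    sketch of the header layout:

```python
# Sketch of a canonical 44-byte PCM WAV header. Both the RIFF chunk size
# and the data chunk size must be filled in, which is why plain WAV is a
# poor fit for streaming; streaming writers typically write a placeholder
# and patch it later, or use a container without up-front sizes.
import struct

def wav_header(num_samples, sample_rate=16000, channels=1, bits=16):
    """Build a 44-byte PCM WAV header for a known-length clip."""
    block_align = channels * bits // 8
    data_size = num_samples * block_align
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size,    # total size: must be known up front
        b"WAVE", b"fmt ", 16,
        1, channels,                # PCM format tag, channel count
        sample_rate,
        sample_rate * block_align,  # byte rate
        block_align, bits,
        b"data", data_size)         # data length: must be known up front

header = wav_header(16000)  # one second of 16 kHz mono 16-bit audio
```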

    burn: issue of supporting a format coming back in video
    ... and playing the audio

    bjorn: should not require playing audio from video

    robert: api should not prevent this

    burn: video with three audio tracks, how does the api select?

    robert: our proposal separated capture api from reco, could support
    different kinds of capture

    burn: protocol design should not preclude streaming of video codecs

    raj: why specify video?

    robert: if codec can be packetized in real time should be ok

    burn: the protocol should not inhibit the transmission of codecs
    that have similar requirements to audio?

What is meant by "start of speech", "end of speech", and endpointing in
general? How do transmission delays affect the definitions and what we
want in terms of APIs?

    bodell: issue of latency impacting times

    bjorn: agreed on the UA being the basis for the clock

    burn: don't have requirements for timing info from the server

    bodell: tts case?

    bjorn: seems reasonable for server to include timing info

    robert: could do offset from start

    burn: something that the UA can convert into a UA-local timestamp
    ... different ways to achieve that
    ... doesn't say what is made available in the api
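
    Robert's "offset from start" idea can be sketched as follows: the
    service reports event times as offsets from the start of the audio
    it was sent, and the UA, which knows the local clock time at which
    capture began, converts each offset to its own clock. Names here are
    illustrative, not from any agreed API:

```python
# Hypothetical sketch: converting service-reported offsets (seconds from
# the start of the audio stream) into UA-local timestamps.
def to_ua_timestamps(capture_start_local, service_events):
    """Map (name, offset-from-audio-start) pairs onto the UA's clock."""
    return [(name, capture_start_local + offset)
            for name, offset in service_events]

# capture began at UA-local time 1000.0 s; service reports two offsets
events = to_ua_timestamps(1000.0, [("speechstart", 0.35),
                                   ("speechend", 2.10)])
```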

    bodell: many different times: when the utterance started, etc.

    bodell: when it was received

    marc: impact on order that events are received

    milan: will the UA generate these events when using a remote
    service?

    bodell: may assume energy detector gives you end of speech, before
    reco gives end of speech, hard to guarantee order

    milan: start of energy is different from start of speech
    ... hard to write a web app if you get two start-of-speech events

    bodell: different events

    bodell: there was a fixed order for the non-continuous case

    charles: could arrange fixed order delivery, even if times inside do
    not reflect this

    bodell: not practical to hold events and put them in the desired
    order

    burn: energy detector gets end of sound, then we will get the actual
    end of speech with better timing info; either get two events or
    throw away the better info

    marc: dont want to override better info from remote service

    burn: the front-end detector is for optimization, so we don't have
    to send all the audio

    bodell: events could be in different orders
    ... not convinced in having standard order

    milan: the UA should only have sound start and sound end
    ... avoid duplication

    bodell: already have different event names

    robert: the names need to make clear that some events are from the
    energy detector and others are from speech reco

    milan: source of events

    bodell: remove the statement about specific ordering

    milan: new statement that the events the user agent can insert are
    the energy-related events

    marc: and probably capture start and end

    charles: seems too strong, since the speech service might or might
    not be remote

    burn: removed ordering
    ... energy detector can only generate sound start/stop

    burn: speech service can only deliver the speech start/stop

    charles: if there is no order, can delivery be guaranteed?

    burn: how to guarantee it?

    milan: as long as have single source for events

    michael: (need a blackboard for this)

    bodell: solved by removing required ordering
    ... allows all the use cases
    ... also works with continuous case
    ... thought had solved the issue

    burn: but start before end?
    ... can get end without having seen a start

    milan: reluctant to give up the ordering, if have single source for
    each type of event

    burn: agreed the speech service can only generate one; can't
    guarantee that they won't cross in time

    milan: use the remote speech service as the canonical source

    bodell: the easiest cross to understand is for end events; the UA
    would raise both events in the order they occurred

    milan: it is possible to impose an ordering
    ... pros and cons, flexibility, or predictability for the web app
    developer

    bjorn: events from the same source should be in the same order
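
    The resolution discussed above can be sketched as follows: no fixed
    global ordering is required across sources, but events from the same
    source (the UA's energy detector vs. the speech service) are
    delivered in the order that source produced them. Delivering events
    in arrival order gives exactly that guarantee, even when events from
    different sources cross in time. The event names and the crossing
    below are illustrative, not from any agreed spec:

```python
# Hypothetical delivery sequence at the UA: the service's end-of-speech
# event arrives before the local energy detector's end-of-sound event
# (a cross), yet each source's own events stay in order.
events_in_arrival_order = [
    ("energy",  "soundstart"),
    ("service", "speechstart"),
    ("service", "speechend"),   # service's end event arrives first...
    ("energy",  "soundend"),    # ...crossing the energy detector's end
]

def per_source_order(events, source):
    """Names of the events a given source delivered, in delivery order."""
    return [name for src, name in events if src == source]
```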

Received on Wednesday, 8 June 2011 00:09:35 UTC