Re: Feedback to the DAP group on the topic of audio/media capture needed for HTML+Speech from Rich Tibbett on 2011-01-17 (public-device-apis@w3.org from January 2011)

From: Rich Tibbett <richt@opera.com>
Date: Mon, 17 Jan 2011 09:07:34 +0100
To: Michael Bodell <mbodell@microsoft.com>
CC: "public-device-apis@w3.org" <public-device-apis@w3.org>, "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Message-ID: <4D33F8C6.4020000@opera.com>

Hi Michael,

Interesting stuff. Perhaps I can have a go at clarifying how I see all 
these pieces fitting together going forward...

Michael Bodell wrote:
> On today’s Hypertext Coordination Group Teleconference the issue of
> “Audio on the Web” was discussed (see minutes:
> http://www.w3.org/2011/01/14-hcg-minutes.html) and I was given the
> action item of contacting the DAP group to provide feedback about audio
> capture. We in the HTML Speech XG
> (http://www.w3.org/2005/Incubator/htmlspeech/) have been discussing use
> cases, requirements, and some proposals around speech enabled html pages
> and the need for the audio to be captured and recognized in real time
> (i.e., in a streaming fashion, not in a file upload fashion). We
> recognize that there are interesting security and privacy concerns with
> supporting this necessary functionality.
>
> The HTML Speech XG has currently finished with requirements gathering,
> and is in the process of prioritizing these requirements. Our
> requirements document is at
> http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html.
> There are a large number (almost half) of our requirements that might be
> of particular note to the audio capture process. I’ve tried to pull out
> and organize the requirements most relevant to the DAP audio capture:
>
> · Requirements about where the audio is streamed to:
>
> o FPR12. Speech services that can be specified by web apps must include
> network speech services
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr12]
>
> o FPR32. Speech services that can be specified by web apps must include
> local speech services.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr32]
>
> · Requirements about the audio streams and the fact that it needs to be
> streamed:
>
> o FPR33. There should be at least one mandatory-to-support codec that
> isn't encumbered with IP issues and has sufficient fidelity & low
> bandwidth requirements.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr33]
>
> o FPR25. Implementations should be allowed to start processing captured
> audio before the capture completes.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr25]
>
> o FPR26. The API to do recognition should not introduce unneeded
> latency.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr26]
>
> o FPR56. Web applications must be able to request NL interpretation
> based only on text input (no audio sent).
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr56]
>
> · Requirements about what must be possible while streaming (i.e.,
> getting midstream events in a timely fashion without cutting off the
> stream; being able to decide to cut off the stream mid request; being
> able to reuse the stream):
>
> o FPR40. Web applications must be able to use barge-in (interrupting
> audio and TTS output when the user starts speaking).
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr40]
>
> o FPR21. The web app should be notified that capture starts.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr21]
>
> o FPR22. The web app should be notified that speech is considered to
> have started for the purposes of recognition.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr22]
>
> o FPR23. The web app should be notified that speech is considered to
> have ended for the purposes of recognition.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr23]
>
> o FPR24. The web app should be notified when recognition results are
> available.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr24]
>
> o FPR57. Web applications must be able to request recognition based on
> previously sent audio.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr57]
>
> o FPR59. While capture is happening, there must be a way for the web
> application to abort the capture and recognition process.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr59]
>
> · Requirements around the UI/API/Usability of speech/audio capture:
>
> o FPR42. It should be possible for user agents to allow hands-free
> speech input.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr42]
>
> o FPR54. Web apps should be able to customize all aspects of the user
> interface for speech recognition, except where such customizations
> conflict with security and privacy requirements in this document, or
> where they cause other security or privacy problems.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr54]
>
> o FPR13. It should be easy to assign recognition results to a single
> input field.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr13]
>
> o FPR14. It should not be required to fill an input field every time
> there is a recognition result.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr14]
>
> o FPR15. It should be possible to use recognition results to multiple
> input fields.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr15]
>
> · Requirements around privacy and security concerns:
>
> o FPR16. User consent should be informed consent.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr16]
>
> o FPR20. The spec should not unnecessarily restrict the UA's choice in
> privacy policy.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr20]
>
> o FPR55. Web application must be able to encrypt communications to
> remote speech service.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr55]
>
> o FPR1. Web applications must not capture audio without the user's
> consent.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr1]
>
> o FPR17. While capture is happening, there must be an obvious way for
> the user to abort the capture and recognition process.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr17]
>
> o FPR18. It must be possible for the user to revoke consent.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr18]
>
> o FPR37. Web application should be given captured audio access only
> after explicit consent from the user.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr37]
>
> o FPR49. End users need a clear indication whenever microphone is
> listening to the user.
> [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr49]
>
> We would be happy to discuss the details and context behind any of these
> requirements, and we’d also appreciate any feedback on our use cases and
> requirements. I’m sure many of these are requirements the DAP group is
> already considering, but the speech use cases may well add some
> additional requirements that may not have yet been considered as part of
> the capture work.
>
> The HTML Speech XG is also in the process of collecting proposals for
> our Speech API which we are planning to finish by the end of February.
> In our discussions to date, we have reviewed and discussed some of the
> DAP capture API as well as some of the work that has gone on around the
> <device> tag proposals (We reviewed and discussed at least
> http://www.w3.org/TR/html-media-capture/ and
> http://www.w3.org/TR/media-capture-api/ and Robin provided the following
> links to more in progress work in the htcg call
> http://dev.w3.org/2009/dap/camera/ and
> http://dev.w3.org/2009/dap/camera/Overview-API.html). In general I’d
> characterize our discussions as we would be extremely happy if we could
> reuse the DAP work, and would be happy to work with you on having
> proposals that meet this need. To date in our review the large issue has
> been the streaming issue where the capture API is nearly useless to us
> if it doesn’t support streaming. But happily from today’s htcg call it
> sounds like DAP is actively working on streaming so we strongly support
> that work direction, think it is extremely important, and will be
> interested to see any and all work in that direction.

The Media Capture specs (HTML and API) are certainly intended for 
on-demand capture of audio and video files rather than the streaming use 
case.

AFAICS, your requirements are perhaps most suited to work around the 
<device> element [1] and the currently informal RTC (Real-Time 
Communication) Web group [2]. In this group you will find a number of 
initial discussions that are happening around user-generated streaming 
audio (from microphone) and video (from webcam) for the web. This group 
is in the process of formalizing both an IETF and W3C Charter which I'm 
sure will be of interest to you going forward. I believe streaming use 
cases will fall under this dedicated charter.

>
> I’m not sure what the most productive next steps are for us to take (email
> discussion back and forth, some HTML Speech XG members come to a DAP
> audio capture conference call, some DAP members come to a Speech XG
> teleconference, or something else). In general, the HTML Speech XG tries
> to do most of our work over the public email alias and we also have a
> schedule-as-needed Thursday teleconference time for 90 minutes starting
> at noon New York time.

The streaming requirements you mentioned above are very useful but could 
they initially be considered out of scope for the Speech XG?

An alternative, simpler jumping-off point may be to pivot around the 
<audio> element, since the result of any local audio streaming will 
almost certainly be piped through such an element for playback and/or 
manipulation by the current page. The only missing piece is getting 
microphone data into such an element. That work is likely to take 
considerably longer than your own road map allows, but it is highly 
likely to interface with HTML elements such as <audio>.

Starting from the <audio> element has the added benefit of reusing the 
existing APIs being considered in the Audio Incubator Group [3] and then 
being able to provide Speech APIs as appropriate from that base. The 
Speech APIs may then also apply to a wide range of additional 
content, streamed not only from the user's device but also from 
pre-recorded files and the general web (which I believe would also help 
with an initial proof of concept for testing in today's browsers).
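
As a rough illustration of that source-independent idea, here is a 
minimal sketch in TypeScript. It is not taken from any DAP, WHATWG, 
RTC-Web or Speech XG proposal: the names AudioSource, CaptureEvent, 
fromChunks and processStream, the pull-style chunk interface, and the 
energy threshold are all assumptions made up for this example. It only 
shows that a consumer which processes chunks as they arrive (FPR25), 
reports capture and speech boundaries (FPR21 to FPR23) and can be 
aborted mid-stream (FPR59) does not need to care whether the chunks 
come from a microphone or a pre-recorded file.

```typescript
// Hypothetical sketch only: none of these names exist in any spec or browser.

// Events loosely modelled on FPR21 (capture starts), FPR22/FPR23 (speech
// starts/ends) and FPR59 (the web app aborts capture).
type CaptureEvent = "capturestart" | "speechstart" | "speechend" | "aborted";

// Any audio source: returns the next chunk of samples, or null when it ends.
// A microphone, a pre-recorded file or a network stream could all fit this.
type AudioSource = () => number[] | null;

// Wrap pre-recorded chunks as a source (stand-in for a real capture device).
function fromChunks(chunks: number[][]): AudioSource {
  let i = 0;
  return () => (i < chunks.length ? chunks[i++] : null);
}

// Process chunks as they arrive instead of waiting for capture to complete.
function processStream(
  source: AudioSource,
  onEvent: (e: CaptureEvent) => void,
  shouldAbort: () => boolean = () => false,
  threshold: number = 0.1 // assumed toy value, not a recommended setting
): void {
  onEvent("capturestart");
  let speaking = false;
  for (let chunk = source(); chunk !== null; chunk = source()) {
    if (shouldAbort()) {
      onEvent("aborted");
      return;
    }
    // Mean absolute amplitude as a toy stand-in for real speech detection.
    const energy =
      chunk.reduce((sum, x) => sum + Math.abs(x), 0) / Math.max(chunk.length, 1);
    if (!speaking && energy >= threshold) {
      speaking = true;
      onEvent("speechstart");
    } else if (speaking && energy < threshold) {
      speaking = false;
      onEvent("speechend");
    }
  }
  // The source ended while speech was still in progress.
  if (speaking) onEvent("speechend");
}

// Usage: quiet chunk, two loud chunks, then silence again.
const events: CaptureEvent[] = [];
const prerecorded = fromChunks([[0, 0.01], [0.5, -0.4], [0.3, 0.6], [0, 0]]);
processStream(prerecorded, (e) => events.push(e));
```

Swapping fromChunks for a source backed by live capture would leave 
processStream unchanged, which is the point of starting from a common 
base rather than from the capture mechanism itself.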

The other groups you will want to be talking with include WHATWG, RTCWEB 
and DAP. All of these groups are very much working in these areas and I 
would encourage submission of your requirements to at least the RTCWEB 
initiative for consideration in their chartering process. I would then 
also suggest re-basing your requirements around the <audio> element 
as I described briefly above.

>
> Thanks, and look forward to working on this important topic with you!

I'll leave the chairs/staff contacts to clarify what can be done but 
hope we can work with you on this also.

This is just my initial 2 cents to help better frame this discussion.

- Rich

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/commands.html#devices
[2] http://rtc-web.alvestrand.com/
[3] http://www.w3.org/2005/Incubator/audio/wiki/Main_Page

-- 
Rich Tibbett
Opera Software ASA
Received on Monday, 17 January 2011 08:08:13 UTC