Feedback to the DAP group on the topic of audio/media capture needed for HTML+Speech

On today's Hypertext Coordination Group teleconference the issue of "Audio on the Web" was discussed (see minutes: http://www.w3.org/2011/01/14-hcg-minutes.html), and I was given the action item of contacting the DAP group to provide feedback about audio capture.  We in the HTML Speech XG (http://www.w3.org/2005/Incubator/htmlspeech/) have been discussing use cases, requirements, and some proposals around speech-enabled HTML pages and the need for audio to be captured and recognized in real time (i.e., in a streaming fashion, not in a file-upload fashion).  We recognize that supporting this necessary functionality raises interesting security and privacy concerns.

The HTML Speech XG has finished requirements gathering and is now in the process of prioritizing those requirements.  Our requirements document is at http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html.  A large number of our requirements (almost half) may be of particular relevance to the audio capture process.  I've tried to pull out and organize the requirements most relevant to the DAP audio capture work:


*         Requirements about where the audio is streamed to:

o   FPR12. Speech services that can be specified by web apps must include network speech services [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr12]

o   FPR32. Speech services that can be specified by web apps must include local speech services. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr32]

*         Requirements about the audio streams and the fact that the audio needs to be streamed:

o   FPR33. There should be at least one mandatory-to-support codec that isn't encumbered with IP issues and has sufficient fidelity & low bandwidth requirements. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr33]

o   FPR25. Implementations should be allowed to start processing captured audio before the capture completes. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr25]

o   FPR26. The API to do recognition should not introduce unneeded latency. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr26]

o   FPR56. Web applications must be able to request NL interpretation based only on text input (no audio sent). [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr56]

*         Requirements about what must be possible while streaming (i.e., getting mid-stream events in a timely fashion without cutting off the stream; being able to cut off the stream mid-request; being able to reuse the stream):

o   FPR40. Web applications must be able to use barge-in (interrupting audio and TTS output when the user starts speaking). [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr40]

o   FPR21. The web app should be notified that capture starts. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr21]

o   FPR22. The web app should be notified that speech is considered to have started for the purposes of recognition. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr22]

o   FPR23. The web app should be notified that speech is considered to have ended for the purposes of recognition. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr23]

o   FPR24. The web app should be notified when recognition results are available. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr24]

o   FPR57. Web applications must be able to request recognition based on previously sent audio. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr57]

o   FPR59. While capture is happening, there must be a way for the web application to abort the capture and recognition process. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr59]

*         Requirements around the UI/API/Usability of speech/audio capture:

o   FPR42. It should be possible for user agents to allow hands-free speech input. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr42]

o   FPR54. Web apps should be able to customize all aspects of the user interface for speech recognition, except where such customizations conflict with security and privacy requirements in this document, or where they cause other security or privacy problems. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr54]

o   FPR13. It should be easy to assign recognition results to a single input field. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr13]

o   FPR14. It should not be required to fill an input field every time there is a recognition result. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr14]

o   FPR15. It should be possible to use recognition results to multiple input fields. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr15]

*         Requirements around privacy and security concerns:

o   FPR16. User consent should be informed consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr16]

o   FPR20. The spec should not unnecessarily restrict the UA's choice in privacy policy. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr20]

o   FPR55. Web application must be able to encrypt communications to remote speech service. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr55]

o   FPR1. Web applications must not capture audio without the user's consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr1]

o   FPR17. While capture is happening, there must be an obvious way for the user to abort the capture and recognition process. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr17]

o   FPR18. It must be possible for the user to revoke consent. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr18]

o   FPR37. Web application should be given captured audio access only after explicit consent from the user. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr37]

o   FPR49. End users need a clear indication whenever microphone is listening to the user. [http://www.w3.org/2005/Incubator/htmlspeech/live/requirements.html#fpr49]
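
To make the streaming requirements above a little more concrete (FPR21, FPR22, FPR23, FPR25, and FPR59), here is a minimal TypeScript sketch of a capture session that forwards each audio chunk as soon as it arrives and notifies the web app of mid-stream events.  This is only an illustration, not a proposal from the XG or the DAP group: the class name, the event names, the energy threshold, and the `sink` callback (standing in for a local or network speech service) are all assumptions made up for the example.

```typescript
// Hypothetical event names loosely modelled on FPR21-FPR23 and FPR59.
// None of these names come from a real or proposed API.
type CaptureEvent = "capturestart" | "speechstart" | "speechend" | "aborted";

class StreamingCaptureSession {
  // Events delivered to the web app, in the order they occurred.
  readonly events: CaptureEvent[] = [];
  private speaking = false;
  private active = false;
  private sink: (chunk: number[]) => void;
  private threshold: number;

  // `sink` stands in for a speech service; `threshold` is an assumed
  // mean-square energy level above which a chunk counts as speech.
  constructor(sink: (chunk: number[]) => void, threshold: number = 0.1) {
    this.sink = sink;
    this.threshold = threshold;
  }

  start(): void {
    this.active = true;
    this.events.push("capturestart"); // FPR21: capture has started
  }

  // Called once per captured chunk. The chunk is forwarded immediately
  // rather than buffered until capture completes (FPR25).
  push(chunk: number[]): void {
    if (!this.active) return; // nothing is streamed after an abort
    this.sink(chunk);
    const energy =
      chunk.reduce((sum, s) => sum + s * s, 0) / Math.max(chunk.length, 1);
    if (!this.speaking && energy >= this.threshold) {
      this.speaking = true;
      this.events.push("speechstart"); // FPR22: speech considered started
    } else if (this.speaking && energy < this.threshold) {
      this.speaking = false;
      this.events.push("speechend"); // FPR23: speech considered ended
    }
  }

  // FPR59: the web app can abort capture and recognition mid-stream.
  abort(): void {
    if (!this.active) return;
    this.active = false;
    this.events.push("aborted");
  }
}

// Usage: three chunks are streamed, then the app aborts.
const received: number[][] = [];
const session = new StreamingCaptureSession((c) => { received.push(c); });
session.start();
session.push([0, 0]);      // silence
session.push([0.5, -0.5]); // speech (mean-square energy 0.25)
session.push([0, 0]);      // silence again
session.abort();
session.push([0.9, 0.9]);  // ignored: capture was aborted
```

The point of the sketch is the shape of the interaction: events arrive while the stream is still open, and the app can cut the stream off at any time.  A file-upload style of capture cannot provide either.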

We would be happy to discuss the details and context behind any of these requirements, and we'd also appreciate any feedback on our use cases and requirements.  I'm sure many of these are requirements the DAP group is already considering, but the speech use cases may well add some requirements that have not yet been considered as part of the capture work.

The HTML Speech XG is also in the process of collecting proposals for our Speech API, which we are planning to finish by the end of February.  In our discussions to date, we have reviewed and discussed some of the DAP capture API as well as some of the work that has gone on around the <device> tag proposals.  (We reviewed and discussed at least http://www.w3.org/TR/html-media-capture/ and http://www.w3.org/TR/media-capture-api/, and on the HCG call Robin provided the following links to more in-progress work: http://dev.w3.org/2009/dap/camera/ and http://dev.w3.org/2009/dap/camera/Overview-API.html.)  In general, I'd characterize our position as follows: we would be extremely happy if we could reuse the DAP work, and would be happy to work with you on proposals that meet this need.  In our review to date, the main issue has been streaming: the capture API is nearly useless to us if it doesn't support streaming.  Happily, from today's HCG call it sounds like DAP is actively working on streaming.  We strongly support that direction, think it is extremely important, and will be interested to see any and all work on it.

I'm not sure what the most productive next steps would be (email discussion back and forth, some HTML Speech XG members joining a DAP audio capture conference call, some DAP members joining a Speech XG teleconference, or something else).  In general, the HTML Speech XG tries to do most of its work over the public email alias, and we also have a schedule-as-needed Thursday teleconference slot of 90 minutes starting at noon New York time.

Thanks, and I look forward to working on this important topic with you!

Michael Bodell (Microsoft)
Co-chair HTML Speech XG

Received on Saturday, 15 January 2011 05:46:43 UTC