- From: Satish Sampath <satish@google.com>
- Date: Thu, 7 Oct 2010 16:03:20 +0100
- To: Michael Bodell <mbodell@microsoft.com>
- Cc: "public-xg-htmlspeech@w3.org" <public-xg-htmlspeech@w3.org>
Hi Michael,

The draft looks good and is a great starting point. Some comments below:

> U3. Domain Specific Grammars Contingent on Earlier Inputs

This feels like a subset of the 'U5. Domain Specific Grammars Filling Multiple Input Fields' use case. Is there a reason to keep them as two separate use cases?

> U7. Rerecognition

I think rerecognition is an advanced feature and perhaps something to explore at a later stage. Should we consider removing it from this list for now (since the list is not exhaustive anyway)?

> U8. Voice Activity Detection

While the UA may use voice activity detection at the start of speech, at the end, or both, I don't think it should be exposed as part of the HTML speech API. In particular, I'm concerned about letting arbitrary web pages initiate speech input automatically and obtain recognition results without the user's consent. This is the reason our draft proposal to the XG required an explicit user action to start recognition (e.g. by clicking on a mic button). The main use case mentioned here is hands-free dialog, which could be handled by using continuous recognition while still having the user initiate the dialog with a gesture (click, touch or something similar); see the rough sketch below. This assures the user that third parties cannot snoop on their speech/audio without their knowledge, while also avoiding unwanted notifications and permission popups (users are known to dismiss them without paying much attention). I see R29 in the security section mentions this, so I wonder whether U8 conflicts with it.
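To make that concrete, here is a very rough sketch of the flow I have in mind. None of the names used below (SpeechInputRequest, continuous, onresult, onend) are real or proposed API; they are placeholders purely to illustrate gesture-initiated continuous recognition with the recording UI left to the UA:

```javascript
// Rough sketch only: 'SpeechInputRequest', 'continuous', 'onresult' and
// 'onend' are hypothetical placeholder names, not a concrete proposal.
var micButton = document.getElementById('mic');

micButton.addEventListener('click', function () {
  // Recognition starts only from an explicit user gesture; the UA (not
  // the page) renders the recording indicator and any permission UI.
  var speech = new SpeechInputRequest();
  speech.continuous = true;  // keep listening for the whole dialog

  speech.onresult = function (event) {
    // Each recognized utterance drives one turn of the dialog, which the
    // page can implement however it likes (plain Javascript here).
    handleDialogTurn(event.result);
  };

  speech.onend = function () {
    // Session ended by the user or the UA; nothing keeps recording.
  };

  speech.start();
}, false);

// Application-specific dialog logic (stub for the sketch).
function handleDialogTurn(result) {
  console.log('User said: ' + result);
}
```

The only important part is the shape: the page supplies the trigger element and the dialog logic, while everything that involves capturing audio stays under the UA's control (which is also what I have in mind for R13 below).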
> U13. Dialog Systems

The last two sentences of this use case talk about implementation details (VXML, XMLHttpRequest, ...) and I think we should remove them. The use case itself seems fine, as dialog-based systems are useful in various contexts, but in a web app the dialog system could be implemented in many ways: plain Javascript, interpreting a proprietary dialog description language, interpreting VXML in Javascript, a UA which natively supports VXML, and so on.

> R1. Web author needs full control over specification of speech resources
> R16. Web application authors must not be excluded from running their own speech service

I think these two requirements should be removed, with the choice of speech resources left to the UA. In my mind speech input is analogous to keyboard and other forms of input, and each user prefers to use the device/platform/OS which suits them best. Some people use an English keyboard while others use one in their native tongue, some use IME software to manage both at the same time, some get ergonomic keyboards, and so on. In all these cases the applications and websites are happy to simply work with the text/data they are given. In the speech input context the recognizer becomes part of the device/platform/OS and not something chosen by a web page. In practical terms, having the UA manage speech resources means:

- the user gets consistent results whatever website they visit;
- the recognizer can train itself to perform better over time, since all of the user's speech goes through it;
- the user can easily upgrade or purchase new software if they need to improve on that.

I see R31 mentions this as well.

> R11. Web application author must integrate input from multiple modalities

Should this be 'Web application author must be able to integrate input from multiple modalities'?

> R13. Web application author should have ability to customize speech recognition graphical user interface

This is a tricky one to get right, because the ability to customise it could also weaken security and raise privacy concerns. For example, a web page could have a button which says 'Click to win $100' and silently start recording audio when the user clicks it. In my opinion the speech input UI, notifications and progress updates should not be customisable and should instead be rendered by the UA in the same manner across all web pages. This lets users notice when arbitrary web pages start recording their speech, so they can take appropriate action. We could still let the web page have a customisable interface for starting speech input, as long as the rest of the speech input UI experience is managed by the UA. I see R31 talks about this, so I wonder whether it conflicts with R13.

> R17. User perceived latency of recognition must be minimized
> R18. User perceived latency of synthesis must be minimized

Is it realistic to have these as requirements, especially since the recognizer or synthesizer could very well be a remote server?

Cheers
Satish

On Mon, Oct 4, 2010 at 11:55 AM, Michael Bodell <mbodell@microsoft.com> wrote:
> I've now taken the original collated list of 70 use cases and requirements
> from http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0051.html
> and created a first draft of a document that combines like use cases and
> requirements and organizes the remaining 15 use cases and 34 requirements
> into different related sections. I've also (generously) linked the
> requirements back to the use cases that support them. For contribution I
> also took the style of the VBWG (everyone listed in the editors section, not
> a separate editors and authors section); apologies in advance if I missed
> someone. I took the people who were linked in the earlier collation above
> (and I wasn't sure what the organization was for the two people who aren't
> members of the XG).
>
> As always, if there are some use cases or requirements that could be made
> more clear or added, that would be great.
>
> For a next step I've asked Dan to consider running a poll that will help us
> prioritize the use cases and requirements so we can start by focusing the
> discussion on the use cases and requirements that have the highest priority.