Re: Organized first draft of Use Case and Requirements Document

Hi Michael,

The draft looks good and is a great starting point. Some comments below:

> U3. Domain Specific Grammars Contingent on Earlier Inputs

This feels like a subset of the 'U5. Domain Specific Grammars Filling
Multiple Input Fields' use case. Is there a reason to keep them as two
separate use cases?

> U7. Rerecognition

I think rerecognition is an advanced feature and perhaps something to
explore at a later stage. Should we consider removing it from this
list for now (since it is not an exhaustive list anyway)?

> U8. Voice Activity Detection

While the UA may use voice activity detection to detect the start of
speech, the end of speech, or both, I don't think it should be exposed
as part of the HTML speech API. In particular I'm concerned about
letting random web pages initiate speech input automatically and get
recognition results without the user's consent. This is the reason our
draft proposal to the XG required an explicit user action to start
recognition (e.g. by clicking on a mic button).

The main use case mentioned here is hands-free dialog, which could be
supported with continuous recognition as long as the user initiates
the dialog with a gesture (a click, touch or something similar). This
assures the user that third parties cannot snoop on their speech/audio
without their knowledge, while also avoiding unwanted notifications
and permission popups (which users are known to dismiss without paying
much attention).

I see "R29" in the security section mention this, so I wonder if U8
conflicts that.
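
To make the gesture-initiated approach concrete, here is a minimal
sketch. SpeechInputRequest, the 'continuous' flag and the result event
shape are purely placeholders of my own (we haven't defined an API
yet), not anything taken from the draft:

  // Hypothetical API: SpeechInputRequest, the 'continuous' flag and
  // the result event shape are illustrative placeholders only.
  var micButton = document.getElementById('mic-button');
  var request = new SpeechInputRequest({ continuous: true });

  request.onresult = function (event) {
    // Each recognized utterance drives one turn of the dialog.
    handleDialogTurn(event.result.transcript);
  };

  // Recognition only ever starts from an explicit user gesture, and
  // the UA renders its own recording indicator.
  micButton.addEventListener('click', function () {
    request.start();
  });

  function handleDialogTurn(utterance) {
    // Application-specific dialog logic would go here.
    console.log('User said: ' + utterance);
  }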

> U13. Dialog Systems

The last two sentences of this use case talk about implementation
details (VXML, XMLHttpRequest, ...) and I think we should remove them.
The use case itself seems fine, as dialog-based systems are useful in
various contexts, but in a web app the dialog system could be
implemented in many ways: plain Javascript, interpreting a proprietary
dialog description language, interpreting VXML in Javascript, a UA
which natively supports VXML, and so on.
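
For instance, a plain Javascript dialog loop might look roughly like
the sketch below. Again this is purely illustrative: SpeechInputRequest
and speak() are hypothetical placeholders for whatever recognition and
synthesis API we end up defining.

  // Hypothetical: SpeechInputRequest and speak() stand in for whatever
  // recognition/synthesis API the group eventually defines.
  var dialog = [
    { prompt: 'Which city are you flying from?', field: 'origin' },
    { prompt: 'Which city are you flying to?', field: 'destination' }
  ];
  var answers = {};
  var step = 0;

  function nextTurn() {
    if (step >= dialog.length) {
      console.log('Dialog complete:', answers);
      return;
    }
    speak(dialog[step].prompt);              // hypothetical TTS call
    var request = new SpeechInputRequest();  // hypothetical recognizer
    request.onresult = function (event) {
      answers[dialog[step].field] = event.result.transcript;
      step++;
      nextTurn();
    };
    request.start();
  }

  // As above, the whole dialog is kicked off by an explicit user
  // gesture rather than starting automatically.
  document.getElementById('start-dialog').addEventListener('click',
      nextTurn);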

> R1. Web author needs full control over specification of speech resources
> R16. Web application authors must not be excluded from running their own speech service

I think these two requirements should be removed and left to the UA.
In my mind speech input equates to keyboard and other forms of input,
and each user prefers to use a device/platform/OS which suits them
best. Some people use an English keyboard while others use one in
their native tongue, some use IME software to manage both at the same
time, some get ergonomic keyboards, and so on. In all these cases
applications and websites are happy to simply work with the text/data
they are given.

In the speech input context the recognizer becomes part of the
device/platform/OS and is not something chosen by a web page. In
real-life terms, having the UA manage speech resources means:
- the user gets consistent results whichever website they visit
- the recognizer can train itself to perform better over time, since
all of the user's speech goes through it
- the user can easily upgrade or purchase new software if they need to
improve on that.

I see R31 mentions this as well.

> R11. Web application author must integrate input from multiple modalities

Should this be 'Web application author must be able to integrate input
from multiple modalities' ?

> R13. Web application author should have ability to customize speech recognition graphical user interface

This is a tricky one to get right, because the ability to customise
could also weaken security and raise privacy concerns. For example, a
web page could have a button which says 'Click to win $100' and
silently start recording audio when the user clicks it. In my opinion
the speech input UI, notifications and progress updates should not be
customisable and should instead be rendered by the UA in the same
manner across all web pages. This lets users know when random web
pages start recording their speech, so they can take appropriate
action.

We could still let the web page have a customisable interface for
starting speech input, as long as the rest of the speech input UI
experience is managed by the UA. I see R31 talks about this, so I
wonder whether it conflicts with R13.

> R17. User perceived latency of recognition must be minimized
> R18. User perceived latency of synthesis must be minimized

Is it realistic to have these as requirements, especially since the
recognizer or synthesizer could very well be a remote server?

Cheers
Satish



On Mon, Oct 4, 2010 at 11:55 AM, Michael Bodell <mbodell@microsoft.com> wrote:
> I've now taken the original collated list of 70 use cases and requirements
> from
> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2010Sep/0051.html and
> created a first draft of a document that combines like use cases and
> requirements and organizes the remaining 15 use cases and 34 requirements
> into different related sections.  I've also (generously) linked the
> requirements back to the use cases that support them.  For contribution I
> also took the style of the VBWG (everyone listed in the editors section, not
> a separate editors and authors section), apologies in advance if I missed
> someone, I took the people who were linked in the earlier collation above
> (and I wasn't sure what the organization was for the two people who aren't
> members of the XG).
>
>
>
> As always, if there are some use cases or requirements that could be made
> more clear or added, that would be great.
>
>
>
> For a next step I've asked Dan to consider running a poll that will help us
> prioritize the use cases and requirements so we can start by focusing the
> discussion on the use cases and requirements that have the highest priority.
