Minutes HTML Speech XG f2f
4 Nov, morning session

Present: Olli Pettay, Dan Burnett, Raj Tumuluri, ?? (Nuance), Araki, Michael Bodell, Robert Brown, Jim Barnett, ??, Satish Sampath, Bjorn Bringert, Debbie Dahl, Marc Schroeder, Paolo Baggia
Scribe: Marc

Picking up discussion from email: Same-domain and cross-domain issues: sending data across domains is a privacy concern, and also a security or DoS attack issue. This is an area that will require more discussion.

Requirements.

Let's start with a non-controversial one.
R3: Ability to bind results to specific input fields.

Bjorn: Will be needed but does not need to be in the spec.
... It is obvious that scripts on the page will get access to results, and then they can fill the fields.
Raj: If everything can be done in scripts, what do we need these requirements for?
Bjorn: Requirements about the output of this group; impossible to list everything one could do by speech, such as changing the colour of the page.
Dan: Maybe it's more about making sure that it's easy to do so, rather than "it's possible to do".
MichaelB: Agree. The list was drawn up for things that should be easy and direct.
Bjorn: Then, maybe reword: one field, or multiple fields?
Olli: Do we need this at all?
Dan: Here, there would be no copying involved.
Bjorn: How about multiple fields?
Bjorn: We could distinguish "easy" and "possible". Would filling one field be "easy", and filling multiple fields be "possible"?
Robert: The demo you (Bjorn) showed the other day showed it was easy to fill multiple fields.
Dan: Similarly, "recognition result" or "results"? N-best? 
Robert: Is this related to mixed initiative (Use case 5)?
... maybe that's a different requirement.
MichaelB: Compare to keyboard input. It is easy to fill one input field by typing on the keyboard. Using scripts, it would also be possible to type once and thereby fill several input fields. Maybe speech could work in the same way.
Bjorn: I am fine with any wording here that doesn't require us to do what VoiceXML does, multiple binding to slots.
Olli: Don't want ASR results to be bound to any input field. In my experience, X+V was terrible because it required ASR results to be bound to input fields. It's about the API. Think of events coming from the server; scripts can do something with that.
Robert: Does the word "bind" imply too much?
Dan: Would it be ok for you to have an "automatic" binding, if it is not the only one?
Olli: That would mean you have two different APIs.
Dan: Yes; we have that today for text too.
Robert: Do you want a requirement in here that says it must never be automatic?
Olli: No.

Dan: I think R3 needs to be split into two different requirements, representing different use cases: Single field vs. multiple fields. Let's talk about a single input field first.
Bjorn: (1) It should be easy to assign recognition results to a single input field.
(2) It should not be required to fill an input field every time there is a recognition result.
(3) It should be possible to use recognition results to fill multiple input fields.
Dan: Can we all accept that wording for now?
Agreement, while acknowledging that this is papering over some difficulties.
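
The three points above could be sketched in script roughly as follows. This is only an illustration of the "easy" versus "possible" distinction, not an API the group agreed on: the types Hypothesis, RecognitionResult and Field, the three helper functions, and the "from X to Y" pattern are all hypothetical names chosen for the example, and a plain object stands in for an input field so that the sketch runs without a browser.

```typescript
// Hypothetical shape of a recognition result: an n-best list of hypotheses.
interface Hypothesis {
  utterance: string;
  confidence: number; // assumed range 0..1
}
type RecognitionResult = Hypothesis[];

// Minimal stand-in for an input field, so the sketch runs without a DOM.
interface Field {
  value: string;
}

// (1) Easy case: put the top hypothesis into one field.
function fillSingleField(field: Field, result: RecognitionResult): void {
  const top = result[0];
  if (top !== undefined) {
    field.value = top.utterance;
  }
}

// (3) Possible case: a script splits one result across several fields.
// The "from X to Y" pattern is an arbitrary example.
function fillTripFields(from: Field, to: Field, result: RecognitionResult): boolean {
  const top = result[0];
  if (top === undefined) return false;
  const match = /^from (.+) to (.+)$/i.exec(top.utterance);
  if (match === null) return false;
  from.value = match[1];
  to.value = match[2];
  return true;
}

// (2) Nothing forces a field to be filled: a script may only inspect the result.
function isCommand(result: RecognitionResult, command: string): boolean {
  return result.some(h => h.utterance.toLowerCase() === command.toLowerCase());
}
```

In this sketch the single-field case is one call, the multi-field case needs application logic, and isCommand shows a result being consumed without touching any field.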


R33: User agents need a way to enable end users to grant permission to an application to listen to them.

This is about the user not having to confirm every single time.
On Tuesday, we agreed that it is up to the UA, and then the user just chooses the UA they want.
Dan: So maybe we don't need this requirement at all, it is already covered by R29.
Bjorn: Did we actually reach an agreement on that?
Michael: As I remember it, we discussed R29 + R33, and decided it was up to the UA policy somehow.
Dan: We did change the wording of R29 to "capture". And R33 is about ways in which the user can give consent.
(some discussion of various aspects)
No doubt there is the requirement for the user to give consent in some way.
Wording proposed by Bjorn: "Web applications must not capture audio without the user's consent."
Agreed.

A separate issue is the *mechanism* by which the user gives consent. There are different possible ways for the user to give that consent, e.g. by clicking ok every time, or by downloading the UA that does capture, or a range of options in between.

Bjorn: I don't want google.com to pop up a window saying "do you want to allow this page to capture your speech", because 99.9% of users will not want to, and I don't want to inconvenience them.
Dan: Example of "file upload" button is good: It cannot be initiated by the web app.
Bjorn: Yes, but then once the user clicks the button, there is no window popping up asking "do you want to allow this?".
Michael: I may install a firefox plugin that warns me every time a character I type is sent to a server. I wouldn't want the spec to forbid that.
Jeremy: This seems specific to the UA to me. Let different UAs experiment with different options.
Michael: Agree: The requirement is: It must be possible for the UA to prompt the user whether to allow this.
Bjorn: And, it must be possible to start speech input in response to user action such as clicking the microphone button.
Marc: Think of "informed consent" as in psychological experiments, as needed to get an experiment design through the ethics committee. If you click the button "send me 1000$", that should not trigger speech input. On the other hand, clicking a microphone button could be considered informed consent.
Jeremy: Yes, important to avoid click-jacking.
Dan+Bjorn: Notification and cancelling speech input must be possible, but those are different requirements.
Dan: "It must be possible for the user to revoke consent at any time, including while capturing".
?? (Nuance): Nervous about the backing out part.
Dan: You can never get back audio that was sent.

To summarise. 
New requirement: "User consent should be informed consent."
Agreed.

Now, about notification and cancellation.
R32 covers the notification.
Let's work on the cancellation.

Bjorn proposes wording: "While capture is happening, there must be an obvious way for the user to abort the capture and recognition process."
Agreed.
Abort should be defined in the spec, with an explanation such as "as soon as you can, stop capturing, stop processing for recognition, and stop processing any recognition results". There are two aspects to this: privacy; and the app gets as little as possible.

Michael: Separate from this: "It must be possible for the user to revoke consent" (in general, not for the specific capture event).
Agreed.
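
The consent, abort and revocation wordings agreed so far could be modelled roughly as below. This is a minimal sketch under stated assumptions, not a proposed API: the class CaptureSession and all of its method names are hypothetical, and the choice that revoking consent also aborts a capture in progress is an assumption made for the example (the minutes only say that revocation must be possible).

```typescript
// Hypothetical model of one UA-side capture session. None of these names
// come from a spec; they only illustrate the agreed wording.
class CaptureSession {
  private consent = false;
  private capturing = false;
  private delivered: string[] = [];

  grantConsent(): void {
    this.consent = true;
  }

  // Assumption: revoking consent in general also ends a capture in progress.
  revokeConsent(): void {
    this.consent = false;
    this.abort();
  }

  // "Web applications must not capture audio without the user's consent."
  start(): boolean {
    if (!this.consent) return false;
    this.capturing = true;
    return true;
  }

  // Abort: stop capturing and stop processing; later results are dropped.
  abort(): void {
    this.capturing = false;
  }

  // Called by the (hypothetical) recogniser when a result is ready.
  // Results arriving after an abort never reach the web app.
  onRecognizerResult(text: string): void {
    if (this.capturing) this.delivered.push(text);
  }

  resultsSeenByApp(): readonly string[] {
    return this.delivered;
  }

  isCapturing(): boolean {
    return this.capturing;
  }
}
```

Here abort() only ends the current capture and leaves consent in place, while revokeConsent() also blocks any later start(), which mirrors the distinction between the two requirements.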


Topic: New requirement: User-initiated speech input.
Robert: "User-initiated speech input should be possible."
Agreed.

Bjorn: And web apps need a way to inform the UA that they support speech input.

Topic: Privacy policy.
New requirement.
Bjorn: "The spec should not unnecessarily restrict the UA's choice in privacy policy."
Raj: Don't like the word "unnecessarily" -- it is too vague.
Marc: It should guide us in writing the spec, not end up in the spec.
Raj: That's ok, as long as the word "unnecessarily" does not end up in the spec.

So these new requirements now replace R29 and R33.

--------

R4: Web application must be notified when recognition occurs
Why do we have this requirement? Is it not obvious?
- the UA should not just fill fields without the web app knowing
- start notification vs. end notification, sequence of events

Relevant events (order not yet specified): audio capture starts; speech starts; end of speech; end of capture; recognition results available

New requirements replacing R4:
"The web app should be notified that capture starts."

"The web app should be notified that speech is considered to have started for the purposes of recognition."

"The web app should be notified that speech is considered to have ended for the purposes of recognition."

"The web app should be notified when recognition results are available."
(these may be partial results, and may occur several times)

Agreed.
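
The four notifications replacing R4 could look roughly like this from the web app's side. This is a minimal sketch, not an agreed design: the event names "capturestart", "speechstart", "speechend" and "result", the SpeechEvent fields, and the SpeechEventSource dispatcher are all hypothetical, and the order in which the events are emitted below is just one possibility, since the order has not yet been specified.

```typescript
// Hypothetical event names, one per requirement above.
type SpeechEventType = "capturestart" | "speechstart" | "speechend" | "result";

interface SpeechEvent {
  type: SpeechEventType;
  // Only set for "result"; partial results have final === false.
  text?: string;
  final?: boolean;
}

type Listener = (event: SpeechEvent) => void;

// Tiny dispatcher standing in for whatever the UA would provide.
class SpeechEventSource {
  private listeners = new Map<SpeechEventType, Listener[]>();

  on(type: SpeechEventType, listener: Listener): void {
    const list = this.listeners.get(type) ?? [];
    list.push(listener);
    this.listeners.set(type, list);
  }

  emit(event: SpeechEvent): void {
    for (const listener of this.listeners.get(event.type) ?? []) {
      listener(event);
    }
  }
}

// Web app side: record every notification as it arrives.
const log: string[] = [];
const source = new SpeechEventSource();
source.on("capturestart", () => { log.push("capture started"); });
source.on("speechstart", () => { log.push("speech started"); });
source.on("speechend", () => { log.push("speech ended"); });
source.on("result", e => {
  log.push((e.final ? "final: " : "partial: ") + (e.text ?? ""));
});

// Simulated UA side: one possible sequence, with a partial result first.
source.emit({ type: "capturestart" });
source.emit({ type: "speechstart" });
source.emit({ type: "result", text: "flights to", final: false });
source.emit({ type: "speechend" });
source.emit({ type: "result", text: "flights to Boston", final: true });
```

The "result" listener fires twice here, which matches the note that results may be partial and may occur several times.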