Re: [w3ctag/design-reviews] On-device Web Speech API (Issue #1038) from Jeffrey Yasskin on 2025-05-06 (public-webapps-github@w3.org from May 2025)

From: Jeffrey Yasskin <notifications@github.com>
Date: Mon, 05 May 2025 20:02:10 -0700
To: w3ctag/design-reviews <design-reviews@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <w3ctag/design-reviews/issues/1038/2853142041@github.com>

jyasskin left a comment (w3ctag/design-reviews#1038)

Thank you for digging into the details here with us and for making several changes to the API. We'd like to reiterate that we're excited to see browsers add the ability to recognize a user's speech on their device.

* The simple prohibition on personalization makes sense for now, and it doesn't prevent a future iteration of the API from exploring what personalization might be safe and useful.
* We're glad to see the change to `SpeechRecognition.installOnDevice([lang1, lang2, ...])`, although as you noted, **this should probably have a more detailed return type than `Promise<boolean>`**.
* We're glad to see that ["cloud-only" has been removed from the explainer](https://github.com/WebAudio/web-speech-api/pull/153).
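
As a purely illustrative sketch of what "more detailed than `Promise<boolean>`" could look like, the TypeScript below models a per-language result. Every name here (`InstallStatus`, `InstallResult`, `summarizeInstall`) is hypothetical and does not come from the explainer or the Web Speech API specification. The set of statuses is likewise an assumption, and a real design would need to weigh how much each status reveals about the device.

```typescript
// Hypothetical result shape; none of these names exist in the Web Speech API.
type InstallStatus = "installed" | "unsupported" | "failed";

interface InstallResult {
  lang: string;          // BCP 47 language tag that was requested
  status: InstallStatus; // outcome for that one language
}

// Split a per-language result list into languages that are ready to use
// and languages that are not, preserving the requested order.
function summarizeInstall(
  results: InstallResult[]
): { ready: string[]; notReady: string[] } {
  const ready: string[] = [];
  const notReady: string[] = [];
  for (const r of results) {
    if (r.status === "installed") {
      ready.push(r.lang);
    } else {
      notReady.push(r.lang);
    }
  }
  return { ready, notReady };
}
```

A single boolean cannot say which of several requested languages failed, whereas a per-language result lets a page fall back for only the languages that are not ready.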

We agree that you're on the right track in the following areas:

* It makes sense to allow recipients in [End-to-End Encrypted (E2EE)](https://en.wikipedia.org/wiki/End-to-end_encryption) systems to recognize remote speech. This protects the recipient's privacy, since they don't need to disclose to other participants that they're using captions, and it supports senders who can't recognize their own speech, likely because they're using an old browser or a limited device. (People calling in by phone can't participate in E2EE systems, so there's a good argument that the server should recognize their speech to avoid [redundant processing](https://w3c.github.io/sustainableweb-wsg/#success-criterion-client-vs-server-human-testable).) We'd appreciate it if this justification could be added to the explainer. We'd also like the specification to caution websites that **it's inappropriate to use this feature to offload recognition costs from a server to users' devices unless the cost is accompanied by a compensating benefit**.
* It makes sense to allow websites to distinguish when local recognition is available immediately, after a download, or not at all, so that they can show UI that gives users the right expectations. We also appreciate that you're re-using the guidelines in https://webmachinelearning.github.io/writing-assistance-apis/#privacy. Note that this interacts with the `"ondevice-only"` option, and the spelling of `availableOnDevice()` might need to change to match the changes we're suggesting below.
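
The three availability cases named in the last bullet can be sketched as a small state-to-UI mapping. This is a minimal illustration only: the type `OnDeviceAvailability`, its string values, and the function `captionUiHint` are assumptions made for this example, not the values the explainer defines.

```typescript
// Assumed tri-state; the explainer's actual values and spelling may differ.
type OnDeviceAvailability = "available" | "downloadable" | "unavailable";

// Choose the expectation-setting message a page might show before
// starting local recognition.
function captionUiHint(availability: OnDeviceAvailability): string {
  switch (availability) {
    case "available":
      return "Captions can start right away.";
    case "downloadable":
      return "Captions need a language download first.";
    case "unavailable":
      return "On-device captions are not available for this language.";
  }
}
```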

We also have one remaining concern that we'd like to see addressed before this ships: **we think "ondevice-only" is probably the wrong category to serve the desired use cases.**

The [described Google Meet use for this](https://github.com/w3ctag/design-reviews/issues/1038#issuecomment-2837046998) is particularly inappropriate: there's no reason to think that the local models for speech recognition will have the same performance characteristics across different user agents, whether "performance" means recognition speed, energy use, or accuracy. If Meet wants to insist on using a particular quality of recognition model, it's going to need to evaluate what individual UAs do with each option, rather than making assumptions based on Chromium's behavior. This, in turn, raises the risk that sites will resort to UA sniffing instead of [feature detection](https://w3ctag.github.io/design-principles/#feature-detect), and **we'd appreciate it if the WG could explore whether there's a more direct way to support these standards for each of the quality axes**.

We do think there's a reasonable use case in ensuring that audio data isn't spread around more widely than it needs to be, but "ondevice" doesn't seem to capture the right distinction. In [discussion](https://github.com/w3ctag/meetings/blob/gh-pages/2025/telcons/04-28-minutes.md#design-reviews1038-on-device-web-speech-api---jyasskin), we identified five levels of exposure, and there may be more:

A. The same device that the UA is running on.
B. Encrypted connections to other devices that the same user owns, for example to allow a phone to offload work to the same person's desktop.
C. Various flavors of private computation on other people's computers, ranging from homomorphic encryption to [TEEs](https://en.wikipedia.org/wiki/Trusted_execution_environment).
D. "Service providers" (in the GDPR sense) for the user, who have committed not to use or expose the data outside that particular user's context.
E. Distinct data controllers (again in the GDPR sense), who could use the user's data for their own purposes. This secondary use might help subsidize free or cheaper service, but it also makes this category challenging.
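
One way to picture these levels is as an ordered scale against which a site states the widest exposure it can accept. The sketch below is illustrative only: the enum `Exposure`, its member names, and `isAcceptable` are hypothetical, and treating the levels as a strict linear order from A to E is an assumption of this example rather than something the discussion settled.

```typescript
// Hypothetical encoding of levels A-E, narrowest exposure first.
enum Exposure {
  SameDevice = 0,     // A: the device the UA is running on
  UserOwnedDevices,   // B: encrypted links to the same user's other devices
  PrivateComputation, // C: homomorphic encryption, TEEs, and similar
  ServiceProvider,    // D: processors bound to the user's context
  DataController,     // E: parties that may use the data for their own purposes
}

// True when the place recognition actually runs is no more exposed than
// the widest level the caller is willing to accept.
function isAcceptable(actual: Exposure, maxAllowed: Exposure): boolean {
  return actual <= maxAllowed;
}
```

Under this encoding, a caller that can tolerate level C would accept recognition at levels A, B, or C and reject D and E.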

"ondevice" implies at most level A, but the E2EE and regulatory use cases could handle at least level B and probably level C. **We'd like the WG to be more explicit about what's needed here, and to pick a term that captures the right distinctions.**

--
Reply to this email directly or view it on GitHub:
https://github.com/w3ctag/design-reviews/issues/1038#issuecomment-2853142041

Received on Tuesday, 6 May 2025 03:02:15 UTC