Re: [w3ctag/design-reviews] On-device Web Speech API (Issue #1038) from Jeffrey Yasskin on 2025-04-15 (public-webapps-github@w3.org from April 2025)

From: Jeffrey Yasskin <notifications@github.com>
Date: Mon, 14 Apr 2025 20:42:23 -0700
To: w3ctag/design-reviews <design-reviews@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <w3ctag/design-reviews/issues/1038/2803693504@github.com>
jyasskin left a comment (w3ctag/design-reviews#1038)

We discussed this in a breakout today, and we're enthusiastic about helping websites recognize a user's speech on their local device, in line with our [data minimization guidance](https://www.w3.org/TR/privacy-principles/#data-minimization). Thank you for working on that.

With that said, we'd like to request several clarifications and API improvements:

#### Restricting recognition location

We see sites' desire to guarantee that speech isn't sent outside of the local device (for example to a cloud speech recognizer operated by the browser), especially for end-to-end-encrypted (E2EE) services. This is hard to actually guarantee: a valid browser architecture, especially for low-power devices, is to run the entire browser in the cloud and stream its UI down to the user. Similarly, a UA running on a low-power device might prefer to run recognition in the cloud, perhaps to improve recognition quality. With those caveats, some of the TAG supports giving the site a way to encourage the browser to keep the speech local. Or, perhaps equivalently, for the site to tell the UA that it's going to keep the audio constrained to as few devices as possible, as a hint that the UA should do the same. This should not be a strong UA requirement, which means that it should not be called `ondevice-only`.

We don't see a reason to let sites require the recognition to happen in the cloud if the user prefers it to happen locally. Forcing a user to send unnecessary data to their UA's servers would violate our [data minimization principles](https://www.w3.org/TR/privacy-principles/#data-minimization). **So we recommend that `recognition.mode == "cloud-only"` not exist.** Perhaps just offer `recognition.localOnly=true` or `recognition.onlyOnDevice=true`, or have sites check the return value of the `requestLocalRecognition([lang1, lang2, ...])` function suggested below?
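
As a rough illustration of the "hint, not requirement" idea, here is a minimal TypeScript sketch of how a UA might weigh a site's hint against the user's own choice. Every name in it (`SiteHints`, `UAState`, `chooseLocation`, `preferLocal`, `defaultLocation`) is hypothetical and invented for this sketch; none of them come from the Web Speech API or the explainer.

```typescript
type RecognitionLocation = "local" | "cloud";

// Hypothetical site-provided hint. The names floated above are `localOnly`
// and `onlyOnDevice`; `preferLocal` is just a placeholder for this sketch.
interface SiteHints {
  preferLocal: boolean;
}

// Hypothetical UA-side state, invented for illustration only.
interface UAState {
  userPreference: RecognitionLocation | "no-preference";
  localModelAvailable: boolean;
  defaultLocation: RecognitionLocation;
}

function chooseLocation(hints: SiteHints, ua: UAState): RecognitionLocation {
  // The hint is not a guarantee: with no local model this sketch falls back
  // to the cloud. A real UA might instead offer a download or report an error.
  if (!ua.localModelAvailable) return "cloud";
  // An explicit user choice outranks anything the site asks for.
  if (ua.userPreference !== "no-preference") return ua.userPreference;
  // The site can only nudge toward local; no input lets it demand the cloud.
  return hints.preferLocal ? "local" : ua.defaultLocation;
}
```

Note that `SiteHints` deliberately has no way to express "cloud only", which mirrors the recommendation that such a mode not exist.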
  
We also note that there's an existing `SpeechSynthesis` API that mentions `localService` for voices. Could you add an analysis of how you're intentionally matching or diverging from that design?

#### Recognizing other users' speech

It's not clear whether it makes sense to have the browser download language packs to help websites recognize speech that comes from sources other than the current user speaking. This hesitance comes from two angles:
1. It seems more efficient to recognize speech once, and share the recognized text with all recipients, rather than having each recipient redundantly recognize it for themselves. (https://w3c.github.io/sustainableweb-wsg/#success-criterion-client-vs-server-human-testable)
2. The download costs users something, and managing that makes the API significantly more complicated.

We could use a **better description of the use cases** that need these costs and complexity before we'll be comfortable endorsing that capability.

If this capability is justified, a site should request downloads of all of the needed languages in a single permission prompt. Without that, the user might misunderstand how much data they need to download before it would benefit them, which compromises their ability to [consent](https://w3ctag.github.io/design-principles/#consent). So, if this capability is needed, **we suggest an API more like `SpeechRecognition.requestLocalRecognition([lang1, lang2, ...])`**.
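
To make the single-prompt idea concrete, here is a small TypeScript sketch of the bookkeeping a UA could do before showing one prompt for a batch of languages. `LanguagePack`, `InstallPlan`, and `planInstall` are hypothetical names, and the catalog shape and sizes are assumptions made purely for illustration; this is not a proposed algorithm for the specification.

```typescript
// Hypothetical catalog entry for one language pack known to the UA.
interface LanguagePack {
  lang: string; // BCP 47 language tag, e.g. "en-US"
  sizeBytes: number;
  installed: boolean;
}

// What a single prompt would need to tell the user.
interface InstallPlan {
  toDownload: string[];
  totalBytes: number;
  unsupported: string[];
}

function planInstall(requested: string[], catalog: LanguagePack[]): InstallPlan {
  const plan: InstallPlan = { toDownload: [], totalBytes: 0, unsupported: [] };
  // Drop duplicate tags so a pack is never counted twice.
  const unique = requested.filter((lang, i) => requested.indexOf(lang) === i);
  for (const lang of unique) {
    const pack = catalog.filter((p) => p.lang === lang)[0];
    if (pack === undefined) {
      plan.unsupported.push(lang);
    } else if (!pack.installed) {
      plan.toDownload.push(lang);
      plan.totalBytes += pack.sizeBytes;
    }
    // Already-installed packs cost the user nothing and are skipped.
  }
  return plan;
}
```

With all languages requested up front, the prompt can state the full download size once, instead of revealing it piecemeal across several prompts.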

#### Personalization

In cases that are only recognizing the primary user's speech, we'd like the specification to analyze when it's safe and useful to personalize the speech recognition to that user. This could be as simple as "UAs MUST NOT personalize speech recognition", but we think there might be some utility in letting the [start()](https://webaudio.github.io/web-speech-api/#dom-speechrecognition-start) method that doesn't take a MediaStreamTrack be personalized in order to better recognize the user's speech. For fingerprinting and other privacy reasons, we think personalization is only feasible if the site can't get both the audio and the personalized recognized text, and there might be other risks or problems we haven't thought of.

#### Fingerprinting

We note that `availableOnDevice()` has some inherent fingerprinting risks, similar to [those created by `permissions.query()`](https://www.w3.org/TR/permissions/#privacy-considerations). The explainer currently just says this lets sites "determine whether to enable on-device features or fall back to cloud-based recognition", which would be enabled just as well by a better return value for the install function, probably getting inspiration from [the `Translator.create()` interface](https://github.com/webmachinelearning/translation-api?tab=readme-ov-file#download-progress) (cc/ @domenic). **What specific site UI needs this sort of no-UI query function?**
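
For illustration, a site could branch on the outcome of the install call itself rather than on a separate no-UI query. The TypeScript sketch below assumes a hypothetical `InstallResult` shape and a hypothetical `pickBackend` helper; neither exists in the specification or in the Translation API.

```typescript
// Hypothetical outcome of an install-style call. None of these status
// names come from the specification.
type InstallResult =
  | { status: "available" } // packs were already present or were just installed
  | { status: "declined" } // the user said no to the download
  | { status: "unsupported"; langs: string[] }; // no pack exists for these languages

type Backend = "on-device" | "site-cloud-fallback";

// Site-side decision that uses only the install result.
function pickBackend(result: InstallResult): Backend {
  return result.status === "available" ? "on-device" : "site-cloud-fallback";
}
```

Because the site only learns the outcome after going through the UA's install flow, this shape exposes less to silent probing than a query that any script can call without UI.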
  
We noticed an additional fingerprinting risk, that the **exact version of the downloaded language pack is likely detectable and likely to skew** compared to the [browser's major version](https://wicg.github.io/ua-client-hints/#user-agent-significant-version). The specification should [identify that risk](https://www.w3.org/TR/fingerprinting-guidance/#mark-fingerprinting) and suggest or mandate ways to mitigate it. For example, to remove the risk entirely, you could require that each browser major version can only use 1 version of each language pack and that the packs are deleted when the user upgrades to an incompatible browser version. To only mitigate it, you could remove the `start(MediaStreamTrack)` overload or require [packs be deleted when storage is cleared](https://www.w3.org/TR/fingerprinting-guidance/#clearing-all-local-state).
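
The "one pack version per browser major version" mitigation could look roughly like the following TypeScript sketch. `InstalledPack`, `builtForBrowserMajor`, and `partitionPacks` are hypothetical, and treating any mismatch in major version as "incompatible" is a simplifying assumption of this sketch.

```typescript
// Hypothetical record of a pack stored on the device.
interface InstalledPack {
  lang: string;
  packVersion: string;
  builtForBrowserMajor: number;
}

// Split stored packs into those the current browser may use and those to
// delete. Assumption: a pack is compatible only with the exact browser
// major version it was built for.
function partitionPacks(
  installed: InstalledPack[],
  browserMajor: number
): { keep: InstalledPack[]; remove: InstalledPack[] } {
  const keep: InstalledPack[] = [];
  const remove: InstalledPack[] = [];
  for (const pack of installed) {
    if (pack.builtForBrowserMajor === browserMajor) {
      keep.push(pack);
    } else {
      remove.push(pack);
    }
  }
  return { keep, remove };
}
```

Under this rule the pack version reveals nothing beyond the browser's major version, since every user on that major version has the same pack.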

#### UI and user choices

The explainer says "user agents must obtain explicit and informed user consent before installing", but that doesn't appear in the specification, and the algorithm for [installOnDevice()](https://webaudio.github.io/web-speech-api/#dom-speechrecognition-installondevice) just says the UA "can prompt the user". This seems insufficient.

While the old version of this API exists, it would be good for browsers supporting it to help their users get local recognition even on sites that haven't adopted the new options. Could the spec suggest this? Would an omnibox icon be an appropriate UI in your Chromium implementation?
  
Similarly, there might be cases where a user actively wants the recognition to happen in the cloud, for example if their battery is unusually low, they want improved cloud-based recognition, or (speculating) to use a particular cloud service if there are several options with privacy tradeoffs. The same UI might be appropriate for giving them that control.

#### Nits

* In comparing this API to the proposed [Translation API](https://github.com/webmachinelearning/translation-api?tab=readme-ov-file#destruction-and-aborting), we noticed that the proposed "download it" function here is missing an [AbortSignal parameter](https://w3ctag.github.io/design-principles/#aborting) to let the site abort the download.
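
As a sketch of what the abort parameter buys a site, the TypeScript below simulates a chunked download that checks an `AbortSignal` between chunks. `downloadPack` and its parameters are hypothetical. A real API would presumably be promise-based and reject with an `AbortError`; this synchronous loop only illustrates the checkpoint.

```typescript
// Hypothetical, synchronous stand-in for a language pack download.
function downloadPack(
  totalChunks: number,
  signal: AbortSignal,
  onChunk: (index: number) => void
): "completed" | "aborted" {
  for (let i = 0; i < totalChunks; i++) {
    // Check before each chunk so an abort stops further transfer.
    if (signal.aborted) return "aborted";
    onChunk(i);
  }
  return "completed";
}

// Usage: the site aborts once the second chunk has arrived, for example
// because the user navigated away from the feature.
const controller = new AbortController();
const received: number[] = [];
const outcome = downloadPack(5, controller.signal, (i) => {
  received.push(i);
  if (i === 1) controller.abort();
});
```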

-- 
Reply to this email directly or view it on GitHub:
https://github.com/w3ctag/design-reviews/issues/1038#issuecomment-2803693504

Received on Tuesday, 15 April 2025 03:42:27 UTC