Re: [w3ctag/design-reviews] On-device Web Speech API (Issue #1038) from Evan Liu on 2025-04-18 (public-webapps-github@w3.org from April 2025)

From: Evan Liu <notifications@github.com>
Date: Fri, 18 Apr 2025 11:28:07 -0700
To: w3ctag/design-reviews <design-reviews@noreply.github.com>
Cc: Subscribed <subscribed@noreply.github.com>
Message-ID: <w3ctag/design-reviews/issues/1038/2815982645@github.com>

evanbliu left a comment (w3ctag/design-reviews#1038)

We appreciate the TAG's thoughtful and thorough feedback on our proposal to add on-device speech recognition support to the Web Speech API. We discussed this feedback at the monthly Audio Working Group meeting with representatives from **Google and Mozilla**. Below, we address each of the major concerns raised:
### 1. Restricting Recognition Location
We accept the TAG's recommendation to **remove the `cloud-only` option** from the API. This aligns with the current direction of implementation across browsers: Firefox has no plans to support cloud-only recognition, and Chrome is also moving away from it.
That said, we acknowledge that cloud-based speech recognition may still be preferred in certain situations, such as:
* On **low-power devices** that lack sufficient compute resources.

* When **resource-intensive on-device features** are already in use.

* When **cloud recognition offers better quality** in specific contexts.

Importantly, we must also support use cases where **audio must not be sent to third-party services**, such as for **regulatory or compliance reasons**. While we acknowledge that **confidential computing** may eventually offer a viable solution for such cases, we do **not intend to support that path at this time**. Instead, we aim to design the API to be **extensible**, so that support for confidential computing can be added later if needed.

To accommodate current needs, **user agents that cannot guarantee local-only processing may throw an error**, allowing websites to make informed decisions based on the level of assurance required.
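
As a rough illustration of how a site might react to such an error, the sketch below wraps recognizer start-up in a `try`/`catch` and reports the outcome. Everything here is an assumption for illustration only: `RecognizerLike`, `StartOutcome`, and `startLocalOnly` are hypothetical names, and the proposal text above does not define which call throws or what error type is used.

```typescript
// Hypothetical shape: stands in for a recognizer configured for local-only
// processing. The real option name and error type are not specified above.
interface RecognizerLike {
  start(): void; // assumed to throw if local-only processing cannot be guaranteed
}

type StartOutcome =
  | { started: true }
  | { started: false; reason: string };

// Try to start recognition; convert a thrown error into a value the page can act on.
function startLocalOnly(createRecognizer: () => RecognizerLike): StartOutcome {
  try {
    const recognizer = createRecognizer();
    recognizer.start();
    return { started: true };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { started: false, reason };
  }
}
```

A site with strict compliance requirements could then disable voice input when `started` is `false`, rather than silently falling back to a remote service.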
### 2. Recognizing Other Users' Speech
We recognize the efficiency argument in centralized recognition (e.g., WebRTC). However, we believe **MediaStreamTrack support enables flexibility**, allowing:
* **Sender-side recognition**, where the originator of speech recognizes and distributes captions.

* **Receiver-side recognition**, for scenarios where personalized or local processing is preferred.

We're open to **extending the installation API to support multiple language packs** in a single call (e.g., `installOnDevice([lang1, lang2, ...])`), improving both user experience and consent clarity.
We note, however, that the current API returns a `Promise<boolean>` indicating installation success. Supporting multiple languages could introduce ambiguity if one language fails. We will explore a more expressive return format to address this.
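
One possible shape for a more expressive result is a per-language status map, sketched below. This is not a proposed signature: `installAll`, `InstallStatus`, and the `installOne` callback (standing in for a single-language install that resolves to a boolean) are hypothetical names chosen for the example.

```typescript
// Hypothetical per-language status; the return format is still to be explored.
type InstallStatus = "installed" | "failed";

// `installOne` stands in for a single-language install returning Promise<boolean>.
async function installAll(
  langs: string[],
  installOne: (lang: string) => Promise<boolean>,
): Promise<Record<string, InstallStatus>> {
  const entries = await Promise.all(
    langs.map(async (lang) => {
      // A rejected install is treated the same as one that resolves to false.
      const ok = await installOne(lang).catch(() => false);
      return [lang, ok ? "installed" : "failed"] as const;
    }),
  );
  const result: Record<string, InstallStatus> = {};
  for (const [lang, status] of entries) {
    result[lang] = status;
  }
  return result;
}
```

With a result like this, a caller can tell which language failed instead of receiving a single `false` for the whole batch.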
### 3. Personalization
We accept the TAG's recommendation and will add language to the specification stating that:

**User agents MUST NOT personalize speech recognition.**

We agree that this approach avoids fingerprinting and privacy risks while keeping the API simpler and more secure.
### 4. Fingerprinting
We acknowledge the fingerprinting risks posed by functions like `availableOnDevice()`. To mitigate these risks:
* We will **align with the Web Translation API’s privacy-preserving approach**.

* **Both Chrome and Firefox will support only one language pack per language at a time**, reducing the variability that could lead to fingerprinting.

* **Language packs will be cleared when browser storage is cleared**, ensuring consistent behavior with other privacy controls.

For Chrome specifically, we will **implement the same fingerprinting mitigations** used in the Web Translation API. These are detailed in the following document: [Fingerprinting Mitigations](https://docs.google.com/document/d/1-9m-oe1x34nM2mCTzPsnSZIlSfYQXbZ7_SrvtMD1wuw/edit?usp=sharing).
### 5. UI and User Choice
We acknowledge that some users may prefer cloud-based recognition in certain scenarios—for example, to conserve battery or access higher-quality recognition. In principle, we support giving users control over this choice through appropriate UA-level mechanisms.

However, **neither Chrome nor Firefox currently plans to expose explicit UI-level controls for this**. Firefox is committed to supporting only on-device speech recognition. Chrome is also moving away from cloud-based recognition and plans to phase it out over time.

The specification will be updated to use the phrasing **"may prompt the user"** to allow **user agents the flexibility** to implement privacy-preserving countermeasures in a way that best fits their platform and user base, without prescribing a specific UI. For instance, **Chrome will rely on non-UI mechanisms** to mitigate fingerprinting risks, rather than requiring explicit user prompts.
### 6. Nits
We are **open to adding an `AbortSignal` parameter** to the download/install function, consistent with modern web platform design.

We note that for Chrome's implementation of the Web Translation API, aborting the request **does not cancel the actual download**, for privacy-preserving reasons, but rather **stops associated download progress events**.
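
The behavior described above can be modeled with a toy example in which aborting suppresses progress callbacks while the work itself still completes. This is only a simplified, synchronous model: `downloadWithProgress` is a hypothetical helper, not Chrome's implementation or a proposed API; only `AbortController` and `AbortSignal` are real platform interfaces.

```typescript
// Hypothetical helper: a "download" of `totalChunks` chunks whose progress
// reporting stops after abort, while the download itself runs to completion.
function downloadWithProgress(
  totalChunks: number,
  onProgress: (loaded: number) => void,
  signal?: AbortSignal,
): number {
  let loaded = 0;
  for (let i = 0; i < totalChunks; i++) {
    loaded += 1; // the download continues regardless of abort
    if (!signal?.aborted) {
      onProgress(loaded); // only the progress reporting is suppressed
    }
  }
  return loaded;
}

// Usage: abort after the second progress event.
const controller = new AbortController();
const seen: number[] = [];
const total = downloadWithProgress(
  5,
  (loaded) => {
    seen.push(loaded);
    if (loaded === 2) controller.abort();
  },
  controller.signal,
);
// total is 5 (all chunks fetched); seen is [1, 2] (no events after abort)
```

Because the page observes the same thing whether or not the download finishes, aborting in this model does not reveal whether the language pack ended up on the device.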

--
Reply to this email directly or view it on GitHub:
https://github.com/w3ctag/design-reviews/issues/1038#issuecomment-2815982645

Received on Friday, 18 April 2025 18:28:11 UTC