Re: [MINUTES] VC API 2025-03-25 from Manu Sporny on 2025-03-27 (public-credentials@w3.org from March 2025)

From: Manu Sporny <msporny@digitalbazaar.com>
Date: Thu, 27 Mar 2025 09:31:21 -0400
To: Steven Rowat <steven_rowat@sunshine.net>
Cc: "public-credentials@w3.org" <public-credentials@w3.org>
Message-ID: <CAMBN2CS45yiKL84_qvN-NPQi285YeOFnOh61nR0ks-R+2qeVgQ@mail.gmail.com>

On Wed, Mar 26, 2025 at 9:56 PM Steven Rowat <steven_rowat@sunshine.net> wrote:
> 1) Will you be able to 'flip that switch' for the CCG group transcripts also, to have this cleaner version?

Yes.

> 2) If the cleaner version is being produced by AI (I'm just guessing that it is), is there a danger, or have you noticed, any creep of hallucinations? In other words, is the transcript possibly less accurate in some way with the new system?

Both versions are being produced by AI; one of them is just doing it
with less information than the other version.

No, I don't think the new system will be less accurate, and here's why:

Both systems use variants of Google's LLMs to do voice recognition and
transcription.

The system we've been using with Jitsi "streams" this information to
the LLM, and so has a limited LLM context window. It can "listen" for
about 15-30 seconds before it's forced to spit out something that goes
in the log. This means it has less context to try and figure out what
the person is saying (it can't easily go back and remove "um"s and
"uh"s).

The system we're switching to has a much larger context window. How
large is unknown because Google doesn't usually announce these sorts
of trade secrets for their products, unless they decide that it'll be
good PR to do so. If their latest product announcements are to be
believed, the context window can be as large as the entire recording,
which (theoretically) makes it far more accurate when selecting what a
person said. It does still make mistakes here and there, but
hallucinations are far fewer than they are with a streaming system
with a 30-second-ish context window.

I will also note that we hit the point, a year or two ago, where
auto-transcription of the sort we're using in the new system exceeds
the capabilities of 95% of the people that would scribe a
conversation.

The only thing we're missing is that a human has two things the new
system doesn't have: 1) domain knowledge of what we're doing in the
call -- special acronyms like DID, VC, W3C and IETF or names of people
speaking, and 2) the ability to summarize (or censor) what was said in
realtime to make it easier for others to understand or to protect the
speaker from what they're saying (e.g. "The idiocy of Jon's proposal
cannot be overstated, only authoritarian regimes would favor such a
proposal!" becomes "I do not find the proposal acceptable.").

All that said, I think things will improve in the ways that you are
hoping for them to improve, Steven. I'll also point others that have
questioned the value of having transcripts or recordings at all over
the years to Steven -- there are people that "participate" in the
CCG by listening to our calls on their commutes or reading the
transcripts after the fact. Again, it's why we put effort into these
systems -- they help people benefit from the work done in the
community (even when they do so silently).

-- manu

--
Manu Sporny - https://www.linkedin.com/in/manusporny/
Founder/CEO - Digital Bazaar, Inc.
https://www.digitalbazaar.com/

Received on Thursday, 27 March 2025 13:32:02 UTC