- From: Manu Sporny <msporny@digitalbazaar.com>
- Date: Thu, 27 Mar 2025 09:31:21 -0400
- To: Steven Rowat <steven_rowat@sunshine.net>
- Cc: "public-credentials@w3.org" <public-credentials@w3.org>
On Wed, Mar 26, 2025 at 9:56 PM Steven Rowat <steven_rowat@sunshine.net> wrote:

> 1) Will you be able to 'flip that switch' for the CCG group
> transcripts also, to have this cleaner version?

Yes.

> 2) If the cleaner version is being produced by AI (I'm just guessing
> that it is), is there a danger, or have you noticed, any creep of
> hallucinations? In other words, is the transcript possibly less
> accurate in some way with the new system?

Both versions are being produced by AI; one of them is just doing it with less information than the other. No, I don't think the new system will be less accurate, and here's why:

Both systems use variants of Google's LLMs to do voice recognition and transcription. The system we've been using with Jitsi "streams" the audio to the LLM, and so has a limited LLM context window. It can "listen" for about 15-30 seconds before it's forced to spit out something that goes in the log. This means it has less context to figure out what the person is saying (it can't easily go back and remove "um"s and "uh"s).

The system we're switching to has a much larger context window. How large is unknown, because Google doesn't usually announce these sorts of trade secrets for their products unless they decide it'll be good PR to do so. If their latest product announcements are to be believed, the context window can be as large as the entire recording, which (theoretically) makes it far more accurate at determining what a person said. It still makes mistakes here and there, but hallucinations are far fewer than with a streaming system and its roughly 30-second context window.

I will also note that we hit a point a year or two ago where auto-transcription of the sort we're using in the new system exceeds the capabilities of 95% of the people who would scribe a conversation. What we're still missing is two things a human scribe has that the new system doesn't: 1) domain knowledge of what we're doing on the call -- special acronyms like DID, VC, W3C, and IETF, or the names of the people speaking -- and 2) the ability to summarize (or censor) what was said in real time, to make it easier for others to understand or to protect the speaker from what they're saying (e.g., "The idiocy of Jon's proposal cannot be overstated; only authoritarian regimes would favor such a proposal!" becomes "I do not find the proposal acceptable.").

All that said, I think things will improve in the ways that you are hoping for, Steven.

I'll also point out, to others who have questioned the value of having transcripts or recordings at all over the years, that there are people who "participate" in the CCG by listening to our calls on their commutes or reading the transcripts after the fact. Again, that's why we put effort into these systems -- they help people benefit from the work done in the community (even when they do so silently).

-- manu

--
Manu Sporny - https://www.linkedin.com/in/manusporny/
Founder/CEO - Digital Bazaar, Inc.
https://www.digitalbazaar.com/
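To make the context-window trade-off described above concrete, here is a minimal sketch of the two architectures. This is not Google's actual API: `transcribe` is a hypothetical stand-in for any LLM-based speech-to-text call, and the 30-second chunk size is only the rough figure mentioned in the email.

```python
CHUNK_SECONDS = 30  # rough streaming window from the email; not a real product setting


def transcribe(audio: bytes) -> str:
    """Hypothetical LLM speech-to-text call (stand-in only, not a real API).

    Returns text for whatever audio the model can see in its context window.
    """
    raise NotImplementedError("stand-in for an LLM transcription call")


def streaming_transcript(audio_chunks: list[bytes]) -> list[str]:
    # Old (Jitsi) approach: each ~30s chunk is transcribed in isolation,
    # so the model can never revisit earlier output to remove "um"s or
    # fix a misheard word once later context arrives.
    return [transcribe(chunk) for chunk in audio_chunks]


def full_context_transcript(audio_chunks: list[bytes]) -> str:
    # New approach: the entire recording fits in one context window, so
    # the model can use the whole conversation to resolve ambiguity.
    whole_recording = b"".join(audio_chunks)
    return transcribe(whole_recording)
```

The accuracy difference falls out of the structure alone: the streaming version makes one irreversible decision per chunk, while the full-context version makes a single decision with everything in view.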
Received on Thursday, 27 March 2025 13:32:02 UTC