- From: James Salsman <jsalsman@gmail.com>
- Date: Thu, 13 Jun 2013 14:49:58 +0800
- To: Kazuyuki Ashimura <ashimura@w3.org>
- Cc: www-voice@w3.org
- Message-ID: <CAD4=uZbOJ+KD9JSQrb9W6er52MdbEw1xB7Q=SoSOMFAk0Byx1Q@mail.gmail.com>
Hi Kazuyuki,

What is the status of my request to include phoneme time segmentation and phoneme confidence scores in SRGS? I am unable to find any provision for them in http://www.w3.org/TR/speech-grammar/ , which is still dated 2004. (Cc to www-voice in case anyone there knows.)

Best regards,
James Salsman

---------- Forwarded message ----------
From: James Salsman <jsalsman@gmail.com>
Date: Thu, May 27, 2010 at 5:49 AM
Subject: Re: [whatwg] Speech input element
To: Bjorn Bringert <bringert@google.com>
Cc: whatwg@lists.whatwg.org, Kazuyuki Ashimura <ashimura@w3.org>

Bjorn,

Thank you for your reply:

>> On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert <bringert@google.com> wrote:
>>>
>>>> - What exactly are grammars builtin:dictation and builtin:search?
>>>
>>> They are intended to be implementation-dependent large language
>>> models, for dictation (e.g. e-mail writing) and search queries
>>> respectively. I've tried to clarify them a bit in the spec now. There
>>> should perhaps be more of these (e.g. builtin:address), maybe with
>>> some optional, mapping to builtin:dictation if not available.

Is the difference that search is open vocabulary and dictation is not?

>> Bjorn, are you interested in including speech recognition support for
>> pronunciation assessment such as is done by http://englishcentral.com/ ,
>> http://www.scilearn.com/products/reading-assistant/ ,
>> http://www.eyespeakenglish.com/ , http://wizworldonline.com/ , and
>> http://www.8dworld.com/en/home.html ?
>>
>> Those would require different sorts of language models and grammars,
>> such as those described in
>> http://www.springerlink.com/content/l0385t6v425j65h7/
>>
>> Please let me know your thoughts.
>
> I don't have SpringerLink access, so I couldn't read that article. As
> far as I could tell from the abstract, they use phoneme-level speech
> recognition and then calculate the edit distance to the "correct"
> phoneme sequences. Do you have a concrete proposal for how this could
> be supported?

I've attached my most recent submission to the Conversational Systems Workshop. In the second paragraph of the Proposal section, I point out enhancements to the W3C Recommendations you mentioned that would support pronunciation assessment.

I am still not sure whether the three W3C speech recognition Recommendations make any provision for multi-pass recognition, for example on a sub-segment of the originally recorded speech. I do not know whether you use that for dictation or for open-vocabulary search recognition, but perhaps you might.

I noticed you included two methods in http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en#Methods_8485730991680402_9459536422838603 -- do you think there should be general accessors for the underlying speech audio samples, as well as the recognized phoneme string, per-phoneme (and perhaps per-word) acoustic confidence scores, and beginning and end time points for the recognized phonemes and lexemes in the enhanced recognition results contemplated in the attached position paper?

Best regards,
James Salsman
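For concreteness, the sketch below shows one possible shape for the kind of enhanced recognition result and phoneme edit-distance scoring discussed in the message above. The interface and function names (PhonemeResult, EnhancedRecognitionResult, phonemeEditDistance), the field set, and the 0..1 confidence range are illustrative assumptions only; they are not part of SRGS, EMMA, or any W3C draft, and they are not the proposal in the attached position paper.

```typescript
// Illustrative only: these names and fields are assumptions, not any
// W3C speech API. They mirror the accessors asked about in the email:
// per-phoneme confidences, time points, and the underlying audio.

interface PhonemeResult {
  phoneme: string;      // e.g. an IPA or ARPAbet symbol
  confidence: number;   // acoustic confidence score, assumed 0..1
  startTime: number;    // seconds from the start of the utterance
  endTime: number;      // seconds from the start of the utterance
}

interface EnhancedRecognitionResult {
  transcript: string;
  phonemes: PhonemeResult[];
  audio?: Float32Array; // underlying speech samples, if exposed
}

// Levenshtein edit distance between a recognized phoneme sequence and
// the expected ("correct") sequence, as described in the quoted reply.
function phonemeEditDistance(recognized: string[], expected: string[]): number {
  const m = recognized.length;
  const n = expected.length;
  const d: number[][] = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = recognized[i - 1] === expected[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
    }
  }
  return d[m][n];
}

// Example: one substituted vowel in "tomato" gives a distance of 1.
const distance = phonemeEditDistance(
  ["t", "ah", "m", "ey", "t", "ow"],
  ["t", "ah", "m", "aa", "t", "ow"]
);
console.log(distance); // 1
```

A per-utterance pronunciation score could then be derived by, for example, normalizing the distance by the length of the expected sequence; the attached position paper should be consulted for the actual proposal.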
Attachments
- application/pdf attachment: convAppsSalsman-final.pdf
Received on Thursday, 13 June 2013 06:50:28 UTC