- From: James Salsman <jsalsman@gmail.com>
- Date: Thu, 13 Jun 2013 14:49:58 +0800
- To: Kazuyuki Ashimura <ashimura@w3.org>
- Cc: www-voice@w3.org
- Message-ID: <CAD4=uZbOJ+KD9JSQrb9W6er52MdbEw1xB7Q=SoSOMFAk0Byx1Q@mail.gmail.com>
Hi Kazuyuki,

What is the status of my request to include phoneme time segmentation and phoneme confidence scores in SRGS? I am unable to find any provision for them in http://www.w3.org/TR/speech-grammar/ , which is still dated 2004. (Cc to www-voice in case anyone there knows.)

Best regards,
James Salsman

---------- Forwarded message ----------
From: James Salsman <jsalsman@gmail.com>
Date: Thu, May 27, 2010 at 5:49 AM
Subject: Re: [whatwg] Speech input element
To: Bjorn Bringert <bringert@google.com>
Cc: whatwg@lists.whatwg.org, Kazuyuki Ashimura <ashimura@w3.org>

Bjorn,

Thank you for your reply:

>> On Mon, May 17, 2010 at 8:55 AM, Bjorn Bringert <bringert@google.com> wrote:
>>>
>>>> - What exactly are grammars builtin:dictation and builtin:search?
>>>
>>> They are intended to be implementation-dependent large language
>>> models, for dictation (e.g. e-mail writing) and search queries
>>> respectively. I've tried to clarify them a bit in the spec now. There
>>> should perhaps be more of these (e.g. builtin:address), maybe with
>>> some optional, mapping to builtin:dictation if not available.

Is the difference that search is open vocabulary and dictation is not?

>> Bjorn, are you interested in including speech recognition support for
>> pronunciation assessment such as is done by http://englishcentral.com/ ,
>> http://www.scilearn.com/products/reading-assistant/ ,
>> http://www.eyespeakenglish.com/ , http://wizworldonline.com/ , and
>> http://www.8dworld.com/en/home.html ?
>>
>> Those would require different sorts of language models and grammars,
>> such as those described in
>> http://www.springerlink.com/content/l0385t6v425j65h7/
>>
>> Please let me know your thoughts.
>
> I don't have SpringerLink access, so I couldn't read that article. As
> far as I could tell from the abstract, they use phoneme-level speech
> recognition and then calculate the edit distance to the "correct"
> phoneme sequences. Do you have a concrete proposal for how this could
> be supported?

I've attached my most recent submission to the Conversational Systems Workshop. In the second paragraph of the Proposal section, I point out enhancements to the W3C Recommendations you mentioned that would support pronunciation assessment.

I am still not sure whether the three W3C speech recognition Recommendations make any provision for multi-pass recognition, for example on a sub-segment of the originally recorded speech. I do not know whether you use that for dictation or for open-vocabulary search recognition, but perhaps you might.

I noticed you included two methods in http://docs.google.com/Doc?docid=0AaYxrITemjbxZGNmZzc5cHpfM2Ryajc5Zmhx&hl=en#Methods_8485730991680402_9459536422838603 -- do you think there should be general accessors for the underlying speech audio samples, as well as the recognized phoneme string, per-phoneme (and perhaps per-word) acoustic confidence scores, and beginning and end time points for the recognized phonemes and lexemes in the enhanced recognition results contemplated in the attached position paper?

Best regards,
James Salsman
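For concreteness, the sketch below shows one possible shape for the kind of enhanced recognition result and phoneme edit-distance scoring discussed in the message above. The interface and function names (PhonemeResult, EnhancedRecognitionResult, phonemeEditDistance), the field set, and the 0..1 confidence range are illustrative assumptions only; they are not part of SRGS, EMMA, or any W3C draft, and they are not the proposal in the attached position paper.

```typescript
// Illustrative only: these names and fields are assumptions, not any
// W3C speech API. They mirror the accessors asked about in the email:
// per-phoneme confidences, time points, and the underlying audio.

interface PhonemeResult {
  phoneme: string;      // e.g. an IPA or ARPAbet symbol
  confidence: number;   // acoustic confidence score, assumed 0..1
  startTime: number;    // seconds from the start of the utterance
  endTime: number;      // seconds from the start of the utterance
}

interface EnhancedRecognitionResult {
  transcript: string;
  phonemes: PhonemeResult[];
  audio?: Float32Array; // underlying speech samples, if exposed
}

// Levenshtein edit distance between a recognized phoneme sequence and
// the expected ("correct") sequence, as described in the quoted reply.
function phonemeEditDistance(recognized: string[], expected: string[]): number {
  const m = recognized.length;
  const n = expected.length;
  const d: number[][] = Array.from({ length: m + 1 }, () => new Array(n + 1).fill(0));
  for (let i = 0; i <= m; i++) d[i][0] = i;
  for (let j = 0; j <= n; j++) d[0][j] = j;
  for (let i = 1; i <= m; i++) {
    for (let j = 1; j <= n; j++) {
      const cost = recognized[i - 1] === expected[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution
      );
    }
  }
  return d[m][n];
}

// Example: one substituted vowel in "tomato" gives a distance of 1.
const distance = phonemeEditDistance(
  ["t", "ah", "m", "ey", "t", "ow"],
  ["t", "ah", "m", "aa", "t", "ow"]
);
console.log(distance); // 1
```

A per-utterance pronunciation score could then be derived by, for example, normalizing the distance by the length of the expected sequence; the attached position paper should be consulted for the actual proposal.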
Attachments
- application/pdf attachment: convAppsSalsman-final.pdf
Received on Thursday, 13 June 2013 06:50:28 UTC