- From: James Salsman <james@bovik.org>
- Date: Mon, 4 Aug 1997 19:40:15 -0700 (PDT)
- To: www-html@w3.org
The most important reason that I support the HTML forms
extension <input type=audio> as an RFC-1867-style MIME
upload of audio microphone input is the emergence of highly
useful educational software utilizing the student's spoken
input for question answering, pronunciation scoring and
other language-learning applications. For example, Dr.
Jack Mostow of Carnegie Mellon states, "The pilot test of
Project LISTEN's Tutor at Fort Pitt Elementary School
surpassed our wildest hopes ... [In] pre- and post-tests
administered by school reading specialists, six third
graders who started almost three years below grade level
advanced their reading skills an average of 2.3 years in
under eight months." For more info: http://www.cs.cmu.edu/~listen/
The <input type=audio> extension had been proposed to source
input from a microphone into an audio buffer for RFC-1867-style
uploading as per <input type=file accept="audio/*"> with the
default MIME type "audio/basic". Someone else proposed the
less useful <input type=scribble>, a two-dimensional analogue
which would allow the uploading of an arbitrary user-generated
bitmap. I was under the impression that at least the former
had been officially proposed within the past year. Are there any
objections to complete adoption as a standard?
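
For concreteness, here is a rough sketch of the request body a browser
might send for such an upload, assuming it follows the RFC 1867
multipart/form-data layout. The field name "recording", the filename
"mic.au", the boundary string, and the helper name are all hypothetical
and not part of any proposal.

```python
def build_audio_upload(field_name, filename, audio_bytes, boundary):
    """Return (Content-Type header value, body) for a one-part upload."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: audio/basic\r\n"
        "\r\n"
    ).encode("ascii")
    # The closing delimiter is the boundary followed by two hyphens.
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return f"multipart/form-data; boundary={boundary}", head + audio_bytes + tail


# Hypothetical values: eight placeholder bytes stand in for microphone samples.
content_type, body = build_audio_upload(
    "recording", "mic.au", b"\xff" * 8, "XboundaryX"
)
print(content_type)
print(body.decode("latin-1"))
```

A real browser would choose its own boundary and would have to make sure
the boundary does not occur inside the audio data.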
As an aside from personal experience with consonant recognition,
I should point out that the MIME type "audio/basic" is
insufficient for most speech recognition applications unless
advanced noise-reduction pre-processing is applied. In general,
please make sure that all the audio signals used for phonology
have a digital input of at least 16 bits of sample (or A-to-D
converter) resolution at 11,025 samples/sec, preferably 22,050.
It is important to use as much of the available AC range as
possible (perhaps using microphone preamplifiers) without
introducing more noise than automatic gain control algorithms
would. A good source for these issues is: _Objective Measures
of Speech Quality_, by S. R. Quackenbush, T. P. Barnwell, and
M. A. Clements (Englewood Cliffs, N.J.: Prentice Hall, 1988).
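
To put rough numbers on the comparison, the sketch below works out the
raw bit rate and Nyquist bandwidth of each format. It assumes the
standard definition of "audio/basic" (single-channel 8-bit mu-law at
8000 samples/sec); the function name and dictionary keys are
illustrative only.

```python
def pcm_profile(sample_rate_hz, bits_per_sample, channels=1):
    """Raw bit rate and highest representable frequency for a PCM stream."""
    return {
        "bit_rate_bps": sample_rate_hz * bits_per_sample * channels,
        "nyquist_hz": sample_rate_hz / 2,
        "code_values": 2 ** bits_per_sample,
    }


formats = {
    "audio/basic (8-bit mu-law, 8000 Hz)": pcm_profile(8000, 8),
    "16-bit linear, 11025 Hz": pcm_profile(11025, 16),
    "16-bit linear, 22050 Hz": pcm_profile(22050, 16),
}
for name, profile in formats.items():
    print(name, profile)
```

The mu-law code values are logarithmically spaced, so the count of code
values understates the dynamic range of "audio/basic" somewhat, but the
4000 Hz bandwidth limit applies regardless.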
Sincerely,
:James Salsman
P.S. This formula is useful for phonological coding, but
I have not yet registered a MIME type for it:
CDFT( ln( (|DFT(frame)| * mel_scale_spectrum) +1e-24 )
/
ln( mean(|DFT(frame)| * mel_scale_spectrum) +1e-24 )
)
; polar complex speech cepstrum with power and voicing
-->
{ spectral mean power as 8-bit log (e.g., N where mu = 1.06**(-N)),
spectral variance power (or standard deviation)
as 8-bit fixed-point normal (or standard) score,
<quefrency: gamnitude, saphe>
| quefrency in 1..N or {1, 2, 3-4, 5-8, 9-16 ...} or other partial CFFT
}
where
DFT = Fast (Discrete) Fourier Transform,
|DFT()|: FFT with real output (length: 1024 or 512),
CDFT: complex FFT (length: 512 or 256, respectively);
frame = trapezoidal or Hamming window on observed speech signal,
monophonic 16-bit waveform sampled at 11025 Hz,
overlapping once (i.e., dual coverage);
mel_scale_spectrum[f:Hz] = ( 1.65 - 5.4e-6*f*ln(f) - 4.39/ln(f) ) * C;
C positive scalar.
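
The core of the formula above can be sketched as follows for a single
frame. This is only one reading of it, under several assumptions that
are not in the formula itself: the mel weight is clamped to zero where
the expression is undefined or negative (at or below 1 Hz, and below
roughly 14 Hz), a naive O(n^2) DFT stands in for the FFT so that the
example needs no libraries, C defaults to 1, and the 8-bit mean and
variance terms and the frame overlap are left out. All function names
are hypothetical.

```python
import cmath
import math

SAMPLE_RATE_HZ = 11025.0  # sampling rate given in the definitions above
EPSILON = 1e-24           # floor added inside both logarithms


def mel_weight(freq_hz, c=1.0):
    """mel_scale_spectrum[f]; the clamping to zero is an assumption."""
    if freq_hz <= 1.0:
        return 0.0  # ln(f) is undefined at 0 Hz and zero at 1 Hz
    log_f = math.log(freq_hz)
    weight = (1.65 - 5.4e-6 * freq_hz * log_f - 4.39 / log_f) * c
    return max(weight, 0.0)  # the expression is negative at very low f


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft(values):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * i / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstral_frame(samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (gamnitude, saphe) pairs, one per quefrency index."""
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hamming(n))]
    spectrum = dft(windowed)
    half = n // 2  # keep the non-redundant half of the magnitude spectrum
    weighted = [
        abs(spectrum[k]) * mel_weight(k * sample_rate_hz / n) for k in range(half)
    ]
    denominator = math.log(sum(weighted) / half + EPSILON)
    if denominator == 0.0:
        raise ValueError("mean weighted magnitude is 1; the ratio is undefined")
    normalized = [math.log(w + EPSILON) / denominator for w in weighted]
    # Polar form of the complex transform: magnitude and phase per quefrency.
    return [cmath.polar(value) for value in dft(normalized)]


# Short 64-sample frame of a 1000 Hz tone, kept small because the DFT is naive.
tone = [10000.0 * math.sin(2.0 * math.pi * 1000.0 * i / SAMPLE_RATE_HZ) for i in range(64)]
cepstrum = cepstral_frame(tone)
print(len(cepstrum), cepstrum[:3])
```

With the frame lengths named above (1024 or 512 samples) a real FFT
routine would be the practical choice; the naive transform is used here
only to keep the sketch self-contained.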
Sincere regards,
:James Salsman
Received on Monday, 4 August 1997 22:40:42 UTC