- From: James Salsman <james@bovik.org>
- Date: Mon, 4 Aug 1997 19:40:15 -0700 (PDT)
- To: www-html@w3.org
The most important reason that I support the HTML forms extension <input type=audio>, as an RFC-1867-style MIME upload of audio microphone input, is the emergence of highly useful educational software that uses the student's spoken input for question answering, pronunciation scoring, and other language-learning applications. For example, Dr. Jack Mostow of Carnegie Mellon states, "The pilot test of Project LISTEN's Tutor at Fort Pitt Elementary School surpassed our wildest hopes ... [In] pre- and post-tests administered by school reading specialists, six third graders who started almost three years below grade level advanced their reading skills an average of 2.3 years in under eight months." For more info: http://www.cs.cmu.edu/~listen/

The <input type=audio> extension was proposed to source input from a microphone into an audio buffer for RFC-1867-style uploading, as per <input type=file accept="audio/*">, with the default MIME type "audio/basic". Someone else proposed the less useful <input type=scribble>, a two-dimensional analogue which would allow the uploading of an arbitrary user-generated bitmap. I was under the impression that at least the former had been officially proposed within the past year. Are there any objections to its complete adoption as a standard?

As an aside from personal experience with consonant recognition, I should point out that the MIME type "audio/basic" (8-bit mu-law, 8000 samples/sec) is insufficient for most speech recognition applications unless advanced noise-reduction pre-processing is applied. In general, please make sure that all the audio signals used for phonology have a digital input of at least 16 bits of sample (A-to-D converter) resolution at 11,025 samples/sec, preferably 22,050. It is important to use as much of the available AC range as possible (perhaps using microphone preamplifiers) without introducing more noise than automatic gain control algorithms would.

A good source for these issues is: _Objective Measures of Speech Quality_, by S. R.
Quackenbush, T. P. Barnwell, and M. A. Clements (Englewood Cliffs, N.J.: Prentice Hall, 1988).

Sincerely,
:James Salsman

P.S. This formula is useful for phonological coding, but I have not yet registered a MIME type for it:

    CDFT( ln( (|DFT(frame)| * mel_scale_spectrum) + 1e-24 )
          / ln( mean(|DFT(frame)| * mel_scale_spectrum) + 1e-24 ) )
      ; polar complex speech cepstrum with power and voicing

    --> { spectral mean power as 8-bit log
            (e.g., N where mu = 1.06**(-N)),
          spectral variance power (or standard deviation)
            as 8-bit fixed-point normal (or standard) score,
          <quefrency: gamnitude, saphe>
            | quefrency in 1..N or {1, 2, 3-4, 5-8, 9-16 ...}
              or other partial CFFT }

where
    DFT = Fast (Discrete) Fourier Transform;
    |DFT()| = FFT with real output (length: 1024 or 512);
    CDFT = complex FFT (length: 512 or 256, respectively);
    frame = trapezoidal or Hamming window on the observed speech signal, a monophonic 16-bit waveform sampled at 11025 Hz, overlapping once (i.e., dual coverage);
    mel_scale_spectrum[f:Hz] = ( 1.65 - 5.4e-6*f*ln(f) - 4.39/ln(f) ) * C, with C a positive scalar.

Sincere regards,
:James Salsman
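The formula in the P.S. can be sketched in Python roughly as follows. This is only one reading of it, not a reference implementation. The function names (fft, mel_weight, split_frames, cepstrum) are hypothetical, C is taken as 1, and the Hamming window is chosen over the trapezoidal one. Two further assumptions are made: "FFT with real output" is read as the first half of the magnitude spectrum (1024 samples giving 512 bins for the 512-point complex FFT), and the weight is clamped to zero below roughly 14 Hz, where the mel_scale_spectrum expression is negative or undefined. The 8-bit quantization of mean power and variance is not shown.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def mel_weight(f_hz, c=1.0):
    """mel_scale_spectrum[f] from the post, clamped at zero (assumption)."""
    if f_hz <= 1.0:  # ln(f) is undefined at 0 and zero at 1 Hz
        return 0.0
    lf = math.log(f_hz)
    return max(0.0, (1.65 - 5.4e-6 * f_hz * lf - 4.39 / lf) * c)


def split_frames(samples, n=1024):
    """Frames overlapping once (hop of n/2), i.e. dual coverage."""
    hop = n // 2
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, hop)]


def cepstrum(frame, rate=11025, c=1.0):
    """Return (gamnitude, saphe) pairs, one per quefrency bin."""
    n = len(frame)
    # Hamming window on the observed frame.
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = fft(windowed)
    half = n // 2
    # |DFT(frame)| * mel_scale_spectrum over the non-redundant bins.
    weighted = [abs(spectrum[k]) * mel_weight(k * rate / n, c)
                for k in range(half)]
    # Normalize each log bin by the log of the mean weighted magnitude.
    # Unguarded: fails if the mean happens to be exactly 1.
    denom = math.log(sum(weighted) / half + 1e-24)
    log_spec = [math.log(w + 1e-24) / denom for w in weighted]
    # Complex FFT of the normalized log spectrum, in polar form.
    return [cmath.polar(z) for z in fft(log_spec)]
```

A recording would be processed as `[cepstrum(f) for f in split_frames(samples)]`, where `samples` is a list of 16-bit sample values at 11025 Hz. The pure-Python FFT keeps the sketch dependency-free; a real application would use an optimized FFT library.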
Received on Monday, 4 August 1997 22:40:42 UTC