<input type=audio>

James Salsman (james@bovik.org)
Mon, 4 Aug 1997 19:40:15 -0700 (PDT)


Date: Mon, 4 Aug 1997 19:40:15 -0700 (PDT)
Message-Id: <199708050240.TAA09293@proxy3.ba.best.com>
To: www-html@w3.org
From: james@bovik.org (James Salsman)
Subject: <input type=audio>

The most important reason that I support the HTML forms 
extension <input type=audio> as an RFC-1867-style MIME 
upload of audio microphone input is the emergence of highly 
useful educational software utilizing the student's spoken 
input for question answering, pronunciation scoring and 
other language-learning applications.  For example, Dr. 
Jack Mostow of Carnegie Mellon states, "The pilot test of 
Project LISTEN's Tutor at Fort Pitt Elementary School 
surpassed our wildest hopes ... [In] pre- and post-tests 
administered by school reading specialists, six third 
graders who started almost three years below grade level 
advanced their reading skills an average of 2.3 years in 
under eight months."  For more info:  http://www.cs.cmu.edu/~listen/ 

The <input type=audio> extension had been proposed to source 
input from a microphone into an audio buffer for RFC-1867-style 
uploading as per <input type=file accept="audio/*"> with the 
default MIME type "audio/basic".  Someone else proposed the 
less useful <input type=scribble>, a two-dimensional analogue 
which would allow the uploading of an arbitrary user-generated 
bitmap.  I was under the impression that at least the former 
had been officially proposed within the past year.  Are there any 
objections to complete adoption as a standard?
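
To make the upload side concrete, here is a minimal sketch of the 
RFC-1867-style request body a browser might build from the 
microphone buffer.  The function name, field name, and file name 
are hypothetical illustrations, not part of any proposal; only the 
multipart/form-data layout and the default type "audio/basic" come 
from the text above.

```python
import uuid


def build_audio_upload(field_name, audio_bytes,
                       filename="recording.au",      # hypothetical file name
                       content_type="audio/basic"):  # default type named above
    """Return (headers, body) for a one-part multipart/form-data upload."""
    boundary = uuid.uuid4().hex  # random token, unlikely to occur in the audio data
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n"
        "\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    body = head + audio_bytes + tail
    headers = {
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "Content-Length": str(len(body)),
    }
    return headers, body


# Placeholder bytes stand in for one second of recorded audio.
headers, body = build_audio_upload("speech", b"\xff" * 8000)
```

The server would then see the recording exactly as it sees a file 
chosen through <input type=file accept="audio/*">, so existing 
form handlers would not need to change.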

As an aside from personal experience with consonant recognition, 
I should point out that the MIME type "audio/basic" is 
insufficient for most speech recognition applications unless 
advanced noise-reduction pre-processing is applied.  In general, 
please make sure that all the audio signals used for phonology 
have a digital input of at least 16 bits of sample (or A-to-D 
converter) resolution at 11,025 samples/sec, preferably 22,050.  
It is important to use as much of the available AC range as 
possible (perhaps using microphone preamplifiers) without 
introducing more noise than automatic gain control algorithms 
would.  A good source for these issues is:  _Objective Measures 
of Speech Quality_, by S. R. Quackenbush, T. P. Barnwell, and 
M. A. Clements (Englewood Cliffs, N.J.: Prentice Hall, 1988).
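
As a rough illustration of the numbers above, the following sketch 
compares raw data rates and measures how much of the 16-bit range 
a recording actually uses.  It assumes "audio/basic" as registered 
(8-bit mu-law, 8000 samples/sec, one channel), which is a fact 
about the MIME type rather than something stated above; the 
function names are my own.

```python
import math

FULL_SCALE_16_BIT = 32768  # magnitude of the most negative 16-bit sample


def raw_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Uncompressed data rate of a PCM-style stream, in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


def peak_dbfs(samples):
    """Peak level of signed 16-bit samples in dB relative to full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return float("-inf")  # silence uses none of the available range
    return 20.0 * math.log10(peak / FULL_SCALE_16_BIT)


print(raw_bytes_per_second(8000, 8))    # audio/basic
print(raw_bytes_per_second(11025, 16))  # minimum suggested above
print(raw_bytes_per_second(22050, 16))  # preferred rate
print(peak_dbfs([16384, -16384]))       # signal peaking at half of full scale
```

A peak well below 0 dB suggests the input gain is too low, and a 
peak pinned at 0 dB suggests clipping; either way, gain is better 
fixed before the A-to-D converter than after it.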

Sincerely,
:James Salsman

P.S.  This formula is useful for phonological coding, but 
I have not yet registered a MIME type for it:

CDFT( ln( (|DFT(frame)| * mel_scale_spectrum) +1e-24 ) 
      / 
      ln( mean(|DFT(frame)| * mel_scale_spectrum) +1e-24 ) 
) 
 ; polar complex speech cepstrum with power and voicing 
--> 
 { spectral mean power as 8-bit log (e.g., N where mu = 1.06**(-N) ), 
   spectral variance power (or standard deviation) 
     as 8-bit fixed-point normal (or standard) score, 
  <quefrency: gamnitude, saphe> 
  | quefrency in 1..N or {1, 2, 3-4, 5-8, 9-16 ...} or other partial CFFT 
} 
where 
 DFT = Fast (Discrete) Fourier Transform, 
     |DFT()|: FFT with real output (length:  1024 or 512), 
     CDFT: complex FFT (length:  512 or 256, respectively); 
 frame = trapezoidal or Hamming window on observed speech signal, 
     monophonic 16-bit waveform sampled at 11025 Hz, 
     overlapping once (i.e., dual coverage); 
 mel_scale_spectrum[f:Hz] = ( 1.65 - 5.4e-6*f*ln(f) - 4.39/ln(f) ) * C; 
     C positive scalar. 
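
For anyone who wants to experiment, here is one possible reading 
of the formula above as a sketch, not a reference implementation.  
Several choices are assumptions and are marked as such in the 
comments:  the naive DFT stands in for an FFT, only the 
non-redundant half of the magnitude spectrum is kept, bins where 
the weighting function is undefined or negative (roughly 14 Hz and 
below) are set to zero, and "overlapping once" is taken to mean a 
hop of half a frame.  The 8-bit quantization of the mean and 
variance is left out.

```python
import cmath
import math

SAMPLE_RATE_HZ = 11025  # sampling rate given in the message
EPSILON = 1e-24         # floor added before each logarithm, as in the formula


def mel_scale_spectrum(f_hz, c=1.0):
    """Weight at frequency f_hz (Hz); c is the positive scalar C."""
    if f_hz <= 1.0:
        return 0.0  # assumption: ln(f) is zero or undefined here, so drop the bin
    log_f = math.log(f_hz)
    weight = (1.65 - 5.4e-6 * f_hz * log_f - 4.39 / log_f) * c
    return max(weight, 0.0)  # assumption: clamp the negative low-frequency values


def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft(values):
    """Naive O(n^2) discrete Fourier transform; stands in for an FFT."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def overlapping_frames(signal, size):
    """Split signal into frames with a hop of half a frame (dual coverage)."""
    hop = size // 2
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def cepstrum_frame(samples, sample_rate=SAMPLE_RATE_HZ, c=1.0):
    """Return (gamnitude, saphe) pairs, one per quefrency bin, for one frame."""
    n = len(samples)
    windowed = [s * w for s, w in zip(samples, hamming(n))]
    half_spectrum = dft(windowed)[: n // 2]  # assumption: non-redundant half only
    weighted = [
        abs(x) * mel_scale_spectrum(k * sample_rate / n, c)
        for k, x in enumerate(half_spectrum)
    ]
    log_mean = math.log(sum(weighted) / len(weighted) + EPSILON)
    if log_mean == 0.0:
        raise ValueError("mean weighted magnitude is 1; normalization is undefined")
    normalized = [math.log(w + EPSILON) / log_mean for w in weighted]
    return [cmath.polar(z) for z in dft(normalized)]  # polar form of the CDFT


# Short synthetic tone so the naive DFT stays fast; real frames would be 1024 or 512.
tone = [10000.0 * math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE_HZ)
        for t in range(128)]
coded = [cepstrum_frame(frame) for frame in overlapping_frames(tone, 64)]
```

Under this reading a 1024-sample frame gives a 512-point CDFT and 
a 512-sample frame gives a 256-point one, which matches the 
lengths listed above.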

Sincere regards, 
:James Salsman