Form-based Device Input and Upload in HTML

INTERNET-DRAFT                                               J. Salsman 
Suggested filename: <draft-www-device-upload-00.txt>     WebTV Networks 
Expiration date: 15 May 1998                           12 November 1997 
 
               Form-based Device Input and Upload in HTML 
 
Status of this Memo 
 
   This draft extends an experimental protocol for the Internet 
   community.  This draft does not specify an Internet standard of any 
   kind.  Discussion and suggestions for improvement are requested. 
   Distribution of this memo will be unlimited when the W3C approves the 
   HTML 4.0 standard.  Until then, please do not distribute this draft 
   beyond your department. 
 
1.  Abstract and introduction 
 
   Currently, HTML forms allow the producer of the form to request 
   information -- including files of data -- from the operator reading 
   the form.  However, this capability is limited because HTML forms 
   don't provide a way to ask the operator to submit input from 
   arbitrary sources such as audio devices like microphones.  Since 
   input and upload from various devices is a feature that will 
   benefit many applications, this draft proposes an extension to the 
   HTML INPUT TYPE=FILE form element specified in RFC 1867 to allow 
   information providers to express requests for uploads from audio 
   and other devices uniformly.  It also discusses MIME audio data
   types that facilitate useful audio upload responses, security,
   and audio usability and quality; describes a backward
   compatibility strategy that allows new user agents to utilize HTML
   written with earlier proposals for audio input in mind; and
   concludes with motivations.
 
     This document is an Internet-Draft.  Internet-Drafts are working
     documents of the Internet Engineering Task Force (IETF), its
     areas, and its working groups.  Note that other groups may also
     distribute working documents as Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     To view the entire list of current Internet-Drafts, please check
     the "1id-abstracts.txt" listing contained in the Internet-Drafts
     Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
     (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East
     Coast), or ftp.isi.edu (US West Coast).
 
2.  HTML forms with device input file upload submission 
 
   Section 3.1 of RFC 1867 provides for the presentation of an 
   arbitrary "widget" to specify input for file uploads.  When an 
   INPUT tag of type FILE is encountered with a DEVICE attribute, the 
   associated value (such as MICROPHONE, or MIC) might select the use 
   of a widget capable of buffering and editing real-time input (such 
   as speech) instead of entering a file selection mode. 
 
   If an ACCEPT attribute is present in a device file input element,
   the browser might constrain the MIME type of the uploaded data to
   match one of the types in the specified list.  If the value of the
   DEVICE attribute is FILESYSTEM or FILES, then the INPUT element
   might be treated as usual according to RFC 1867, except that the
   subset of files presented to the operator to choose from may be
   constrained by the specified list of MIME types instead of a
   pattern of file names or extensions.
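
   As an illustration only, the file-selection constraint described
   above might be sketched as follows.  The function names and the
   mapping of file names to already-known MIME types are hypothetical;
   this draft does not say how a browser determines a file's type.

```python
from fnmatch import fnmatchcase


def accepts(mime_type, accept_patterns):
    """Return True if mime_type matches any ACCEPT pattern such as 'text/*'."""
    mime_type = mime_type.lower()
    return any(fnmatchcase(mime_type, p.lower()) for p in accept_patterns)


def filter_files(files, accept_patterns):
    """Keep only the files whose MIME type is acceptable.

    files: mapping of file name -> MIME type (assumed to be known already).
    """
    return [name for name, mtype in files.items()
            if accepts(mtype, accept_patterns)]
```

   With ACCEPT="text/*", a file known to be text/plain would be
   offered to the operator and a file known to be image/gif would not,
   whatever their names or extensions.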
 
   Device input has no original filename to supply as the 'filename'
   parameter described in section 3.3 of RFC 1867 for the
   'content-disposition: form-data' and 'content-disposition: file'
   headers.  Those headers might instead be provided with a 'type'
   parameter representing the MIME type of the encoded data, if
   known, and a 'device' parameter with the same value as the DEVICE
   attribute of the associated form input element.  If the device or
   the MIME type(s) specified are unsupported, the value of the
   'device' parameter might be 'unsupported'; if the device is
   unavailable, the value might be 'unavailable'.  If the MIME types
   requested are unsupported, an additional parameter 'alternates'
   might be included with a space-separated list of MIME types of the
   same content-type which may be supported as alternatives for the
   specified device.
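
   A minimal sketch of how a browser might assemble such a header
   follows.  The function name, its arguments, and the quoting of
   parameter values are assumptions made for illustration; this draft
   names the parameters but does not fix their exact syntax.

```python
def device_disposition(name, device, mime_type=None, device_supported=True,
                       device_available=True, type_supported=True,
                       alternates=()):
    """Build a content-disposition header line for a device input field."""
    params = ['name="{}"'.format(name)]

    if not (device_supported and type_supported):
        device_value, ok = "unsupported", False
    elif not device_available:
        device_value, ok = "unavailable", False
    else:
        device_value, ok = device, True

    # The 'type' parameter is only meaningful when data was actually encoded.
    if ok and mime_type:
        params.append('type="{}"'.format(mime_type))
    params.append('device="{}"'.format(device_value))

    # Offer alternatives only when the requested MIME types were the problem.
    if not type_supported and alternates:
        params.append('alternates="{}"'.format(" ".join(alternates)))

    return "content-disposition: form-data; " + "; ".join(params)
```

   For example, device_disposition("SPEECH1", "MIC", "audio/l16")
   yields a header carrying type="audio/l16" and device="MIC".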
 
   There may be significant limitations on the client browser's 
   ability to buffer input for upload.  Browsers might provide an 
   estimate of the default MAXLENGTH available for device input and 
   upload through the HTTP header 'Pragma: DEVICE-MAXLENGTH='BYTES 
   which represents the content-length available to the browser for 
   buffering (see section 14.32 of RFC 2068.) 
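
   A server might read this estimate along the following lines.  This
   is a sketch only: the function name is hypothetical, and it assumes
   the request headers are available as (name, value) pairs and that
   Pragma directives are comma-separated as in RFC 2068.

```python
def device_maxlength(headers):
    """Return the DEVICE-MAXLENGTH value in bytes, or None if absent.

    headers: iterable of (name, value) pairs from the HTTP request.
    """
    for name, value in headers:
        if name.lower() != "pragma":
            continue
        for directive in value.split(","):
            key, sep, val = directive.strip().partition("=")
            val = val.strip()
            if key.strip().upper() == "DEVICE-MAXLENGTH" and sep and val.isdigit():
                return int(val)
    return None
```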
 
   Furthermore, the VALUE attribute may be used to disambiguate
   between multiple similar devices when more than one is present.
 
   If real time events, such as those described and proposed by 
   Gregory S. Aist in "A General Architecture for a Real-Time 
   Discourse Agent and a Case Study in Computerized Oral Reading 
   Tutoring" (Carnegie Mellon University Computational Linguistics 
   Program, 6 December 1996), are required, then the Real-time 
   Transport Protocol (RTP, currently RFC 1889) should be used 
   instead.  Because of security concerns discussed in section 3 
   below, HTML scripts might not be able to invoke a form submission 
   when the form involves any kind of file upload without explicit 
   instructions from the session operator to the contrary. 
 
2.1.  Examples 
 
     <FORM ENCTYPE="multipart/form-data" METHOD=POST ACTION="_URL_"> 
       Say something:  <INPUT NAME=SPEECH1 TYPE=FILE DEVICE=MIC> 
       <INPUT TYPE=SUBMIT VALUE="Send Speech"> 
     </FORM> 
 
   In this simple form, the HTML author has requested the upload of 
   sampled microphone input from the operator upon form submission. 
 
     <INPUT NAME=SPEECH2 TYPE=FILE DEVICE=MICROPHONE 
       ACCEPT="audio/l16 ;rate=11025 ;channels=1 audio/x-cepstral-voc"> 
 
   Here MICROPHONE is spelled out rather than abbreviated as MIC.  The
   author of the HTML has requested that the data input from the
   microphone be encoded as either
   the MIME type Audio/L16 -- sixteen bit signed linear audio samples 
   (most-significant byte first) -- as specified in RFC 1890 section 
   4.4.8, with a single (monaural) channel and a sample rate of 11,025 
   samples per second, or an unspecified extended MIME Audio type named 
   'x-cepstral-voc'. 
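
   One possible reading of the ACCEPT value in this example is
   sketched below.  It assumes that entries are separated by white
   space and that a token beginning with ';' is a parameter of the
   preceding type; this draft does not define the grammar, and the
   function name is hypothetical.

```python
def parse_accept(value):
    """Split an ACCEPT value into (mime_type, parameters) pairs."""
    entries = []
    for token in value.split():
        parts = [p for p in token.split(";") if p]
        if not token.startswith(";"):
            # A new MIME type, possibly with parameters attached directly.
            entries.append((parts.pop(0).lower(), {}))
        if not entries:
            continue  # a parameter with no preceding type is ignored
        for part in parts:
            key, _, val = part.partition("=")
            entries[-1][1][key.lower()] = val
    return entries
```

   Under these assumptions the value above yields audio/l16 with the
   parameters rate=11025 and channels=1, followed by
   audio/x-cepstral-voc with no parameters.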
 
     <INPUT NAME=FILE1 TYPE=FILE DEVICE=FILES ACCEPT="text/*"> 
 
   Here the form element may be used to upload a file as usual, except 
   that the files to select from might be constrained to text files, 
   without explicit regard of their filename or extensions. 
 
     <INPUT NAME=PICTURE1 TYPE=FILE DEVICE=CAMERA VALUE=2> 
 
   The final example shows how these extensions may be used to request 
   input from other kinds of devices, such as the second of two or 
   more cameras connected to the system running the browser. 
 
3.  Security considerations 
 
   Browser operators may not want to send their files, recordings, 
   pictures, video, or other device inputs to arbitrary sites without 
   their explicit permission and direction.  Therefore, browser 
   authors are encouraged to disallow the submission of forms which 
   include any kind of file upload by any means other than the 
   standard HTML operator-controlled buttons for form submission 
   without explicit instruction from the session operator to the 
   contrary.  Accordingly, the SIZE parameter, document style sheets, 
   and document layers may be prevented from obscuring any kind of 
   file upload widget, especially those capable of accepting a default 
   filename.  Finally, just as the operator may take direct action to 
   initiate, terminate, review and edit recording as described in the 
   next section, browser authors are encouraged to prevent HTML 
   scripts from taking those and similar actions, unless for example 
   the operator has specifically enabled such script actions with a 
   security option.  Even then, such preferences might be specified by 
   the operator to reset after an interval or at the end of the 
   session.  Furthermore, explicit information might be provided to 
   ensure that the operator is informed when files are being uploaded.
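
   The submission rule encouraged above might be expressed as the
   following check.  It is a sketch under stated assumptions: the
   function and argument names are hypothetical, and a real browser
   would also need to handle the expiry of the operator's preference.

```python
def may_submit(has_file_upload, initiated_by_script,
               operator_allows_scripts=False):
    """Decide whether a form submission should be allowed to proceed."""
    if not has_file_upload:
        return True   # no files or device input are involved
    if not initiated_by_script:
        return True   # the operator pressed a standard submit button
    # A script is submitting an upload: allow it only if the operator
    # has explicitly enabled that behaviour.
    return operator_allows_scripts
```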
 
4.  User interface usability and quality concerns for audio 
 
   An audio sample is customarily recorded on computer equipment with 
   a dialog routine capable of allowing the user to record, pause, 
   play back, erase, or otherwise edit the recording.  Browsers might 
   provide the operator with the same kind of dialog routine for audio 
   device input.  If a MAXLENGTH has been specified or is in force
   because of limited buffer size, the buffer space used and
   remaining might be displayed as a dynamic bar graph (or as a
   percentage if graphics are unavailable).  A display of time in
   seconds used and remaining in the buffer may also be provided.
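
   The figures for such a display might be computed as sketched
   below.  The function name, the text rendering of the bar, and the
   default of two bytes per sample (as for audio/l16) are assumptions
   made for illustration.

```python
def buffer_status(used_bytes, max_bytes, rate, channels=1,
                  bytes_per_sample=2, width=20):
    """Summarize recording buffer usage for display to the operator."""
    bytes_per_second = rate * channels * bytes_per_sample
    remaining = max(max_bytes - used_bytes, 0)
    fraction = min(used_bytes / max_bytes, 1.0)
    filled = round(fraction * width)
    return {
        "bar": "[" + "#" * filled + "-" * (width - filled) + "]",
        "percent": round(fraction * 100),
        "seconds_used": used_bytes / bytes_per_second,
        "seconds_remaining": remaining / bytes_per_second,
    }
```

   For a 441,000 byte buffer half filled with monaural sixteen-bit
   samples at 11,025 samples per second, this reports 50 percent used,
   with ten seconds recorded and ten seconds remaining.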
 
   Most MIME types defined for audio do not provide high-quality audio 
   encodings.  The 'audio/basic' and other types which use a sample 
   rate of 8,000 samples per second truncate the audio spectrum at 
   4,000 Hz according to the Nyquist theorem, discarding information 
   important for discerning consonants.  Also, audio/basic and other 
   MIME Audio types use a sample size of eight bits, which does not 
   usually provide enough dynamic range for accurate automatic speech 
   recognition unless published automatic gain control algorithms are 
   reliably used.  If sixteen-bit signed audio encodings are used
   according to section 4.4.8 of RFC 1890, the sample rate --
   specified as the 'rate' parameter of the MIME type 'audio/l16' --
   might be at least 11,025 or 16,000 samples per second to provide
   enough information for automatic speech recognition.
   Alternatively, the audio feature extraction encoding of the speech
   recognition algorithm might be used to provide a more compact
   representation to shorten the upload.
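
   The arithmetic behind these figures can be checked with a short
   sketch.  The function names are hypothetical, and the dynamic range
   formula is the usual approximation for linear samples; it does not
   account for the companded encoding used by audio/basic.

```python
import math


def nyquist_hz(rate):
    """Highest frequency representable at a given sample rate."""
    return rate / 2


def dynamic_range_db(bits):
    """Approximate dynamic range of linear samples, about 6 dB per bit."""
    return 20 * math.log10(2 ** bits)


def upload_bytes(seconds, rate, bits, channels=1):
    """Size in bytes of uncompressed linear audio of the given length."""
    return seconds * rate * channels * bits // 8
```

   At 8,000 samples per second the spectrum ends at 4,000 Hz; eight
   linear bits give roughly 48 dB of dynamic range and sixteen bits
   roughly 96 dB; and ten seconds of monaural audio/l16 at 11,025
   samples per second occupies 220,500 bytes.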
 
5.  Compatibility with earlier forms of audio input 
 
   Audio device input has been proposed before and implemented from a 
   microphone at least as early as 1994 in experimental versions of 
   common Web browsers.  To accommodate the syntax of these earlier 
   extensions, a browser might interpret an HTML element such as
 
     <INPUT TYPE=AUDIO ...> 
 
   as the device input form 
 
     <INPUT TYPE=FILE DEVICE=MICROPHONE ...> 
 
   with all other attribute/value pairs of the original INPUT element 
   kept the same as specified.  This would retain compatibility for 
   all implementations of which the author of this draft is aware. 
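
   This rewriting might be sketched as follows, assuming the browser
   has already parsed the INPUT element into a mapping of attribute
   names to values.  The function name and that representation are
   hypothetical.

```python
def normalize_input(attrs):
    """Map the earlier TYPE=AUDIO syntax onto the device input form.

    attrs: mapping of attribute name -> value for one INPUT element.
    All other attribute/value pairs are kept as specified.
    """
    attrs = {key.upper(): value for key, value in attrs.items()}
    if (attrs.get("TYPE") or "").upper() == "AUDIO":
        attrs["TYPE"] = "FILE"
        attrs["DEVICE"] = "MICROPHONE"
    return attrs
```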
 
6.  HTML Document Type Definition changes
 
   Along with the extension to the HTML InputType entity described in 
   the previous section, this proposal makes an addition to the HTML 
   DTD for the INPUT element ATTLIST of an #IMPLIED attribute DEVICE 
   of type CDATA. 
 
7.  Motivations and conclusion 
 
   The primary motivation for these extensions is to add the 
   capability of speech input to Web-based educational systems.  For 
   example, the "Test of English as a Foreign Language," or TOEFL
   assessment consists of multiple choice questions based on media
   composed of text and audio recordings, so it would be possible to
   represent the TOEFL with current HTML multimedia content and forms.
   However, the TOEFL makes no provision whatsoever for assessing the
   accuracy of pronunciation by the subjects of the assessment, except
   that provided by the ability to accurately identify the terms in
   the text of the assessment.  So while scoring the important
   ability to listen, the TOEFL does not make provisions to assess the
   important ability to speak with correct pronunciation.  But with
   form-based audio input and upload, and speech recognition servers 
   capable of aligning and scoring the pronunciation of words and 
   phonemes, such a Web-based TOEFL could be extended to reduce the 
   number of inscrutable graduate teaching assistants, for example. 
   Of course the possibilities for language instruction enabled by 
   these extensions are not limited to the graduate level or English. 
 
   Other motivations include the development of "dictation servers" 
   capable of transforming spoken audio uploaded through an HTTP
   session to the corresponding text suitable for sending in email or 
   including in another document, for example.  Natural language 
   continuous speech recognition software conforming to standard APIs 
   for automatic dictation is as of this writing available from retail 
   outlets for less than US$90 so there is ample reason to believe 
   that dictation servers could soon become commonplace on the Web 
   with these extensions. 
 
   Finally, this could be a great help for hearing-impaired people who
   want to use a "phonology server" (similar to the server described 
   in the Web-TOEFL example above) to practice improving their 
   pronunciation without depending on a human speech coach. 
 
   The change to the HTML DTD is very simple, but very powerful.  It 
   enables a much greater variety of services to be implemented via 
   the World-Wide Web than is currently possible due to the lack of a 
   peripheral input upload submission facility.  This would be a very 
   valuable addition to the capabilities of the World-Wide Web. 
 
8.  Author's address and acknowledgments 
 
   James Salsman 
   Bovik Research (nonprofit research institute) 
   courtesy WebTV Networks, Microsoft Corporation 
   and MindSource Software Engineers 
 
   575 S. Rengstorff Avenue 
   Mountain View, CA  94040-1982 

   Email:  jps@bovik.org, jsalsman@corp.webtv.net  
   Phone:  (650) 938-1440 
 
   "TOEFL" and "Test Of English as a Foreign Language" are 
   registered trademarks of Educational Testing Service. 
 
References 
 
[RFC 1867] Form-based File Upload in HTML.  E. Nebel & L. Masinter, 
           November 1995.  ftp://ds.internic.net/rfc/rfc1867.txt 
 
[RFC 1889] RTP: A Transport Protocol for Real-Time Applications. 
           H. Schulzrinne, S. Casner, R. Frederick, & V. Jacobson, 
           January 1996.  ftp://ds.internic.net/rfc/rfc1889.txt 
 
[RFC 1890] RTP Profile for Audio and Video Conferences with Minimal 
           Control.  H. Schulzrinne, January 1996. 
           ftp://ds.internic.net/rfc/rfc1890.txt
 
[RFC 2068] Hypertext Transfer Protocol -- HTTP/1.1.  R. Fielding, 
           J. Gettys, J. Mogul, H. Frystyk, & T. Berners-Lee, 
           January 1997.  ftp://ds.internic.net/rfc/rfc2068.txt 
 
END OF INTERNET-DRAFT 
 
:jps

Received on Thursday, 20 November 1997 19:30:54 UTC