Notes for HTML techniques: forms from Martin Duerst on 2003-03-02 (public-i18n-geo@w3.org from March 2003)

From: Martin Duerst <duerst@w3.org>
Date: Sat, 01 Mar 2003 21:03:25 -0500
To: public-i18n-geo@w3.org
Message-ID: <1057787663.IAA22192@phantom.w3.org>
Hello Richard, others,

Here are some notes for "Dealing with character sets & encodings",
currently 13.4, but I think this should become 13.1. There are
many things that potentially may be added, and quite a few things
that we may want to check.

The structure should be more or less obvious from the indenting
and the subtitles. I have chosen to just create my own,
problem-oriented structure. Let's see how well this fits
with the overall structures we already have.

Regards,    Martin.


Character Encodings in Forms
============================

Background:

Making sure that the data that comes back from a Web form is
in a known encoding is extremely important for the correct
working of Web forms. Before 4th-generation user agents, there
was a lot of undefined and accidental behavior. As of 4th-generation
and later user agents, the rule is to send back form data in the
encoding in which the browser interpreted the document.
This is not very well described in HTML4; it mainly appears in the
sentence "The default value for this attribute is the reserved
string "UNKNOWN". User agents may interpret this value as the
character encoding that was used to transmit the document
containing this FORM element." in the description of the
(rarely used and not widely implemented?) accept-charset attribute
(see http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset).

??? say something about method='POST' (vs. GET), and
enctype='multipart/form-data' (vs. application/x-www-form-urlencoded).


The major alternatives are (choose one per site, or at least
per form handler (the URI in the action attribute of the <form>
element)):

1. Use UTF-8 throughout

    advantages:
        - Allows the same form handler to be used for forms in many
          different languages.
        - Does not limit the set of characters that can be input and transmitted.
        - Allows form request URIs to be displayed as IRIs.
        - Allows form to be converted to XForms (which requires UTF-8 for
          GET requests) without changing form handler.

    subtechnique: check that what you got back is really UTF-8
       Although this is extremely rare, there may still be
       very old user agents out there that mess things up. Also,
       for whatever reason, the user might accidentally have
       forced the user agent to use a different encoding.
       It is therefore helpful to check that the data that comes
       back in from the form is really UTF-8. This can be done
       by checking the byte pattern of the data.
       (background reading: The Properties and Promises of UTF-8, M.J. Du"rst,
        11th International Unicode Conference, San Jose, September 1997,
        http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf)

       Actual implementation:
          Perl: See the regular expression in the subroutine
          check_utf8 in the W3C HTML/XML validator at e.g.
          http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check?rev=1.322&content-type=text/x-cvsweb-markup
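
The check described above can also be sketched in Java. Note that this
is not the validator's regular expression: instead of matching byte
patterns directly, the sketch below asks a strict UTF-8 decoder to
reject malformed input, which should accept and reject roughly the same
byte sequences. The class and method names (Utf8Check, isValidUtf8) are
made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true if the given bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting
        // a replacement character for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "D\u00fcrst";
        // The same text as it would arrive in UTF-8 and in iso-8859-1.
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A form handler would apply such a check to the raw (percent-decoded)
bytes of each field before converting them to strings. Pure ASCII data
passes the check in any case, so the check only says something when
non-ASCII characters are present.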


2. Know the encoding you sent out

    If documents in different encodings are served, then it is important
    to know what encoding was served, because form results will (usually)
    come back in the same encoding. One way to do this is to use
    different form handlers for different encodings. But this may result
    in significant code duplication, and may be difficult to upgrade.
    Another way to do this is to include a hidden field containing the
    encoding. This field is sent back to the form processor without the
    user seeing it. (check: Microsoft Internet Explorer places the
    actual encoding used into the field if the field has the name
    ???? (check, provide reference, move to specific implementations))
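
A minimal sketch of the hidden-field approach, in Java, might look as
follows. It assumes an application/x-www-form-urlencoded request body,
a hidden field with the hypothetical name "form_encoding" whose value
the server wrote when generating the page, and UTF-8 as the fallback
when the field is absent. None of these names or defaults come from
the notes above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class HiddenEncodingField {
    // Hypothetical name of the hidden field that carries the page encoding.
    static final String ENCODING_FIELD = "form_encoding";

    /** Decodes a urlencoded body using the encoding named in the hidden field. */
    static Map<String, String> parse(String rawBody) {
        String[] pairs = rawBody.split("&");
        String encoding = "UTF-8"; // assumed default if the field is missing
        try {
            // First pass: the encoding name itself is plain ASCII.
            for (String pair : pairs) {
                int eq = pair.indexOf('=');
                if (eq > 0 && pair.substring(0, eq).equals(ENCODING_FIELD)) {
                    encoding = URLDecoder.decode(pair.substring(eq + 1), "US-ASCII");
                }
            }
            // Second pass: decode every name and value in that encoding.
            Map<String, String> fields = new LinkedHashMap<>();
            for (String pair : pairs) {
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    continue; // skip fragments without '='
                }
                fields.put(URLDecoder.decode(pair.substring(0, eq), encoding),
                           URLDecoder.decode(pair.substring(eq + 1), encoding));
            }
            return fields;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("form_encoding=iso-8859-1&name=D%FCrst"));
        System.out.println(parse("form_encoding=UTF-8&name=D%C3%BCrst"));
    }
}
```

Because the field value comes back from the client, a real handler
would probably want to accept only the encodings the site actually
serves rather than any name the client supplies.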


Limitations: Both techniques described above rely on the document not being
transcoded when sent to the user agent. The exceptions known to this are:
1) Some servers in Russia transcode HTML documents before sending them out
depending on the operating system of the user agent. 2) Some gateways
(working as HTTP proxies) to user agents on mobile phones convert to a
single encoding known by the mobile phone. A variant of the 'hidden field'
technique can be used to trace code conversions on the way to the user
agent: Some well-known text string is placed in a hidden field. This
text string will be converted like the rest of the document, and will
be sent to the form processor. The form processor can then determine
the encoding of that text string by comparing it to transcoded versions
of the text string, and can interpret the other parameters sent to
the form as being in the same encoding.
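
The comparison step of this variant can be sketched in Java as follows.
The sketch assumes that the form processor has access to the raw
(percent-decoded, but not yet charset-decoded) bytes of the hidden
field, and that it knows the probe string and a list of candidate
encodings. The class name, the method name and the probe character are
illustrative only.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class EncodingProbe {
    /**
     * Returns the first candidate charset whose encoding of the probe
     * string equals the bytes that came back, if there is one.
     */
    static Optional<Charset> detect(byte[] received, String probe,
                                    List<Charset> candidates) {
        for (Charset candidate : candidates) {
            // Skip charsets that cannot represent the probe at all;
            // they would encode it with substitution bytes.
            if (!candidate.canEncode() || !candidate.newEncoder().canEncode(probe)) {
                continue;
            }
            if (Arrays.equals(received, probe.getBytes(candidate))) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String probe = "\u0416"; // CYRILLIC CAPITAL LETTER ZHE
        List<Charset> candidates = new ArrayList<>();
        // Only UTF-8 is guaranteed on every JVM, so filter the rest.
        for (String name : new String[] {"UTF-8", "KOI8-R", "windows-1251"}) {
            if (Charset.isSupported(name)) {
                candidates.add(Charset.forName(name));
            }
        }
        // Simulate a gateway that transcoded the page to the last candidate.
        byte[] received = probe.getBytes(candidates.get(candidates.size() - 1));
        System.out.println(detect(received, probe, candidates));
    }
}
```

The probe string should contain characters whose byte representations
differ between all candidate encodings; otherwise two candidates can
match the same bytes and the first one in the list wins.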


Additional technique (applies in all cases):

   Make sure the character encoding of the form page is recognized correctly

    Because form results are sent back in the encoding of the document
    that contains the form, it is crucial that the user agent recognizes
    the encoding of the form page correctly.
    This can be ensured in two ways (ideally use both):
    1. Follow the techniques given in section 3.2 for specifying
       the character encoding
    2. Include some appropriate non-ASCII text in the page so that the
       user is able to see whether the user agent recognized the
       character encoding of the forms page correctly.

Notes on specific implementations:
   Browser side:
     Microsoft Internet Explorer (V. 4-5.5, verify for 6) posts data as
     UTF-8 for all Unicode-based encodings (e.g. UTF-16), for both
     application/x-www-form-urlencoded and multipart/form-data
     (source: http://support.microsoft.com/default.aspx?scid=kb;en-us;303612)

   Server side:
     Java servlets (add versions here!) interpret incoming form data as
     iso-8859-1. The interpretation of the data has to be changed to UTF-8
     (or whatever else it actually is). This can be done by interpreting the
     parameter as bytes based on iso-8859-1, and then recreating a String
     from the bytes based on UTF-8 (or whatever other encoding).
     Example:
     String field = new String(request.getParameter("fieldname").getBytes("iso-8859-1"), "utf-8");
     See e.g. also the 'process' method at
     http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java?rev=1.70&content-type=text/x-cvsweb-markup
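
To see why the round trip through iso-8859-1 works, the following
self-contained sketch simulates it without a servlet container. The
class and method names are invented for this illustration; the first
step in main stands in for what the container is described as doing
above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReinterpretParameter {
    /**
     * Recovers a parameter that was wrongly decoded as iso-8859-1.
     * iso-8859-1 maps every byte to exactly one character, so encoding
     * the string back yields the original bytes without loss.
     */
    static String reinterpret(String misdecoded, Charset actual) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        String sent = "D\u00fcrst";
        // Simulated container behavior: UTF-8 bytes read as iso-8859-1.
        String misdecoded = new String(sent.getBytes(StandardCharsets.UTF_8),
                                       StandardCharsets.ISO_8859_1);
        String repaired = reinterpret(misdecoded, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(sent)); // prints true
    }
}
```

The repair is only correct if the container really used iso-8859-1;
applying it to a parameter that was already decoded correctly would
damage any non-ASCII characters in it.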
Received on Saturday, 1 March 2003 21:06:21 UTC