- From: Martin Duerst <duerst@w3.org>
- Date: Sat, 01 Mar 2003 21:03:25 -0500
- To: public-i18n-geo@w3.org
Hello Richard, others,

Here are some notes for "Dealing with character sets & encodings",
currently 13.4, but I think this should become 13.1. There are many
things that potentially may be added, and quite a few things that we
may want to check. The structure should be more or less obvious from
the indenting and the subtitles. I have chosen to just create my own,
problem-oriented structure. Let's see how well this fits with the
overall structures we already have.

Regards,    Martin.


Character Encodings in Forms
============================

Background:

   Making sure that the data that comes back from a Web form is in a
   known encoding is extremely important for the correct working of
   Web forms. Before 4th-generation user agents, there was a lot of
   undefined and accidental behavior. As of 4th-generation and later
   user agents, the rule is to send back form data in the encoding in
   which the browser interpreted the document. This is not very well
   described in HTML4; it appears mainly in the sentence

      "The default value for this attribute is the reserved string
      "UNKNOWN". User agents may interpret this value as the character
      encoding that was used to transmit the document containing this
      FORM element."

   in the description of the (rarely used and not widely
   implemented?) accept-charset attribute (see
   http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset).

   ??? say something about method='POST' (vs. GET), and
   enctype='multipart/form-data' (vs. application/x-www-form-urlencoded).

The major alternatives are (choose one per site, or at least per form
handler (the URI in the action attribute of the <form> element)):

1. Use UTF-8 throughout

   advantages:
   - Allows the same form handler to be used for forms in many
     different languages.
   - Does not limit the set of characters that can be input and
     transmitted.
   - Allows form request URIs to be displayed as IRIs.
   - Allows the form to be converted to XForms (which requires UTF-8
     for GET requests) without changing the form handler.
   subtechnique: check that what you got back is really UTF-8

      Although this is extremely rare, there may still be very old
      user agents out there that mess things up. Also, for whatever
      reason, the user might accidentally have forced the user agent
      to use a different encoding. It is therefore helpful to check
      that the data that comes back from the form is really UTF-8.
      This can be done by checking the byte pattern of the data:
      non-ASCII text in other encodings very rarely matches the UTF-8
      pattern of a lead byte followed by the right number of trail
      bytes in the range 0x80-0xBF.
      (background reading: The Properties and Promises of UTF-8,
      M.J. Dürst, 11th International Unicode Conference, San Jose,
      September 1997,
      http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf)

      Actual implementation:

      Perl: See the regular expression in the subroutine check_utf8
      in the W3C HTML/XML validator at e.g.
      http://dev.w3.org/cvsweb/validator/httpd/cgi-bin/check?rev=1.322&content-type=text/x-cvsweb-markup

2. Know the encoding you sent out

   If documents in different encodings are served, then it is
   important to know what encoding was served, because form results
   will (usually) come back in the same encoding.

   One way to do this is to use different form handlers for different
   encodings. But this may result in significant code duplication,
   and may be difficult to upgrade.

   Another way to do this is to include a hidden field containing the
   encoding. This field is sent back to the form processor without
   the user seeing it.
   (check: Microsoft Internet Explorer places the actual encoding
   used into the field if the field has the name ???? (check, provide
   reference, move to specific implementations))

   Limitations:

   Both techniques described above rely on the document not being
   transcoded when sent to the user agent. The known exceptions to
   this are:

   1) Some servers in Russia transcode HTML documents before sending
      them out, depending on the operating system of the user agent.
   2) Some gateways (working as HTTP proxies) to user agents on
      mobile phones convert to a single encoding known by the mobile
      phone.

   A variant of the 'hidden field' technique can be used to trace
   code conversions on the way to the user agent: Some well-known
   text string is placed in a hidden field. This text string will be
   converted like the rest of the document, and will be sent to the
   form processor. The form processor can then determine the encoding
   of that text string by comparing it to transcoded versions of the
   text string, and can interpret the other parameters sent to the
   form as being in the same encoding.

Additional technique (applies in all cases):
Make sure the character encoding of the form page is recognized
correctly

   Because form results are sent back in the encoding of the document
   that contains the form, it is crucial that the user agent
   recognizes the encoding of the form page correctly. This can be
   assured in two ways (ideally use both):

   1. Follow the techniques given in section 3.2 for specifying the
      character encoding.
   2. Include some appropriate non-ASCII text in the page so that the
      user is able to see whether the user agent recognized the
      character encoding of the form page correctly.

Notes on specific implementations:

   Browser side:

   Microsoft Internet Explorer (V. 4-5.5, verify for 6) posts data as
   UTF-8 for all Unicode-based encodings (e.g. UTF-16), for both
   application/x-www-form-urlencoded and multipart/form-data
   (source: http://support.microsoft.com/default.aspx?scid=kb;en-us;303612)

   Server side:

   Java servlets (add versions here!) interpret incoming form data as
   iso-8859-1. The interpretation of the data has to be changed to
   UTF-8 (or whatever else it actually is). This can be done by
   turning the parameter back into bytes based on iso-8859-1, and
   then recreating a String from those bytes based on UTF-8 (or
   whatever other encoding). Example:

      String field =
          new String(request.getParameter("fieldname").getBytes("iso-8859-1"),
                     "utf-8");

   See also, e.g., the 'process' method at
   http://dev.w3.org/cvsweb/java/classes/org/w3c/rdf/examples/ARPServlet.java?rev=1.70&content-type=text/x-cvsweb-markup
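
Sketch of the two server-side checks:

   The two checks described in the notes above (verifying that form
   data is really UTF-8, and using a well-known text string in a
   hidden field to trace transcoding) could look roughly like the
   following in Java. This is only a minimal sketch under some
   assumptions: the class and method names (FormEncodingCheck,
   isValidUtf8, detectEncoding) are made up for illustration, the raw
   bytes of each field are assumed to be available after URL
   unescaping, and the UTF-8 check uses the JDK's strict
   CharsetDecoder rather than the regular expression from the
   validator's check_utf8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class FormEncodingCheck {

    // Returns true only if the bytes are well-formed UTF-8.
    // The decoder is told to report malformed input instead of
    // silently replacing it, so any bad byte sequence is detected.
    static boolean isValidUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hidden-field trace: 'received' holds the raw bytes of the
    // well-known string as they came back from the form, 'expected'
    // is that string as it was placed in the page. Returns the first
    // candidate encoding under which 'expected' encodes to exactly
    // the received bytes, or null if none matches.
    static String detectEncoding(byte[] received, String expected,
                                 List<String> candidates) {
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue; // this JVM does not know the encoding
            }
            Charset cs = Charset.forName(name);
            if (!cs.newEncoder().canEncode(expected)) {
                continue; // would be replaced by '?', so skip it
            }
            if (Arrays.equals(expected.getBytes(cs), received)) {
                return name;
            }
        }
        return null;
    }
}
```

   The trace string should contain non-ASCII characters that encode
   differently in each candidate encoding; a pure ASCII string would
   match almost every candidate and tell the form processor nothing.
   If detectEncoding returns null, or isValidUtf8 returns false, it is
   probably safer to reject the submission than to guess.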
Received on Saturday, 1 March 2003 21:06:21 UTC