Charsets in HTML Forms

Modern web browsers encode HTML form data in the charset encoding named by the form's accept-charset attribute. If the form has no accept-charset attribute, browsers use the encoding of the enclosing page instead. When a user submits a form, the browser sends the data to the server in an HTTP GET or POST request. In either case, the browser encodes the data before sending it.
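The effect of the chosen encoding on the submitted bytes can be sketched in Python. This is only an approximation of what a browser does, using the standard library's urlencode; the field name "name" and the value "José" are made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical form with a single text field.
fields = {"name": "José"}

# The same characters become different bytes, and therefore different
# percent-escapes, depending on the encoding used for the form.
utf8_body = urlencode(fields, encoding="utf-8")
latin1_body = urlencode(fields, encoding="iso-8859-1")

print(utf8_body)    # name=Jos%C3%A9
print(latin1_body)  # name=Jos%E9
```

The two strings carry the same text, but a server has to know which encoding was used before it can turn the escapes back into characters.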

This page describes a problem that can arise when servers interpret or decode encoded form data, and suggests solutions.

Problem Description

One common problem is the misinterpretation of form data on the server side. If the server decodes form data using a different charset encoding from the one the browser used to encode it, the result is garbled or lost data.

Causes

Nothing in the encoded data itself communicates the charset encoding of that data. Server code that parses data often makes incorrect assumptions about encodings. For example, even though your browser sends UTF-8 encoded data, the server may mistakenly use the ISO-8859-1 charset to decode the data.
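A minimal Python sketch of this mismatch, assuming the browser sent the hypothetical value "José" and the server decodes the percent-escapes with the wrong charset:

```python
from urllib.parse import unquote_plus

# Field value as sent by a browser that encoded "José" in UTF-8.
sent_as_utf8 = "Jos%C3%A9"

correct = unquote_plus(sent_as_utf8, encoding="utf-8")
# Wrong guess: each UTF-8 byte is read as a separate ISO-8859-1 character.
garbled = unquote_plus(sent_as_utf8, encoding="iso-8859-1")

# The reverse mistake: ISO-8859-1 bytes decoded as UTF-8. The lone byte
# 0xE9 is not valid UTF-8, so it is replaced and the character is lost.
sent_as_latin1 = "Jos%E9"
lost = unquote_plus(sent_as_latin1, encoding="utf-8", errors="replace")

print(correct)  # José
print(garbled)  # JosÃ©
print(lost)     # Jos followed by U+FFFD, the replacement character
```

The first mistake produces garbled but recoverable text; the second destroys the original character.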

Problem Examples

From Richard:

Problem scenarios (in each case explain, using an example, how the characters are sent and the potential issues, and recommend best practices):

  1. the user types into the form field characters that are not in the encoding of the document

  2. some process adds characters to the form that are not in the encoding of the document

  3. the server doesn't know the encoding of the data it is receiving

  4. the server is receiving data in multiple encodings

  5. the user changes the encoding of the document manually before the form data is sent
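As a starting point for scenario 1, the following Python sketch approximates what current browsers do when a typed character cannot be represented in the form's encoding: they substitute a decimal numeric character reference before percent-encoding. The sample text and the choice of ISO-8859-1 are assumptions made for illustration.

```python
from urllib.parse import quote_plus

# Hypothetical input: Japanese text typed into a form on an ISO-8859-1 page.
typed = "日本"

# Characters outside the encoding become decimal character references.
encoded_bytes = typed.encode("iso-8859-1", errors="xmlcharrefreplace")
body_value = quote_plus(encoded_bytes)

# A user who literally typed the reference text produces the same bytes,
# so the server cannot tell the two cases apart.
literal_value = quote_plus("&#26085;&#26412;", encoding="iso-8859-1")

print(encoded_bytes)  # b'&#26085;&#26412;'
print(body_value)     # %26%2326085%3B%26%2326412%3B
```

Because the substitution is ambiguous on the server side, the safest practice is to use an encoding that can represent every character the user might type.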

Solutions

Describe a couple of solutions:

  1. Provide the charset encoding as a hidden form field that the server can read and apply to the remaining fields.

  2. Standardize on a single charset, UTF-8.
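A server-side sketch of the first solution, in Python. It assumes the form contains a hidden field named _charset_, which browsers fill in with the name of the encoding they used for the submission. The function name decode_form, the allowlist, and the UTF-8 fallback (in the spirit of the second solution) are illustrative choices, not a prescribed implementation.

```python
from urllib.parse import parse_qs

# Encodings this hypothetical server is willing to accept. The field
# value comes from the client, so it is checked rather than trusted.
ALLOWED = {"utf-8", "iso-8859-1", "windows-1252"}

def decode_form(raw_query, default="utf-8"):
    """Decode a raw query string using the charset named in _charset_."""
    # First pass: ISO-8859-1 maps every byte to a character, so this never
    # fails and the ASCII-only _charset_ value comes through intact.
    probe = parse_qs(raw_query, encoding="iso-8859-1")
    charset = probe.get("_charset_", [default])[0].lower()
    if charset not in ALLOWED:
        charset = default
    # Second pass: decode all fields with the declared encoding.
    return parse_qs(raw_query, encoding=charset, errors="replace")

print(decode_form("_charset_=ISO-8859-1&name=Jos%E9")["name"])  # ['José']
print(decode_form("name=Jos%C3%A9")["name"])                    # ['José']
```

Parsing twice keeps the sketch short; a real handler would more likely work on the raw bytes once and decode each field after reading the charset field.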