Unicode and Legacy Encodings in SOAP Transactions

SOAP transactions rely on being able to exchange data in a consistent, mutually understandable way. The character encoding of the SOAP message and the communication of the encoding between senders and receivers enable this to occur reliably. Because all XML [XML] processors must be able to read entities in both the UTF-8 [RFC2279] and UTF-16 [RFC2781] encodings, using UTF-8 or UTF-16 guarantees character encoding interoperability on the SOAP layer. The Character Model for the World Wide Web [CHARMOD] document describes these considerations and guidelines.

If you are using SOAP 1.1 with the Content-Type text/xml, the charset parameter MUST be supplied to ensure correct interoperability, because the default charset for text/xml is us-ascii. If you are using SOAP 1.2, the Content-Type is application/soap+xml. If the charset parameter is omitted for application/soap+xml, the encoding of the SOAP document is determined using the rules provided in XML. In all cases, the charset parameter in the media type takes precedence over the encoding declaration in the XML that forms the SOAP document. Please refer to RFC3023, XML 1.0, and RFC2045/2046 for more information.
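
The precedence rules above can be sketched in Python as follows. This is a minimal illustration, not a conforming implementation: the function name effective_charset is hypothetical, and the inspection of the XML document is deliberately simplified (byte order mark, then encoding declaration, then the UTF-8 default) rather than the full autodetection procedure described in XML 1.0.

```python
import re
from email.message import Message

# Simplified pattern for the encoding declaration in an XML prolog
# (assumes an ASCII-compatible encoding so the prolog is readable as bytes).
_ENCODING_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def effective_charset(content_type: str, body: bytes) -> str:
    """Return the charset a receiver would apply to a SOAP message body."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_param("charset")

    # 1. An explicit charset parameter always wins over the XML declaration.
    if charset is not None:
        return str(charset).lower()

    # 2. text/xml without a charset parameter defaults to us-ascii (RFC3023).
    if header.get_content_type() == "text/xml":
        return "us-ascii"

    # 3. Otherwise (e.g. application/soap+xml) examine the document itself.
    if body.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    match = _ENCODING_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # XML default when nothing else is declared
```

With this sketch, a SOAP 1.1 message sent as text/xml with no charset parameter is treated as us-ascii even if its XML declaration says UTF-8, which is why the parameter has to be supplied in that case.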

Scenario C: A SOAP Sender sends a legacy (non-Unicode) encoded request which the receiver doesn't support. The SOAP processor should fail and may return a fault.

Scenario D: A SOAP processor receives and processes a request and returns a result. The response is encoded using a character encoding not supported by the original Sender, so the Sender will not be able to process the response. This is an unrecoverable error. SOAP users should agree in advance on the collection of encodings that will be used in the transactions. Ideally all transactions will use a Unicode encoding, such as UTF-8, since all XML processors are required to handle this encoding.

Scenario E: Some encodings have more characters than are included in Unicode or use Private Use characters. SOAP messages sent using these problematic characters may result in transient failure or odd results. These characters should be avoided wherever possible; otherwise, the parties should agree in advance on the charset to be used.

Scenario F: A processor receives a SOAP message whose encoding declaration doesn't match its actual encoding. The processor should fail (according to the rules in RFC3023 and in XML) and may return a fault.
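
One way a receiver might notice the mismatch in Scenario F is to decode the body strictly with the declared encoding and treat any decoding error as a failure. The sketch below assumes a hypothetical helper named decodes_as_declared. It only catches mismatches that produce invalid byte sequences: an encoding such as Latin-1 accepts every byte value, so a wrong declaration of that kind would pass unnoticed.

```python
def decodes_as_declared(body: bytes, declared: str) -> bool:
    """Return True if body is a valid byte sequence in the declared encoding."""
    try:
        body.decode(declared, errors="strict")
    except UnicodeDecodeError:
        # The bytes are not valid in the declared encoding.
        return False
    except LookupError:
        # The declared encoding name is unknown to this processor.
        return False
    return True


# A message declared as UTF-8 but actually serialized as Latin-1.
mislabeled = '<?xml version="1.0" encoding="UTF-8"?><name>café</name>'.encode("latin-1")
if not decodes_as_declared(mislabeled, "utf-8"):
    print("encoding mismatch: the processor should fail and may return a fault")
```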

Scenario G: A processor receives and processes a SOAP message. The processor invokes an agent (the actual service), which uses a legacy encoding. Data may be lost or corrupted by the transcoding process between the receiving SOAP processor and the agent. The transaction may seem to succeed, even though the data is corrupted.

Example G: A Web service for "insert new record" is created for a relational database using Latin-1 as an encoding. The new record sent by the sender contains all kanji characters. The invocation of the service succeeds, even though all of the kanji characters are converted to the substitution character (generally a ?). The failure may not be detectable except by inspecting the resulting data.
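
The silent loss in Example G can be reproduced with a short Python sketch. The function names are hypothetical and stand in for whatever transcoding step sits between the SOAP processor and the database. The first function mimics a lenient conversion that substitutes unmappable characters, and the second shows a strict check that would expose the problem before the data is stored.

```python
def transcode_leniently(text: str, encoding: str = "latin-1") -> bytes:
    """Convert text to the legacy encoding, substituting unmappable characters."""
    return text.encode(encoding, errors="replace")


def can_encode(text: str, encoding: str = "latin-1") -> bool:
    """Return True only if every character survives the conversion."""
    try:
        text.encode(encoding, errors="strict")
    except UnicodeEncodeError:
        return False
    return True


record = "漢字"
print(transcode_leniently(record))  # b'??' : the call succeeds, the data is gone
print(can_encode(record))           # False : a strict check detects the loss
```

A service that performs the strict check first can return a fault instead of storing substitution characters.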

Note that the XML Japanese Profile [XML-JP] explains that legacy encodings such as Shift_JIS cannot provide complete interoperability in information interchange; there are differences among platforms in the mapping tables they use for this and similar encodings.