- From: Amelia A Lewis <alewis@tibco.com>
- Date: Tue, 3 Feb 2009 13:25:25 -0500
- To: SOAP/JMS (list) <public-soap-jms@w3.org>
Heyo.

This is to provide a partial catalog, noting some issues.

In MIME, encoding is specified via the charset parameter of the Content-Type header (for text types), or via some approximately-similar mechanism for subtypes of other types which contain something text-like. The default encoding, if none is specified, is US-ASCII; all headers contain only ASCII (or embedded base64-encoded non-ASCII, the interpretation of which is open to significant question, though common practice uses the encoding of the body). The Content-Transfer-Encoding in MIME defaults to 7bit; to use anything else, you must explicitly specify "8bit", "binary", and so on (these interact with "quoted-printable" and "base64" in ways not always entirely intuitive).

Multipart messages cannot specify an encoding; it's not a reasonable thing to do. A multipart message is a container; it contains as visible bits the boundaries and the headers of contained parts, which in MIME is all expressible in 7bit ASCII.

In HTTP pseudo-MIME, the default if unspecified is ISO-8859-1; the encoding explicitly applies to the content of headers (so you can't parse the headers until you read the headers). Note that it's possible that this has changed since I last forced myself to read the specification in detail; it would be nice if so. 7bit (the MIME default) is not permitted in HTTP; neither is quoted-printable encoding; HTTP doesn't understand 8bit encoding. It's binary or nothing, baby, but this tends to be obscured by the fact that the Content-Transfer-Encoding header is forbidden, and the defaults for HTTP are "binary ISO-8859-1".

As with MIME, multipart messages cannot specify an encoding, although the situation here is slightly more complex (you can crash a lot of programs by supplying them with high-bit characters in the boundary string, for instance), since both boundaries and headers of contained parts can, in theory, contain characters outside the ASCII range.
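To make the two sets of defaults concrete, here is a minimal sketch in Python of how a receiver might pick the charset for a single part. The function name effective_charset and the "mime"/"http" profile labels are made up for illustration; the defaults are the ones stated in this message, not a fresh reading of the current specifications.

```python
from email.message import Message
from email.utils import collapse_rfc2231_value

# Defaults as described in this message (assumption: not re-checked
# against the latest revisions of the MIME or HTTP specifications).
DEFAULT_CHARSET = {"mime": "us-ascii", "http": "iso-8859-1"}


def effective_charset(content_type, profile):
    """Return the charset to assume for one part, or None if not applicable.

    content_type: raw Content-Type header value, or None if absent.
    profile: "mime" or "http" (hypothetical labels for the two rule sets).
    """
    msg = Message()
    if content_type is not None:
        msg["Content-Type"] = content_type

    # A multipart is only a container; it carries no charset of its own.
    if msg.get_content_maintype() == "multipart":
        return None

    charset = msg.get_param("charset")
    if charset is not None:
        # get_param may return an RFC 2231 tuple; collapse it to a string.
        return collapse_rfc2231_value(charset).lower()

    # The protocol default only makes sense for text types.
    if msg.get_content_maintype() == "text":
        return DEFAULT_CHARSET[profile]
    return None
```

For example, effective_charset("text/plain", "mime") and effective_charset("text/plain", "http") give different answers for the same header, which is the whole problem in miniature.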
Neither MIME nor HTTP recognizes a BOM in the body of a message (single part or multipart) as an encoding indicator (such a BOM might be honored, ignored, or the cause of an error, consequently).

Now ... XML has two mechanisms to specify encoding: if the first bytes of a file are a Unicode BOM (for UTF-16 traditionally, but some parsers also recognize the UTF-8 BOM), that's what it is. If there's no BOM, then the XML specification provides an algorithm for reading the XML declaration to find the content of the "encoding" pseudo-parameter. If there is no encoding pseudo-parameter, then the default is UTF-8.

So, for MIME or HTTP: if the header (of a single-part message or a contained part) specifies an encoding, then (if you're a protocol-first believer) that's what the encoding is. If the part in question contains XML, then the BOM or encoding pseudo-parameter controls (if you're a self-contained XML believer), even when the protocol specifies otherwise (true believers use UTF-8 when there is no encoding pseudo-parameter, regardless of any explicit specification in the protocol headers).

Note the above paragraph: one of the big unsolved controversies is how this is treated. Chances are good that there are "proponents" of both teams in our working group (proponents in scare quotes because they may not care personally, but may represent companies whose products have adopted conflicting solutions to this problem).

Adding SOAP brings only one wrinkle: SOAP 1.1 is encoded as text/xml (improperly) and (I think) doesn't permit an XML declaration. SOAP 1.2 is application/xml or application/soap+xml, and the encoding is specified by the XML declaration or BOM.

That more or less covers MIME and HTTP. Now, let's add JMS to the mix. Twice.

First, using BytesMessage, all of the above holds for JMS messages. A header (custom JMS Property) can specify the encoding (for single-part messages), and standard MIME headers can do so for contained parts.
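With a BytesMessage payload the receiver holds raw bytes and has to pick an encoding itself, so the two camps can be sketched side by side. This is a simplified illustration in Python, not the detection algorithm from the XML specification: it only handles a BOM or an ASCII-compatible XML declaration, and the names xml_declared_encoding, choose_encoding, and the policy labels are hypothetical. The fallback used by the protocol-first branch when no header is present is also an assumption, since nothing above settles it.

```python
import codecs
import re

# Byte order marks checked first (UTF-32 is left out of this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding pseudo-parameter of an XML declaration at the very
# start of the body. Only works for ASCII-compatible encodings.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(body):
    """Return the encoding the XML itself announces, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return None


def choose_encoding(body, header_charset, policy):
    """Pick an encoding under one of the two conflicting policies."""
    xml_enc = xml_declared_encoding(body)
    if policy == "protocol-first":
        # Explicit header wins; the fallback order after that is an assumption.
        return header_charset or xml_enc or "utf-8"
    # "xml-first": BOM or declaration wins, else UTF-8, headers be damned.
    return xml_enc or "utf-8"
```

Feeding both policies the same body with no XML declaration and a header charset of ISO-8859-1 yields "iso-8859-1" for protocol-first and "utf-8" for xml-first, which is exactly the disagreement described above.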
The same conflict with BOM/encoding pseudo-parameter in XML remains (and should probably be left just as unresolved here as it is in HTTP). JMS *should* pattern itself after HTTP pseudo-MIME, in particular using the same defaults (not because it's a good idea, but because we're doing SOAP, so we need to use as many bad ideas from HTTP as we possibly can manage).

Okay?

However, using TextMessage, all of the above is piffle. Stuff in as many encoding declarations as you want, in as many places as you want; it's defined by JMS to return a java.lang.String, so it's *already UTF-16*, and if you try to muck about with it, fish on you.

Note that sending Base64 via TextMessage (even if an implementation has a compact representation on the wire) is potentially a memory-killer on the receiving end: Base64 normally turns n bytes into n/3 * 4 characters, but since each character of a Java String takes two bytes, in this case you get n/3 * 4 * 2 bytes.

Consider this the "TextMessage encoding exception", or something similar: if you're using the TextMessage API, you really oughta be disregarding any remaining indicators of how this UTF-16 string *used* to be encoded.

Amy!
--
Amelia A. Lewis
Senior Architect
TIBCO/Extensibility, Inc.
alewis@tibco.com
Received on Tuesday, 3 February 2009 18:26:11 UTC