some notes on sources of encoding information

Heyo.

This is to provide a partial catalog, noting some issues.

In MIME, encoding is specified via the charset parameter of the 
Content-Type header (for text types), or via some approximately-similar 
mechanism for subtypes of other types which contain something text-like.

The default encoding, if none is specified, is US-ASCII; all headers 
contain only ASCII (or embedded base64-encoded non-ASCII, the 
interpretation of which is open to significant question, though common 
practice uses the encoding of the body).
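
A minimal sketch of the MIME lookup just described, using Python's 
standard email package.  The function name mime_charset is made up for 
illustration; the only rule it encodes is "use the charset parameter if 
present, else fall back to US-ASCII".

```python
from email import message_from_string


def mime_charset(raw_message: str, default: str = "us-ascii") -> str:
    """Return the charset named in Content-Type, or the MIME default.

    Hypothetical helper: it only looks at the top-level headers of a
    single-part message and does not validate the charset name.
    """
    msg = message_from_string(raw_message)
    # get_content_charset() lower-cases the value and returns failobj
    # when there is no Content-Type header or no charset parameter.
    return msg.get_content_charset(failobj=default)


if __name__ == "__main__":
    labelled = "Content-Type: text/plain; charset=ISO-8859-1\n\nhello\n"
    unlabelled = "Subject: no content type at all\n\nhello\n"
    print(mime_charset(labelled))
    print(mime_charset(unlabelled))
```

For the HTTP case discussed further down, the same call with 
default="iso-8859-1" would model the different default.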

The Content-Transfer-Encoding in MIME defaults to 7bit; to use anything 
else, you must explicitly specify "8bit", "binary", and so on (these 
interact with "quoted-printable" and "base64" in ways not always 
entirely intuitive).

Multipart messages cannot specify an encoding; it's not a reasonable 
thing to do.  A multipart message is a container; it contains as 
visible bits the boundaries and the headers of contained parts, which 
in MIME is all expressible in 7bit ASCII.

In HTTP pseudo-MIME, the default if unspecified is ISO-8859-1; the 
encoding explicitly applies to the content of headers (so you can't 
parse the headers until you read the headers) (note that it's possible 
that this has changed since I last forced myself to read the 
specification in detail; it would be nice if so).  7bit (MIME default) 
is not permitted in HTTP; neither is quoted-printable encoding; HTTP 
doesn't understand 8bit encoding.  It's binary or nothing, baby, but 
this tends to be obscured by the fact that the 
Content-Transfer-Encoding header is forbidden, and the defaults for 
HTTP are "binary ISO-8859-1".

As with MIME, multipart messages cannot specify an encoding, although 
the situation here is slightly more complex (you can crash a lot of 
programs by supplying them with high-bit characters in the boundary 
string, for instance), since both boundaries and headers of contained 
parts can, in theory, contain characters outside the ASCII range.

Neither MIME nor HTTP recognizes a BOM in the body of a message (single 
part or multipart) as an encoding indicator (consequently, such a BOM 
might be honored, ignored, or the cause of an error).

Now ... XML has two mechanisms to specify encoding: if the first bytes 
of a file are a Unicode BOM (for UTF-16 traditionally, but some parsers 
also recognize the UTF-8 BOM), that's what it is.  If there's no BOM, 
then the XML specification provides an algorithm for reading the XML 
declaration to find the content of the "encoding" pseudo-parameter.  If 
there is no encoding pseudo-parameter, then the default is UTF-8.
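
Roughly, the XML side of this can be sketched as below.  This is a 
deliberately simplified illustration, not the full detection algorithm 
from the XML specification: it assumes that a document without a BOM 
starts with an ASCII-compatible XML declaration, and it ignores UTF-32, 
BOM-less UTF-16, and EBCDIC entirely.  The name sniff_xml_encoding is 
hypothetical.

```python
import codecs
import re

# BOMs this sketch recognizes; UTF-32 is deliberately left out.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches the encoding pseudo-parameter of an XML declaration that
# sits at the very start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding: BOM, then declaration, then UTF-8."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # the default when nothing is declared


if __name__ == "__main__":
    print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
    print(sniff_xml_encoding(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")))
    print(sniff_xml_encoding(b"<a/>"))
```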

So, for MIME or HTTP: if the header (of a single-part message or a 
contained part) specifies an encoding, then (if you're a protocol-first 
believer) that's what the encoding is.  If the part in question 
contains XML, then the BOM or encoding pseudo-parameter controls (if 
you're a self-contained XML believer), even when the protocol specifies 
otherwise (true believers use UTF-8 when there is no encoding 
pseudo-parameter, regardless of any explicit specification in the 
protocol headers).
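
To make the two camps concrete, here is a small sketch of the decision 
each one makes.  Everything in it is illustrative: choose_encoding and 
the policy names "protocol-first" and "xml-first" are invented here, 
and the protocol-first fallback to the XML indicators when no header 
charset is given is an assumption, since the text above only says what 
that camp does when the header *does* specify one.

```python
from typing import Optional


def choose_encoding(
    protocol_charset: Optional[str],
    xml_encoding: Optional[str],
    policy: str,
) -> str:
    """Pick an encoding for an XML part under one of the two beliefs.

    protocol_charset: charset from the MIME/HTTP header, or None.
    xml_encoding: encoding from the BOM or XML declaration, or None.
    policy: "protocol-first" or "xml-first" (hypothetical labels).
    """
    if policy == "protocol-first":
        # Header wins when present; the fallback order is an assumption.
        return protocol_charset or xml_encoding or "utf-8"
    if policy == "xml-first":
        # The header is ignored; no XML indicator means UTF-8.
        return xml_encoding or "utf-8"
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    # Same message, two answers: this is the unresolved conflict.
    print(choose_encoding("iso-8859-1", None, "protocol-first"))
    print(choose_encoding("iso-8859-1", None, "xml-first"))
```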

Note the above paragraph: one of the big unsolved controversies is how 
this is treated.  Chances are good that there are "proponents" of both 
teams in our working group (proponents in scare quotes because they may 
not care personally, but may represent companies whose products have 
adopted conflicting solutions to this problem).

Adding SOAP brings only one wrinkle: SOAP 1.1 is encoded as text/xml 
(improperly) and (I think) doesn't permit an XML decl.  SOAP 1.2 is 
application/xml or application/soap+xml, and the encoding is specified 
by the XML decl or BOM.

That more or less covers MIME and HTTP.  Now, let's add JMS to the 
mix.  Twice.

First, using BytesMessage, all of the above holds for JMS messages.  A 
header (custom JMS Property) can specify the encoding (for single-part 
messages), and standard MIME headers can do so for contained parts.  
The same conflict with BOM/encoding pseudo-parameter in XML remains 
(and should probably be left just as unresolved here as it is in 
HTTP).  JMS *should* pattern itself after HTTP pseudo-MIME, in 
particular using the same defaults (not because it's a good idea, but 
because we're doing SOAP, so we need to use as many bad ideas from HTTP 
as we possibly can manage).  Okay?

However, using TextMessage, all of the above is piffle.  Stuff in as 
many encoding declarations as you want, in as many places as you want, 
it's defined by JMS to return a java.lang.String, so it's *already 
UTF-16* and if you try to muck about with it, fish on you.  Note that 
sending Base64 via TextMessage (even if an implementation has a compact 
representation on the wire) is potentially a memory-killer on the 
receiving end (n bytes / 3 * 4 is the standard Base64 expansion, but 
here every character costs two bytes, so you get n bytes / 3 * 4 * 2).  
Consider this the "TextMessage encoding exception", or something 
similar: if you're using the TextMessage API, you really oughta be 
disregarding any remaining indicators of how this UTF-16 string *used* 
to be encoded.
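
The Base64 arithmetic is easy to check.  The sketch below is in Python 
rather than Java purely for brevity; it models the receiving string as 
two bytes per character (UTF-16), and the function name 
base64_footprint is made up for the example.

```python
import base64
from typing import Tuple


def base64_footprint(n_payload_bytes: int) -> Tuple[int, int]:
    """Return (Base64 length in chars, size in bytes when held as UTF-16)."""
    payload = bytes(n_payload_bytes)  # n zero bytes stand in for real data
    encoded = base64.b64encode(payload)  # about n / 3 * 4 ASCII characters
    as_utf16 = encoded.decode("ascii").encode("utf-16-le")  # 2 bytes per char
    return len(encoded), len(as_utf16)


if __name__ == "__main__":
    chars, utf16_bytes = base64_footprint(3000)
    print(chars, utf16_bytes)
```

With a 3000-byte payload this gives 4000 Base64 characters and 8000 
bytes once held as UTF-16, which is the extra factor of two.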

Amy!
-- 
Amelia A. Lewis
Senior Architect
TIBCO/Extensibility, Inc.
alewis@tibco.com

Received on Tuesday, 3 February 2009 18:26:11 UTC