RE: Charsets, encodings, http-request, unescape-markup, and convenience, oh my! from vojtech.toman@emc.com on 2011-10-10 (public-xml-processing-model-comments@w3.org from October 2011)

From: <vojtech.toman@emc.com>
Date: Mon, 10 Oct 2011 04:20:04 -0400
To: <public-xml-processing-model-comments@w3.org>
Message-ID: <3799D0FD120AD940B731A37E36DAF3FE33DAE9F774@MX20A.corp.emc.com>
> I think the answer is that the body should be unencoded before it's
> sent to the server. That's not what XML Calabash (0.9.36) does, but I
> think that's a bug. Agreed?

Calumet used to base64-decode the body before sending it to the server, but for some reason (which I don't remember anymore), I disabled this some time ago. Having said that, when I look at the spec (7.1.10.2 Request entity body conversion), it states the following about c:body:

"The encoding attribute controls the decoding of the element content for formulating the body. A value of base64 indicates the element's content is a base64 encoded string whose byte stream should be sent as the message body."

I must admit that I don't know how to read it now (and the comments in the Calumet source code suggest I was never really sure). Does it say that you are supposed to take the base64-encoded string "as is" and send its byte sequence over the wire? Or does it say that you take the base64-encoded string, decode it first, and send the bytes of the decoded result? Or... does it matter at all, as long as the Content-Transfer-Encoding header is set correctly?

Not decoding the data and relying on the Content-Transfer-Encoding header assumes that the server can handle this.
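
To make the two readings concrete, here is a minimal Python sketch of which bytes would go over the wire in each case. The sample content is made up for illustration, and the sketch does not claim to show what Calumet or XML Calabash actually do.

```python
import base64

# Hypothetical c:body content: the base64 encoding of the bytes b"hello".
element_content = "aGVsbG8="

# Reading 1: send the base64 text itself, byte for byte, and rely on the
# Content-Transfer-Encoding header to tell the server how to interpret it.
as_is = element_content.encode("ascii")

# Reading 2: decode the base64 text first and send the resulting bytes.
decoded = base64.b64decode(element_content)

print(as_is)    # the 8 ASCII bytes of the base64 text
print(decoded)  # the 5 original bytes
```

The two message bodies differ in both content and length, so the choice is visible to the server unless it honors Content-Transfer-Encoding.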

> 
> Does it strike you as odd that there's no charset attribute on c:body?
> 

In retrospect, yes. I think the mistake we made is that we did not keep c:body and c:data compatible: wherever you can use c:data (as produced by p:data), c:body should work as well.

I believe that in places where you cannot use the "charset" attribute, you can still include the charset information in the content type, but that can make things rather tedious.
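
As an illustration of carrying the charset inside the content type, the following Python sketch pulls the charset parameter out of a content-type string using the standard library's email package. The helper name charset_from_content_type is hypothetical, and this is not how any XProc processor is known to do it.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a content type, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

Every consumer of the content type has to do this kind of parsing, which is part of what makes the approach tedious compared to a dedicated attribute.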

> Finally, this works:
> 
> <p:unescape-markup content-type="text/html" encoding="base64"
> charset="utf-8"/>
> 
> But it sure seems like it's making me work awfully hard. Especially if
> you consider that I'd need a choose to select the encoding or not
> attribute as encoding="" would not work.

Yes, but this is not unexpected given the specification of p:unescape-markup.

> 
> I think...
> 
> 1. c:body should be allowed to have a charset parameter

I think so, too, similar to c:data. However, that would require us to say something about the possible duplication of charset information in the content type and the "charset" attribute, both for the request and the response side.


> 2. If the charset parameter isn't known/specified, we default to...ISO Latin 1, or
>    whatever the Internet tells us the default is for text/* documents that don't
>    specify a charset.

I think so. You already get this behavior when you read text data, except that the applied default charset is not available anywhere in the constructed c:body.
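
A small Python sketch of what "apply a default and make it visible" could look like. The ISO-8859-1 default is only the assumption floated in the quoted proposal, and decode_text is a hypothetical helper, not part of any processor.

```python
# Assumed default, taken from the "ISO Latin 1" suggestion quoted above.
DEFAULT_TEXT_CHARSET = "iso-8859-1"


def decode_text(raw, charset=None):
    """Decode raw bytes and report which charset was actually applied."""
    applied = charset or DEFAULT_TEXT_CHARSET
    return raw.decode(applied), applied


text, applied = decode_text(b"caf\xe9")
print(text, applied)
```

Returning the applied charset alongside the text is the piece that is missing today: the caller could record it on the constructed element instead of losing it.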

> 3. If the input to p:unescape-markup is a c:body element then we should use the
>    content-type, encoding, and charset attributes from that element if they aren't
>    specified on the step.

I wouldn't make a special distinction between c:body and c:data. I think that ideally, the two should be interchangeable. In fact, I now think it would make most sense if we dropped one of the names and used only one of them everywhere...

(Note that with p:xquery, you have a similar problem if your query comes from an HTTP request and is thus wrapped in c:body. Out of the box, p:xquery cannot handle it directly, whereas if it were c:data, it could.)

Regarding p:unescape-markup, we could do something similar to what, for example, p:xquery does. If the input is a c:data (c:body) element, we use its content-type, encoding, and charset attributes. If any of them is missing and a value is needed, we fall back to the corresponding option on the step.

Similarly, if the input is not c:data (c:body), we can look for the c:content-type, c:encoding, and c:charset attributes.
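
The lookup order described in the last two paragraphs can be sketched in Python as follows. This is only an illustration of the proposal: the function name resolve and the dictionary of step options are hypothetical, and the namespace URI is assumed to be the one bound to the c: prefix in XProc.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the c: prefix.
C_NS = "http://www.w3.org/ns/xproc-step"
NAMES = ("content-type", "encoding", "charset")


def resolve(element, step_options):
    """Take each value from the input element, else from the step options."""
    is_wrapper = element.tag in (f"{{{C_NS}}}data", f"{{{C_NS}}}body")
    resolved = {}
    for name in NAMES:
        # c:data / c:body carry plain attributes; any other element is
        # checked for the namespaced c:content-type, c:encoding, c:charset.
        key = name if is_wrapper else f"{{{C_NS}}}{name}"
        value = element.get(key)
        if value is None:
            value = step_options.get(name)
        resolved[name] = value
    return resolved


body = ET.fromstring(
    '<c:body xmlns:c="http://www.w3.org/ns/xproc-step" '
    'content-type="text/html" encoding="base64">PHA+aGk8L3A+</c:body>'
)
print(resolve(body, {"content-type": "text/plain", "charset": "utf-8"}))
```

Here the element's own content-type and encoding win, and only the missing charset is taken from the step options.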

To be really safe, we could add an option to p:unescape-markup to enable or disable this behavior.


Regards,
Vojtech

--
Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
vojtech.toman@emc.com
http://developer.emc.com/xmltech
Received on Monday, 10 October 2011 08:20:50 UTC