Charsets, encodings, http-request, unescape-markup, and convenience, oh my! from Norman Walsh on 2011-10-06 (xproc-dev@w3.org from October 2011)

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 06 Oct 2011 15:44:11 -0400
To: XProc Dev <xproc-dev@w3.org>
Message-ID: <m2wrch52ic.fsf@nwalsh.com>
Hello world,

Consider the following pipeline:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
<p:output port="result"/>

<p:http-request>
  <p:input port="source">
    <p:inline>
      <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
    </p:inline>
  </p:input>
</p:http-request>

</p:declare-step>

It returns a base64-encoded document:

<c:body xmlns:c="http://www.w3.org/ns/xproc-step"
content-type="application/octet-stream"
encoding="base64">PCFET0NUWVBFIGh0bWw+CjxodG1sIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hodG1s
Ij4KPGhlYWQ+Cjx0aXRsZT5QYWdlIFRpdGxlPC90aXRsZT4KPC9oZWFkPgo8Ym9keT4KPHA+UGFn
ZSBjb250ZW50LjwvcD4KPC9ib2R5Pgo8L2h0bWw+Cg==
</c:body>
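
For reference, here is a minimal Python sketch (outside XProc entirely) that decodes the c:body content shown above. It assumes the payload is UTF-8, which the file name html-utf8.data suggests but the c:body element itself never states.

```python
import base64

# Base64 text copied from the c:body element above, one string per line.
BODY_BASE64 = (
    "PCFET0NUWVBFIGh0bWw+CjxodG1sIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hodG1s"
    "Ij4KPGhlYWQ+Cjx0aXRsZT5QYWdlIFRpdGxlPC90aXRsZT4KPC9oZWFkPgo8Ym9keT4KPHA+UGFn"
    "ZSBjb250ZW50LjwvcD4KPC9ib2R5Pgo8L2h0bWw+Cg=="
)

# Step 1: undo the base64 transfer encoding to recover the raw bytes.
raw_bytes = base64.b64decode(BODY_BASE64)

# Step 2: turn bytes into characters.
# Assumption: UTF-8, guessed from the file name; c:body carries no charset.
decoded_text = raw_bytes.decode("utf-8")

print(decoded_text)
```

The result is a small HTML page with the title "Page Title" and a single paragraph reading "Page content."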

Suppose I amend the pipeline as follows:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
<p:output port="result"/>

<p:http-request>
  <p:input port="source">
    <p:inline>
      <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
    </p:inline>
  </p:input>
</p:http-request>

<p:wrap wrapper="c:request" match="/"/>
<p:add-attribute match="/c:request" attribute-name="href"
                 attribute-value="http://validator.nu/?out=xml"/>
<p:add-attribute match="/c:request" attribute-name="method" attribute-value="post"/>

<p:http-request/>

</p:declare-step>

What should happen?

I think the answer is that the body should be decoded (the base64
encoding undone) before it's sent to the server. That's not what
XML Calabash (0.9.36) does, but I think that's a bug. Agreed?
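
The behaviour argued for here can be sketched in Python. This is only an illustration: body_to_payload is a hypothetical helper, not XML Calabash code, and the UTF-8 fallback for bodies without encoding="base64" is an assumption.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Tuple

C_NS = "http://www.w3.org/ns/xproc-step"


def body_to_payload(c_body_xml: str) -> Tuple[str, bytes]:
    """Return (content_type, raw_bytes) for a serialized c:body element.

    Hypothetical helper: it shows the bytes a POST would carry if the
    base64 encoding were undone before the request is sent.
    """
    body = ET.fromstring(c_body_xml)
    if body.tag != "{%s}body" % C_NS:
        raise ValueError("expected a c:body element")

    content_type = body.get("content-type", "application/octet-stream")
    text = body.text or ""

    if body.get("encoding") == "base64":
        # Undo the transfer encoding so the server sees the original bytes.
        payload = base64.b64decode(text)
    else:
        # Assumption: an unencoded text body is sent as UTF-8.
        payload = text.encode("utf-8")
    return content_type, payload


example = (
    '<c:body xmlns:c="http://www.w3.org/ns/xproc-step" '
    'content-type="application/octet-stream" '
    'encoding="base64">PHA+SGk8L3A+</c:body>'
)
print(body_to_payload(example))
```

With the example input, the payload is the markup itself, not its base64 text.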

Does it strike you as odd that there's no charset attribute on c:body?

Now consider this pipeline:

<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
<p:output port="result"/>

<p:http-request>
  <p:input port="source">
    <p:inline>
      <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
    </p:inline>
  </p:input>
</p:http-request>

<p:unescape-markup/>

</p:declare-step>

What do you think it produces?

XML Calabash produces a copy of the input. It doesn't decode the data
because we didn't tell the step the encoding:

<p:unescape-markup encoding="base64"/>

Now you think it's going to do the right thing, but it doesn't because we
didn't specify a charset. This is a damn shame because I haven't a clue what
the charset is. But let's muddle on.

<p:unescape-markup encoding="base64" charset="utf-8"/>

This fails too because it tries to use an XML parser. I'm not sure if
that's a bug or not. I think I could try an HTML parser for
application/octet-stream and still be conformant.

Finally, this works:

<p:unescape-markup content-type="text/html" encoding="base64" charset="utf-8"/>

But it sure seems like it's making me work awfully hard. Especially if
you consider that I'd need a p:choose to decide whether or not to
specify the encoding attribute at all, as encoding="" would not work.

I think...

1. c:body should be allowed to have a charset attribute
2. If the charset isn't known/specified, we default to...ISO Latin 1
   (ISO-8859-1), or whatever the Internet tells us the default is for
   text/* documents that don't specify a charset.
3. If the input to p:unescape-markup is a c:body element then we should use the
   content-type, encoding, and charset attributes from that element if they aren't
   specified on the step.
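
The fallback described in items 2 and 3 can be sketched in Python. Everything here is illustrative: resolve_options is a hypothetical name, ISO-8859-1 as the default is taken from the tentative suggestion in item 2, and applying that default after the step options have been consulted is an assumption.

```python
from typing import Dict, Optional

# Assumption: the default suggested in item 2, not a settled decision.
DEFAULT_CHARSET = "iso-8859-1"


def resolve_options(
    step_options: Dict[str, str],
    body_attrs: Optional[Dict[str, str]],
) -> Dict[str, Optional[str]]:
    """Merge p:unescape-markup options with c:body attributes.

    An option given on the step wins; otherwise the attribute of the same
    name on the input c:body element (if there is one) is used.
    """
    body_attrs = body_attrs or {}
    resolved: Dict[str, Optional[str]] = {}
    for name in ("content-type", "encoding", "charset"):
        resolved[name] = step_options.get(name, body_attrs.get(name))

    # Neither the step nor c:body named a charset: fall back to the default.
    if resolved["charset"] is None:
        resolved["charset"] = DEFAULT_CHARSET
    return resolved


# A bare <p:unescape-markup/> applied to a c:body like the one above.
print(resolve_options({}, {"content-type": "application/octet-stream",
                           "encoding": "base64"}))
```

Under this scheme the step would pick up encoding="base64" from c:body without being told, and only the content type and charset would still need overriding.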

Thoughts?

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
Received on Thursday, 6 October 2011 19:44:42 UTC