- From: Norman Walsh <ndw@nwalsh.com>
- Date: Thu, 06 Oct 2011 15:45:36 -0400
- To: public-xml-processing-model-comments@w3.org
- Message-ID: <m27h4h52fz.fsf@nwalsh.com>
Hello world,
Consider the following pipeline:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
  <p:output port="result"/>
  <p:http-request>
    <p:input port="source">
      <p:inline>
        <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
      </p:inline>
    </p:input>
  </p:http-request>
</p:declare-step>
It returns a base64-encoded document:
<c:body xmlns:c="http://www.w3.org/ns/xproc-step"
content-type="application/octet-stream"
encoding="base64">PCFET0NUWVBFIGh0bWw+CjxodG1sIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hodG1s
Ij4KPGhlYWQ+Cjx0aXRsZT5QYWdlIFRpdGxlPC90aXRsZT4KPC9oZWFkPgo8Ym9keT4KPHA+UGFn
ZSBjb250ZW50LjwvcD4KPC9ib2R5Pgo8L2h0bWw+Cg==
</c:body>
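For reference, the payload above can be decoded outside XProc to see what the server actually sent. This is a minimal Python sketch using only the standard library. The variable names are illustrative, and UTF-8 is assumed purely so the bytes can be printed, since the c:body element itself declares no charset.

```python
import base64

# The text content of the c:body element shown above, copied verbatim.
payload = (
    "PCFET0NUWVBFIGh0bWw+CjxodG1sIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5L3hodG1s"
    "Ij4KPGhlYWQ+Cjx0aXRsZT5QYWdlIFRpdGxlPC90aXRsZT4KPC9oZWFkPgo8Ym9keT4KPHA+UGFn"
    "ZSBjb250ZW50LjwvcD4KPC9ib2R5Pgo8L2h0bWw+Cg=="
)

# Undo the base64 layer; the result is raw bytes with no charset attached.
raw = base64.b64decode(payload)

# Assumption: UTF-8, chosen here only so the bytes can be printed as text.
print(raw.decode("utf-8"))
```

The decoded bytes are a small HTML page, which is what the later steps in this message try to recover inside the pipeline.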
Suppose I amend the pipeline as follows:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
  <p:output port="result"/>
  <p:http-request>
    <p:input port="source">
      <p:inline>
        <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
      </p:inline>
    </p:input>
  </p:http-request>
  <p:wrap wrapper="c:request" match="/"/>
  <p:add-attribute match="/c:request" attribute-name="href"
                   attribute-value="http://validator.nu/?out=xml"/>
  <p:add-attribute match="/c:request" attribute-name="method" attribute-value="post"/>
  <p:http-request/>
</p:declare-step>
What should happen?
I think the answer is that the body should be base64-decoded before it's
sent to the server. That's not what XML Calabash (0.9.36) does, but I
think that's a bug. Agreed?
Does it strike you as odd that there's no charset attribute on c:body?
Now consider this pipeline:
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:l="http://xproc.org/library">
  <p:output port="result"/>
  <p:http-request>
    <p:input port="source">
      <p:inline>
        <c:request method="get" href="http://tests.xproc.org/tests/doc/html-utf8.data"/>
      </p:inline>
    </p:input>
  </p:http-request>
  <p:unescape-markup/>
</p:declare-step>
What do you think it produces?
XML Calabash produces a copy of the input. It doesn't decode the data
because we didn't tell the step the encoding:
<p:unescape-markup encoding="base64"/>
Now you think it's going to do the right thing, but it doesn't because we
didn't specify a charset. This is a damn shame because I haven't a clue what
the charset is. But let's muddle on.
<p:unescape-markup encoding="base64" charset="utf-8"/>
This fails too because it tries to use an XML parser. I'm not sure if
that's a bug or not. I think I could try an HTML parser for
application/octet-stream and still be conformant.
Finally, this works:
<p:unescape-markup content-type="text/html" encoding="base64" charset="utf-8"/>
But it sure seems like it's making me work awfully hard. Especially if
you consider that I'd need a p:choose to decide whether to include the
encoding attribute at all, since encoding="" would not work.
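The three options on the working step correspond to three separate decoding layers, and each one needs information the previous layer cannot supply. The Python sketch below illustrates those layers with the standard library only. The payload is a hypothetical one built inside the example, and the class name TagCollector is illustrative. It is not how XML Calabash implements the step.

```python
import base64
from html.parser import HTMLParser

# Hypothetical payload, built here so the example is self-contained.
original = "<html><body><p>caf\u00e9</p></body></html>"
wire_text = base64.b64encode(original.encode("utf-8")).decode("ascii")

# Layer 1 (encoding="base64"): text inside c:body -> raw bytes.
raw = base64.b64decode(wire_text)

# Layer 2 (charset="utf-8"): raw bytes -> characters.
text = raw.decode("utf-8")
# Guessing the wrong charset raises no error here; it silently yields different text.
wrong = raw.decode("iso-8859-1")


# Layer 3 (content-type="text/html"): characters -> markup, via an HTML parser.
class TagCollector(HTMLParser):
    """Records the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(text)
collector.close()

print(collector.tags)
print(text == original, wrong == original)
```

The wrong-charset case is the awkward one: nothing fails, the text is simply different, which is why a missing charset is hard to paper over after the fact.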
I think...
1. c:body should be allowed to have a charset attribute
2. If the charset parameter isn't known/specified, we default to...ISO Latin 1, or
whatever the Internet tells us the default is for text/* documents that don't
specify a charset.
3. If the input to p:unescape-markup is a c:body element then we should use the
content-type, encoding, and charset attributes from that element if they aren't
specified on the step.
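The fallback order in points 2 and 3 can be sketched as a small lookup. This is a Python illustration under stated assumptions, not spec text. The function name effective_options and the dictionary representation of options and attributes are hypothetical, and ISO-8859-1 stands in for whatever the default for text/* turns out to be.

```python
from typing import Dict, Optional

# Assumed fallback for text/* content with no declared charset (point 2).
DEFAULT_TEXT_CHARSET = "ISO-8859-1"


def effective_options(
    step: Dict[str, str], body: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Resolve content-type, encoding, and charset for p:unescape-markup.

    Options given on the step win; otherwise the attribute of the same
    name on the c:body input is used (point 3).
    """
    resolved: Dict[str, Optional[str]] = {}
    for name in ("content-type", "encoding", "charset"):
        value = step.get(name)
        if value is None:
            value = body.get(name)
        resolved[name] = value

    # Point 2: only text/* content gets a default charset.
    content_type = resolved["content-type"] or ""
    if resolved["charset"] is None and content_type.startswith("text/"):
        resolved["charset"] = DEFAULT_TEXT_CHARSET
    return resolved


# Attributes of the c:body element returned by the first pipeline.
body_attrs = {"content-type": "application/octet-stream", "encoding": "base64"}

print(effective_options({}, body_attrs))
print(effective_options({"content-type": "text/html"}, body_attrs))
```

With rules like these, the step would only need content-type="text/html" for the example in this message. The encoding would come from c:body and the charset from the default.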
Thoughts?
Be seeing you,
norm
--
Norman Walsh
Lead Engineer
MarkLogic Corporation
Phone: +1 413 624 6676
www.marklogic.com
Received on Thursday, 6 October 2011 19:46:06 UTC