RE: Supporting non-XML data in XProc

From: Toman, Vojtech <vojtech.toman@emc.com>
Date: Mon, 17 Sep 2012 04:14:40 -0400
To: "public-xml-processing-model-wg@w3.org" <public-xml-processing-model-wg@w3.org>
Message-ID: <F3C7EBECE80AC346BE4D1C5A9BB4A41F2EE71133A2@MX11A.corp.emc.com>
> -----Original Message-----
> From: Norman Walsh [mailto:ndw@nwalsh.com]
> Sent: Friday, September 14, 2012 8:26 PM
> To: public-xml-processing-model-wg@w3.org
> Subject: Re: Supporting non-XML data in XProc
> "Toman, Vojtech" <vojtech.toman@emc.com> writes:
> > 4. To allow for non-XML support in XPath, we will introduce a number of XDM
> >    extensions:
> Do we have to make them extensions, or can this be an implementation
> detail?
> I can imagine, for example, having my own DocumentSuperNode type that
> is passed between steps. It wraps either an XdmNode in the case of XML
> or a BinaryNode in the case of non-XML.

It depends on how formal we want to be. I wanted XPath to work transparently on binary nodes, so I modeled binary data based on XDM. But maybe I went too far and it can be formulated more simply.
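To make the "implementation detail" alternative concrete, here is a minimal Python sketch of such a wrapper. The names DocumentSuperNode and BinaryNode come from Norm's message; the fields and properties are my assumptions, and ElementTree merely stands in for a real XdmNode.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class BinaryNode:
    """Hypothetical non-XML payload: raw octets plus their media type."""
    octets: bytes
    content_type: str


@dataclass
class DocumentSuperNode:
    """Hypothetical wrapper passed between steps: an XML tree or a binary payload."""
    payload: Union[ET.Element, BinaryNode]
    base_uri: Optional[str] = None

    @property
    def is_xml(self) -> bool:
        return isinstance(self.payload, ET.Element)

    @property
    def content_type(self) -> str:
        # Assumption: every XML payload is reported as application/xml.
        return "application/xml" if self.is_xml else self.payload.content_type
```

With this shape, steps only look inside the wrapper when they need to, and nothing has to be added to XDM itself.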

> >    An important aspect of the above algorithm is that it applies not only to the
> >    input ports, but also to the output ports: before the data appears on an
> >    output port, it is converted to the appropriate media type.
> Am I right that this is only an issue for compound steps? If I write an
> (atomic) extension step that asserts it produces application/xml and at
> runtime it actually produces image/jpeg, is that an "error"
> that the XProc processor is supposed to detect and correct?

"Yes" to your first question, "I don't know" to the second one. In my current implementation, you would get an error if the media type doesn't match.

We could also get rid of the content-type attribute on p:output altogether and say that the steps produce what they produce, period.
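As a rough illustration of the "you get an error" behaviour, the check could look like the Python sketch below. This is not my actual implementation; MediaTypeError, essence and check_output are hypothetical names, and ignoring media type parameters (such as charset) is an assumption.

```python
class MediaTypeError(Exception):
    """Hypothetical dynamic error: a step produced an unexpected media type."""


def essence(media_type: str) -> str:
    """Return the lowercased type/subtype, dropping parameters such as charset."""
    return media_type.split(";", 1)[0].strip().lower()


def check_output(declared: str, produced: str) -> str:
    """Raise if the produced media type does not match the declared one."""
    if essence(declared) != essence(produced):
        raise MediaTypeError(
            f"step declared {declared} but produced {produced}"
        )
    return produced
```

Dropping the content-type attribute on p:output would amount to removing this check entirely.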

> >    The media type conversion applies only to the p:input and p:output
> >    elements. It does not take place when the XProc processor processes the
> >    p:with-option, p:with-param, and p:variable elements, nor the
> >    p:xpath-context, p:iteration-source, and p:viewport-source elements. It also
> >    does not apply when the XProc processor evaluates the test expressions of
> >    p:choose/p:when elements. In these cases, the XPath expressions use the
> >    original data as the context item.
> How tricky is that? From an XPath expression, is a binary document just
> an empty document node? Does "/foo" return false, "count(//foo)"
> return 0, etc.? What does string-length() return?

This is covered by: "All other accessors defined on XDM nodes return the empty sequence for the binary node."
The only accessors that return anything other than an empty sequence are dm:base-uri and the proposed xpdm:content-type.

In other words, the only useful operations on binary nodes are querying the base URI and the content type. I could also imagine exposing the raw octet stream (as a sequence of xs:integer or something like that) through some kind of xpdm:octet-stream accessor (and accompanying XPath functions).
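A small Python sketch of these accessor rules, with a list standing in for an XDM sequence. The BinaryNode class and its accessor() method are hypothetical, and xpdm:octet-stream is only the idea floated above, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class BinaryNode:
    """Hypothetical model of a binary node; not an actual XDM node kind."""
    octets: bytes
    content_type: str
    base_uri: str

    def accessor(self, name: str) -> list:
        """Evaluate an accessor by name; a list stands in for an XDM sequence."""
        if name == "dm:base-uri":
            return [self.base_uri]
        if name == "xpdm:content-type":
            return [self.content_type]
        if name == "xpdm:octet-stream":
            # Speculative accessor: the octets as a sequence of integers.
            return list(self.octets)
        # All other accessors return the empty sequence for a binary node.
        return []
```

For example, asking a PNG node for dm:string-value or dm:children yields an empty list, while dm:base-uri yields a one-item list.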

> >    The kinds of mappings between different media types the XProc processor
> >    supports is left implementation-defined.
> >
> >    [[Note: I think that to make the "shimming" feature interoperable and
> >    actually useful at all, it should not be left too implementation-defined. I
> >    think we would have to define a bunch of mappings between common media types
> >    that the users can rely on. The downside of this is that this might be quite
> >    hard and it might shift the focus of this specification into a whole
> >    different direction.
> How many different mappings does your implementation support today?

Currently only application/xquery -> application/xml (by wrapping the query string in a c:query element).
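In Python terms, that wrapping is roughly the following. This is a sketch rather than my implementation: xquery_to_xml is a hypothetical name, and I am assuming the c prefix is bound to the standard XProc step namespace.

```python
import xml.etree.ElementTree as ET

# The XProc step namespace, conventionally bound to the "c" prefix.
C_NS = "http://www.w3.org/ns/xproc-step"


def xquery_to_xml(query: str) -> ET.Element:
    """Wrap the text of an XQuery module in a c:query element."""
    ET.register_namespace("c", C_NS)
    element = ET.Element(f"{{{C_NS}}}query")
    # The query becomes plain character data; markup characters in it
    # are escaped when the element is serialized.
    element.text = query
    return element


wrapped = xquery_to_xml("for $x in //a return <b/>")
serialized = ET.tostring(wrapped, encoding="unicode")
```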

For my XML Prague demo, I also supported application/json -> application/xml (one way), using what JSONLib does by default:

{"prop": "value"}

becomes XML:
  <prop type="string">value</prop>

{"prop1": [{"prop2": "value"}]}

becomes XML:
  <prop1 class="array">
    <e class="object">
      <prop2 type="string">value</prop2>
    </e>
  </prop1>

{"prop1": "value",
 "prop2": 100,
 "prop3": false,
 "prop4": null}

becomes XML:
  <e class="object">
    <prop1 type="string">value</prop1>
    <prop2 type="number">100</prop2>
    <prop3 type="boolean">false</prop3>
    <prop4 class="object" null="true"/>
  </e>

etc. But this is exactly where I feel a bit uncomfortable. On the one hand, JSON->XML (and XML->JSON) is probably what we need most; on the other hand, it might be hard to agree on a specific mapping that would work for most users and use cases.
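To show how little it takes to pin down one such mapping (and how many arbitrary choices it involves), here is a Python sketch that follows the conventions visible in the examples above. It is not JSONLib itself; to_element is a hypothetical helper, and it assumes that every JSON property name is a valid XML element name, which real data will not guarantee.

```python
import json
import xml.etree.ElementTree as ET


def to_element(name: str, value) -> ET.Element:
    """Map one JSON value to an element, following the examples in this message."""
    element = ET.Element(name)
    if value is None:
        element.set("class", "object")
        element.set("null", "true")
    elif isinstance(value, bool):
        # Must be tested before numbers: bool is a subclass of int in Python.
        element.set("type", "boolean")
        element.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        element.set("type", "number")
        element.text = json.dumps(value)
    elif isinstance(value, str):
        element.set("type", "string")
        element.text = value
    elif isinstance(value, list):
        element.set("class", "array")
        for item in value:
            element.append(to_element("e", item))
    elif isinstance(value, dict):
        element.set("class", "object")
        for key, item in value.items():
            element.append(to_element(key, item))
    else:
        raise TypeError(f"not a JSON value: {value!r}")
    return element


root = to_element("e", json.loads('{"prop1": "value", "prop2": 100, "prop3": false, "prop4": null}'))
```

Every branch here is a decision somebody could reasonably make differently (attribute names, the e element, how null is represented), which is the interoperability worry in a nutshell.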

I also supported image/* -> image/*, but that was more for fun and probably went too far. See my XML Prague paper for some examples.

> I wonder if we can mitigate the interop problem by having a mechanism
> for the pipeline to declare what mappings it needs? At least then a
> processor can reject a pipeline statically with a reasonable error
> message: "error: pipeline requires unsupported image/png to text/plain
> conversion."

Yes, I was thinking about that too. That way you would get at least some control over what happens behind the scenes. We could represent the mappings by QNames, using a mechanism similar to "method" in XML serialization. The difficulty is that such a name has to capture not only the source and target media types, but also exactly what the mapping looks like.
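A sketch of the static check in Python, under the simplifying assumption that a mapping is identified only by its (source, target) media type pair. That ignores the hard part, namely the exact shape of the mapping. The names check_declared_mappings and UnsupportedMappingError are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumption: a mapping is identified by its (source, target) media types only.
Mapping = Tuple[str, str]


class UnsupportedMappingError(Exception):
    """Hypothetical static error for a mapping the processor cannot perform."""


def check_declared_mappings(required: Iterable[Mapping], supported: Set[Mapping]) -> None:
    """Reject a pipeline before it runs if it declares an unsupported mapping."""
    for source, target in required:
        if (source, target) not in supported:
            raise UnsupportedMappingError(
                f"pipeline requires unsupported {source} to {target} conversion"
            )


# Illustrative set of supported mappings; any real processor's set would differ.
SUPPORTED = {
    ("application/xquery", "application/xml"),
    ("application/json", "application/xml"),
}
```

Calling check_declared_mappings([("image/png", "text/plain")], SUPPORTED) fails with the kind of message Norm suggests, before any step has run.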

> >    A radical approach might be not to support shimming at all and simply say
> >    that if data of incompatible media type arrives, you get a dynamic
> >    error. Conversion between different media types can be left to
> >    special-purpose custom (or standardized?) steps.]]
> That seems less user-friendly.

It indeed does, but some low-hanging fruit hangs lower than others.


Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
Received on Monday, 17 September 2012 08:15:26 UTC