Supporting non-XML data in XProc

Hi all,

See below for the first proposal for non-XML data support in XProc. It is, with
some additional remarks here and there, more or less what I described in my
XML Prague 2012 paper:

http://www.xmlprague.cz/2012/files/xmlprague-2012-proceedings.pdf

(The paper also includes a couple of sample pipelines.)

---

1. Both XML and non-XML data can flow through the pipeline. XML data flows as
   XML Infoset instances, and non-XML data as "raw" octet streams. I will refer
   to the union of XML and non-XML data simply as "data".

2. The data that flows through the pipeline is annotated with media type
   information (application/xml, image/svg+xml, text/plain, image/jpeg etc.).

3. XProc steps declare what media types they expect on their input ports and
   what media types they produce on their output ports. If data of an
   incompatible media type arrives on a port of a step, the XProc processor
   attempts to convert ("shim") the data to the appropriate media type using
   some well-defined algorithm - see below.

4. To allow for non-XML support in XPath, we will introduce a number of XDM
   extensions:

   - A new property on the document node: content-type, possibly empty

   - A new accessor defined on all kinds of XDM nodes:

     xpdm:content-type($n as node()) as xs:string?

     For the document node, the xpdm:content-type accessor returns the value of
     the content-type property. For the other types of nodes (element,
     attribute, text, namespace, processing instruction, and comment), it
     returns the value of the content-type property of the owner document.

   - A new binary data node to represent non-XML data. The binary data node has
     the following properties:

     - base-uri, possibly empty

     - content-type, possibly empty

     For the binary data node, the dm:base-uri accessor returns the value of the
     base-uri property, and the xpdm:content-type accessor returns the value of
     the content-type property. The dm:node-kind accessor returns the value
     "binary-data". All other accessors defined on XDM nodes return the empty
     sequence for the binary node.

     [[Note that it should be possible to expose the octet sequence of the
     binary data node by introducing a special property and an accessor
     (representing the octets for instance as a sequence of xs:unsignedByte or
     xs:integer). Not sure if we want to go that route or not.]]
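
   To make the accessor rules above concrete, here is a rough Python sketch.
   The Node class and its field names are hypothetical stand-ins for XDM nodes,
   not part of the proposal. Returning None for a node that has no owner
   document is my assumption; the text above does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an XDM node; only what the sketch needs."""
    kind: str                                 # "document", "element", "binary-data", ...
    owner: Optional["Node"] = None            # owner document, if any
    content_type_prop: Optional[str] = None   # meaningful on document and binary-data nodes
    base_uri_prop: Optional[str] = None       # the binary-data node's base-uri property


def content_type(n: Node) -> Optional[str]:
    """Sketch of the proposed xpdm:content-type accessor."""
    if n.kind in ("document", "binary-data"):
        # These node kinds carry the content-type property themselves.
        return n.content_type_prop
    # Element, attribute, text, etc.: defer to the owner document.
    # Assumption: a node without an owner document yields the empty sequence.
    return n.owner.content_type_prop if n.owner is not None else None


doc = Node("document", content_type_prop="image/svg+xml")
elem = Node("element", owner=doc)
blob = Node("binary-data", content_type_prop="image/jpeg")
```

   Here content_type(elem) reports the media type of its owner document, while
   content_type(blob) reports the binary data node's own property.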

5. Shimming

   While evaluating a pipeline, the XProc processor performs the following
   algorithm when data appears on a port of a step:

   - If the port media type is a wildcard or if the data media type is the same
     as the port media type, the data appears on the port with no modifications;
     otherwise

   - if the XProc processor knows how to map (see the discussion below) from the
     data media type to the port media type, the data is converted to the port
     media type; otherwise

   - the XProc processor performs one of the following fall-back actions:

     - If both the data and the port media types are XML media types, the data
       appears on the port with no modifications.

     - If the port media type is application/xml, the data is processed as if it
       were read via the standard XProc p:data binding with a c:data wrapper
       element.

     - If both the data and the port media types are text media types, the data
       appears on the port with no modifications.

     - Any other combination of the data and the port media types results in a
       dynamic error.

   [[Note: Some aspects of the above algorithm, especially the fall-back
   behavior, may be questionable. This definitely needs some discussion.]]
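
   The algorithm above could be sketched in Python roughly as follows. This is
   only an illustration under several assumptions of mine: data is modelled as
   plain bytes, MAPPINGS is a hypothetical registry for the
   implementation-defined conversions, "XML media type" is taken to mean
   application/xml, text/xml, or any +xml type, "text media type" is taken to
   mean text/*, the c:data wrapping always base64-encodes, and data that passes
   through unmodified keeps its original media type annotation.

```python
import base64
from typing import Callable, Dict, Tuple

Converter = Callable[[bytes], bytes]
# Hypothetical registry of implementation-defined mappings: (from, to) -> converter.
MAPPINGS: Dict[Tuple[str, str], Converter] = {}


class DynamicError(Exception):
    """Stand-in for an XProc dynamic error."""


def is_xml_type(mt: str) -> bool:
    # Assumption: what counts as an "XML media type".
    return mt in ("application/xml", "text/xml") or mt.endswith("+xml")


def is_text_type(mt: str) -> bool:
    # Assumption: what counts as a "text media type".
    return mt.startswith("text/")


def wrap_in_c_data(data: bytes, data_type: str) -> bytes:
    # Simplification: always base64-encode the payload inside c:data.
    payload = base64.b64encode(data).decode("ascii")
    return (
        '<c:data xmlns:c="http://www.w3.org/ns/xproc-step" '
        f'content-type="{data_type}" encoding="base64">{payload}</c:data>'
    ).encode("utf-8")


def shim(data: bytes, data_type: str, port_type: str) -> Tuple[bytes, str]:
    """Return (payload, media type) as the data appears on the port."""
    # Wildcard port or exact match: no modification.
    if port_type == "*" or data_type == port_type:
        return data, data_type
    # A known mapping: convert to the port media type.
    convert = MAPPINGS.get((data_type, port_type))
    if convert is not None:
        return convert(data), port_type
    # Fall-back actions, tried in the order listed above.
    if is_xml_type(data_type) and is_xml_type(port_type):
        return data, data_type
    if port_type == "application/xml":
        return wrap_in_c_data(data, data_type), port_type
    if is_text_type(data_type) and is_text_type(port_type):
        return data, data_type
    raise DynamicError(f"cannot shim {data_type} to {port_type}")
```

   For example, shim(b"<svg/>", "image/svg+xml", "application/xml") passes the
   data through untouched, whereas JPEG data arriving on an application/xml
   port ends up wrapped in c:data, and JPEG data arriving on an application/pdf
   port raises the dynamic error.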

   An important aspect of the above algorithm is that it applies not only to the
   input ports, but also to the output ports: before the data appears on an
   output port, it is converted to the appropriate media type. This leads to a
   number of interesting properties, especially in conjunction with compound
   steps - it is, for example, possible to create a p:for-each loop whose
   sub-pipeline produces data of all sorts of media types which are then
   "consolidated" into one media type as specified on the p:for-each's output
   port.

   The media type conversion applies only to the p:input and p:output
   elements. It does not take place when the XProc processor processes the
   p:with-option, p:with-param, and p:variable elements, nor the
   p:xpath-context, p:iteration-source, and p:viewport-source elements. It also
   does not apply when the XProc processor evaluates the test expressions of
   p:choose/p:when elements. In these cases, the XPath expressions use the
   original data as the context item.

   The kinds of mappings between different media types that the XProc processor
   supports are left implementation-defined.

   [[Note: I think that to make the "shimming" feature interoperable and
   actually useful at all, it should not be left too implementation-defined. I
   think we would have to define a bunch of mappings between common media types
   that the users can rely on. The downside of this is that this might be quite
   hard and it might shift the focus of this specification into a whole
   different direction.

   A radical approach might be not to support shimming at all and simply say
   that if data of incompatible media type arrives, you get a dynamic
   error. Conversion between different media types can be left to
   special-purpose custom (or standardized?) steps.]]

6. Modifications to the XProc language

   - p:input, p:output

     Media type annotations can be added to input and output port declarations
     using the "content-type" attribute. The value of the content-type attribute
     is either an exact media type string (such as application/xml) or a
     wildcard, represented by the "*" character. If the content-type attribute
     is not specified on a port declaration, the media type application/xml is
     assumed.

     The declaration below declares a step that accepts XML data on the source
     input port and that produces PDF output on the result output port:

     <p:declare-step>
       <p:input port="source" content-type="application/xml"/>
       <p:output port="result" content-type="application/pdf"/>
       ...
     </p:declare-step>

     The following example declares a step that can process and produce data of
     any media type:

     <p:declare-step>
       <p:input port="source" content-type="*"/>
       <p:output port="result" content-type="*"/>
       ...
     </p:declare-step>

     The content-type attribute cannot be used on parameter input ports; the media
     type of parameter input ports is always application/xml.
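
     The defaulting rules for port declarations might be summarized by the
     small Python sketch below. The function name and the use of ValueError
     are hypothetical; the proposal does not say which error is raised when
     content-type appears on a parameter input port.

```python
from typing import Optional


def declared_media_type(content_type_attr: Optional[str], kind: str = "document") -> str:
    """Effective media type of a port declaration (illustrative sketch)."""
    if kind == "parameter":
        # Parameter input ports are always application/xml and may not
        # carry a content-type attribute.
        if content_type_attr is not None:
            raise ValueError("content-type is not allowed on parameter input ports")
        return "application/xml"
    # An exact media type or the "*" wildcard; application/xml when omitted.
    return content_type_attr if content_type_attr is not None else "application/xml"
```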

   - p:data

     For the purpose of better supporting non-XML media types, the p:data
     binding can return raw, not wrapped, data. [[Note: Possibly the biggest
     breaking change to the language?]]

     Where previously the p:data binding always encoded and wrapped the resource
     referred to via the href attribute (the wrapper being either a
     custom-specified element or the default c:data element), the modified
     p:data only encodes and wraps the resource when the pipeline author
     requests an explicit wrapper using the wrapper attribute. If no wrapper
     element is specified, p:data returns the resource "as is".

     The semantics of the content-type attribute of p:data remains the same: if
     the resource comes with a media type annotation, that one must be used,
     otherwise the media type specified in the content-type attribute should be
     assumed. If no media type information can be associated with the resource,
     the media type application/octet-stream is assumed.
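
     As a rough Python sketch of the modified p:data behavior (the function
     and parameter names are hypothetical): the media type is resolved in the
     order described above, and wrapping happens only on explicit request.

```python
from typing import Optional


def resolve_media_type(resource_type: Optional[str],
                       content_type_attr: Optional[str]) -> str:
    """Media type that p:data associates with the resource."""
    if resource_type:
        # A media type that comes with the resource always wins.
        return resource_type
    if content_type_attr:
        # Otherwise fall back to the content-type attribute of p:data.
        return content_type_attr
    # No media type information at all.
    return "application/octet-stream"


def wraps_resource(wrapper_attr: Optional[str]) -> bool:
    """The modified p:data encodes and wraps only if a wrapper is requested."""
    return wrapper_attr is not None
```

     So a p:data binding without a wrapper attribute returns the resource
     "as is", annotated with whatever resolve_media_type yields.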

   - p:pipeline

     The p:pipeline shortcut supports input and output of any media type by
     default. It is equivalent to the following p:declare-step:

     <p:declare-step>
       <p:input port="source" primary="true" sequence="false" content-type="*"/>
       <p:input port="parameters" primary="true" kind="parameter"/>
       <p:output port="result" primary="true" sequence="false" content-type="*"/>
     </p:declare-step>

     This makes it possible to use p:pipeline to process both XML and non-XML
     data easily.

   - p:for-each

     The p:for-each step can be used to process data of any media type. The
     "current" implicit input port supports data of any media type. If the
     p:for-each step contains explicit p:output declarations, then, inside of
     the p:for-each, these output ports accept any media type regardless of the
     value of the content-type attribute. On the outside of p:for-each, however,
     the data appearing on the output port gets converted to the appropriate
     media type. By default, the implicit output port of p:for-each supports any
     media type.

   - p:choose

     The p:choose step can process data of any media type. The p:when branches
     must declare the same number of output ports with the same names -
     however, these output ports may specify different media types. By default,
     the implicit output ports of p:when (and p:choose) support any media type.

   - p:group

     The p:group wrapper can be used to process data of any media type. By
     default, the implicit output port of p:group supports any media type.

   - p:try

     The p:try step can be used to process data of any media type. The error
     input port of p:catch (the XML representation of the dynamic error) accepts
     data of the media type application/xml. The p:group and p:catch
     sub-pipelines of p:try must declare the same number of output ports with
     the same names, but the ports may declare different media types. By
     default, the implicit output port of p:catch supports any media type.

   - p:data, p:document, p:inline, p:pipe

     The p:data, p:document, p:inline, and p:pipe bindings can specify an
     optional attribute "as-content-type" - for more details, see "Overriding
     media type information."

7. Modifications to the XProc standard step library

   Most atomic steps from the standard XProc step library are too XML-specific
   (for instance p:xinclude or p:validate-with-xml-schema) to be easily
   applicable to non-XML data. Having said that, however, there is a small
   number of steps that can be - in some cases to great benefit - adapted for
   non-XML data processing. This requires both modifying the implementations of
   the steps and changing their declarations in the standard step library by
   adding more relaxed media type annotations to their input and output
   ports. The list below summarizes the changes in more detail:

   - p:count

     The p:count step can be used to process any media type. The output format
     of p:count (a c:result document with the count) remains unchanged from the
     specification.

   - p:http-request

     The p:http-request step can produce output of any media type. Much like
     the p:data binding, the p:http-request step presents non-XML responses
     in their raw form, neither base64-encoding nor wrapping them. If the
     media type of the response data cannot be determined,
     application/octet-stream is assumed.

     The "detailed" response mode of p:http-request remains unchanged from the
     original specification, and so does handling of multipart responses.

   - p:identity

     The p:identity step can be used to process any media type.

   - p:sink

     The p:sink step can be used to process any media type.

   - p:split-sequence

     The p:split-sequence step can be used to process any media type.

     [[Note: It is possible, for instance, to split input data based on the
     media type information.]]

   - p:store

     The p:store step can be used to store data of any media type. The XML
     serialization options are applied only for data that has an XML media type.

   - p:exec

     The p:exec step has two new optional options: result-content-type and
     errors-content-type. If specified, the options determine the media type of
     the standard output and standard error output, respectively, of the
     command.

     If the data that appears on the source input port is not XML, it is passed
     in its raw form to the command as its standard input (the "source-is-xml"
     option is ignored in this case). XML input data is processed in the usual
     way.

     It is a dynamic error (err:XC0035) if both result-is-xml and
     wrap-result-lines are true.

     It is a dynamic error (err:XC????) if result-is-xml is true and
     result-content-type is specified and is not an XML media type, or if
     result-is-xml is false and result-content-type is specified and is an XML
     media type.

     It is a dynamic error (err:XC????) if result-is-xml is false,
     result-content-type is specified, and wrap-result-lines is true.

     If result-is-xml is true, the standard output of the program is assumed to
     be XML and will be parsed as a single document. If it is false and
     result-content-type is not specified, the output is assumed not to be XML
     and will be returned as escaped text wrapped in a c:result element. If
     result-is-xml is false and result-content-type is specified, the raw data
     will be returned "as is."

     If result-content-type is specified, the result will be annotated with the
     specified media type.

     If wrap-result-lines is true, a c:line element will be wrapped around each
     line of output.

     The same rules apply to the standard error output of the program, with the
     errors-is-xml, errors-content-type, and wrap-error-lines options,
     respectively.

     [[Note: This feels too complicated, perhaps we can simplify it.]]
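
     To check that the rules above are at least consistent, here is a Python
     sketch of how they combine for the standard output. The function name
     and the returned labels are hypothetical, the XML media type test is my
     assumption, and err:XC???? is kept as the placeholder used above.

```python
from typing import Optional


class DynamicError(Exception):
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def is_xml_type(mt: str) -> bool:
    # Assumption: what counts as an "XML media type".
    return mt in ("application/xml", "text/xml") or mt.endswith("+xml")


def result_form(result_is_xml: bool, wrap_result_lines: bool,
                result_content_type: Optional[str]) -> str:
    """Return a label describing how the standard output is presented."""
    if result_is_xml and wrap_result_lines:
        raise DynamicError("err:XC0035", "result-is-xml with wrap-result-lines")
    if result_content_type is not None:
        # result-is-xml and result-content-type must agree.
        if result_is_xml != is_xml_type(result_content_type):
            raise DynamicError("err:XC????", "result-is-xml contradicts result-content-type")
        # Raw (non-XML) output cannot be wrapped line by line.
        if wrap_result_lines:
            raise DynamicError("err:XC????", "wrap-result-lines with raw output")
    if result_is_xml:
        return "xml-document"      # parsed as a single document
    if result_content_type is not None:
        return "raw"               # returned "as is" with the given media type
    if wrap_result_lines:
        return "wrapped-lines"     # c:line around each line of output
    return "escaped-text"          # escaped text in c:result
```

     The same function would apply to the standard error output with the
     errors-is-xml, wrap-error-lines, and errors-content-type values.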

   - p:xquery

     If the media type of the data that appears on the query input port is
     application/xquery, the data is passed to the query engine "as
     is". (Conceptually, this can be seen as implicit wrapping in a c:data
     element with a content-type attribute set to "application/xquery".)

     [[Not sure if this is clean enough.]]

8. Overriding media type information

   On some occasions it may be necessary to be able to override the media type
   of the data: for example when the XProc processor fails to detect the media
   type (or detects it incorrectly), or when the pipeline author deliberately
   wants to use a different media type (for instance, to treat SVG data
   annotated as image/svg+xml as simply application/xml).

   The override media type can be specified statically - on the XProc binding
   elements - or dynamically - using a special step.

   On the binding level, the override media type is specified using the
   "as-content-type" attribute. The fragment below shows an example of how to
   ensure that the XQuery data read from an external file is annotated as
   application/xquery:

   ...
   <p:xquery>
     <p:input port="query">
       <p:data href="searchquery.xq" as-content-type="application/xquery"/>
     </p:input>
   </p:xquery>
   ...

   Specifying the override media type on the binding level has the disadvantage
   that it is static; the override media type cannot be constructed
   dynamically. A dynamic override media type can be specified using the
   as-content-type step:

   <p:declare-step type="as-content-type">
     <p:input port="source" sequence="true" content-type="*"/>
     <p:output port="result" sequence="true" content-type="*"/>
     <p:option name="content-type" required="true"/>
   </p:declare-step>

   The as-content-type step behaves as the standard p:identity step, except that
   it annotates the output data with the media type provided via the required
   content-type option.

   Note that applying an override media type does not result in data conversion
   from the original media type to the override type; the override media type
   merely replaces the original data media type annotation. If the override
   media type is incompatible with the data media type (for example, an
   application/xml override for application/pdf), it is reasonable to expect
   that subsequent processing may fail.
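
   The "annotation only, no conversion" semantics can be illustrated with a
   short Python sketch. The Item class is a hypothetical model of a piece of
   data flowing through the pipeline, with the payload reduced to bytes.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    payload: bytes       # the data itself, never touched by the override
    content_type: str    # the media type annotation


def as_content_type(items: Iterable[Item], content_type: str) -> List[Item]:
    """Behave like p:identity, but re-annotate every item."""
    return [replace(item, content_type=content_type) for item in items]


svg = Item(b"<svg xmlns='http://www.w3.org/2000/svg'/>", "image/svg+xml")
relabelled = as_content_type([svg], "application/xml")
```

   The payload of relabelled[0] is byte-for-byte the same as that of svg; only
   the annotation differs.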

9. XPath extension functions

   The XProc XPath function library will be extended with one more function:
   p:content-type.

   The p:content-type function is declared as follows:

     p:content-type() as xs:string?
     p:content-type($arg as node()?) as xs:string?

   The function returns the value of the content-type property for $arg as
   defined by the accessor function xpdm:content-type() for that kind of node
   (see the XDM extensions in item 4). If $arg is not specified, the
   behavior is identical to calling the function with the context item (.) as
   argument.


Regards,
Vojtech

--
Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
vojtech.toman@emc.com
http://developer.emc.com/xmltech

Received on Friday, 14 September 2012 13:11:07 UTC