- From: Toman, Vojtech <vojtech.toman@emc.com>
- Date: Fri, 14 Sep 2012 09:10:25 -0400
- To: "public-xml-processing-model-wg@w3.org" <public-xml-processing-model-wg@w3.org>
Hi all,

See below for the first proposal for non-XML data support in XProc. It is, with some additional remarks here and there, more or less what I described in my XML Prague 2012 paper:

http://www.xmlprague.cz/2012/files/xmlprague-2012-proceedings.pdf

(The paper also includes a couple of sample pipelines.)

---

1. Both XML and non-XML data can flow through the pipeline. XML data flows as XML Infoset instances, and non-XML data as "raw" octet streams. I will refer to the union of XML and non-XML data simply as "data".

2. The data that flows through the pipeline is annotated with media type information (application/xml, image/svg+xml, text/plain, image/jpeg etc.).

3. XProc steps declare what media types they expect on their input ports and what media types they produce on their output ports. If data of an incompatible media type arrives on a port of a step, the XProc processor attempts to convert ("shim") the data to the appropriate media type using some well-defined algorithm - see below.

4. To allow for non-XML support in XPath, we will introduce a number of XDM extensions:

- A new property on the document node: content-type, possibly empty

- A new accessor defined on all kinds of XDM nodes:

    xpdm:content-type($n as node()) as xs:string?

  For the document node, the xpdm:content-type accessor returns the value of the content-type property. For the other types of nodes (element, attribute, text, namespace, processing instruction, and comment), it returns the value of the content-type property of the owner document.

- A new binary data node to represent non-XML data. The binary data node has the following properties:

  - base-uri, possibly empty
  - content-type, possibly empty

  For the binary data node, the dm:base-uri accessor returns the value of the base-uri property, and the xpdm:content-type accessor returns the value of the content-type property. The dm:node-kind accessor returns the value "binary-data".
  All other accessors defined on XDM nodes return the empty sequence for the binary data node.

  [[Note that it should be possible to expose the octet sequence of the binary data node by introducing a special property and an accessor (representing the octets for instance as a sequence of xs:unsignedByte or xs:integer). Not sure if we want to go that route or not.]]

5. Shimming

While evaluating a pipeline, the XProc processor performs the following algorithm when data appears on a port of a step:

- If the port media type is a wildcard or if the data media type is the same as the port media type, the data appears on the port with no modifications; otherwise

- if the XProc processor knows how to map (see the discussion below) from the data media type to the port media type, the data is converted to the port media type; otherwise

- the XProc processor performs one of the following fall-back actions:

  - If both the data and the port media types are XML media types, the data appears on the port with no modifications.
  - If the port media type is application/xml, the data is processed as if it were read via the standard XProc p:data binding with a c:data wrapper element.
  - If both the data and the port media types are text media types, the data appears on the port with no modifications.
  - Any other combination of the data and the port media types results in a dynamic error.

[[Note: Some aspects of the above algorithm, especially the fall-back behavior, may be questionable. This definitely needs some discussion.]]

An important aspect of the above algorithm is that it applies not only to the input ports, but also to the output ports: before the data appears on an output port, it is converted to the appropriate media type.
This leads to a number of interesting properties, especially in conjunction with compound steps - it is, for example, possible to create a p:for-each loop whose sub-pipeline produces data of all sorts of media types which are then "consolidated" into one media type as specified on the p:for-each's output port.

The media type conversion applies only to the p:input and p:output elements. It does not take place when the XProc processor processes the p:with-option, p:with-param, and p:variable elements, nor the p:xpath-context, p:iteration-source, and p:viewport-source elements. It also does not apply when the XProc processor evaluates the test expressions of p:choose/p:when elements. In these cases, the XPath expressions use the original data as the context item.

The kinds of mappings between different media types that the XProc processor supports are left implementation-defined.

[[Note: I think that to make the "shimming" feature interoperable and actually useful at all, it should not be left too implementation-defined. I think we would have to define a bunch of mappings between common media types that the users can rely on. The downside of this is that this might be quite hard and it might shift the focus of this specification into a whole different direction. A radical approach might be not to support shimming at all and simply say that if data of incompatible media type arrives, you get a dynamic error. Conversion between different media types can be left to special-purpose custom (or standardized?) steps.]]

6. Modifications to the XProc language

- p:input, p:output

  Media type annotations can be added to input and output port declarations using the "content-type" attribute. The value of the content-type attribute is either an exact media type string (such as application/xml) or a wildcard, represented by the "*" character. If the content-type attribute is not specified on a port declaration, the media type application/xml is assumed.
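To make the shimming algorithm from section 5 and the content-type defaulting rule above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not part of the proposal: the names Data, ShimError, port_media_type, is_xml, is_text, wrap_in_c_data and shim are all hypothetical, the tests for "XML media type" and "text media type" are my own guesses (the proposal does not define them), and the fall-back actions are tried in the order in which they are listed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

WILDCARD = "*"
DEFAULT_PORT_TYPE = "application/xml"


class ShimError(Exception):
    """Stands in for the dynamic error raised for incompatible media types."""


@dataclass
class Data:
    media_type: str
    payload: object  # parsed XML or raw octets; opaque to this sketch


# Hypothetical mapping table: (data media type, port media type) -> converter.
Converter = Callable[[Data], Data]
Mappings = Dict[Tuple[str, str], Converter]


def port_media_type(content_type_attr: Optional[str]) -> str:
    # A missing content-type attribute means application/xml.
    return DEFAULT_PORT_TYPE if content_type_attr is None else content_type_attr


def is_xml(media_type: str) -> bool:
    # Assumption: text/xml, application/xml, or any type with a +xml suffix.
    return media_type in ("text/xml", "application/xml") or media_type.endswith("+xml")


def is_text(media_type: str) -> bool:
    # Assumption: any text/* type counts as a text media type.
    return media_type.startswith("text/")


def wrap_in_c_data(data: Data) -> Data:
    # Placeholder for "as if read via p:data with a c:data wrapper element".
    wrapper = {"wrapper": "c:data", "content-type": data.media_type, "content": data.payload}
    return Data("application/xml", wrapper)


def shim(data: Data, content_type_attr: Optional[str], mappings: Mappings) -> Data:
    port_type = port_media_type(content_type_attr)
    # Wildcard port or exact match: no modification.
    if port_type == WILDCARD or data.media_type == port_type:
        return data
    # A known mapping: convert to the port media type.
    converter = mappings.get((data.media_type, port_type))
    if converter is not None:
        return converter(data)
    # Fall-back actions, tried in the listed order.
    if is_xml(data.media_type) and is_xml(port_type):
        return data
    if port_type == "application/xml":
        return wrap_in_c_data(data)
    if is_text(data.media_type) and is_text(port_type):
        return data
    raise ShimError(f"cannot shim {data.media_type} to {port_type}")
```

With an empty mapping table, SVG data arriving on a port without a content-type attribute passes through untouched (both types are XML), whereas text/plain data arriving on the same port gets the c:data treatment.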
  The declaration below declares a step that accepts XML data on the source input port and that produces PDF output on the result output port:

    <p:declare-step>
      <p:input port="source" content-type="application/xml"/>
      <p:output port="result" content-type="application/pdf"/>
      ...
    </p:declare-step>

  The following example declares a step that can process and produce data of any media type:

    <p:declare-step>
      <p:input port="source" content-type="*"/>
      <p:output port="result" content-type="*"/>
      ...
    </p:declare-step>

  The content-type attribute cannot be used on parameter input ports; the media type of parameter input ports is always application/xml.

- p:data

  For the purpose of better supporting non-XML media types, the p:data binding can return raw, not wrapped, data. [[Note: Possibly the biggest breaking change to the language?]]

  Where previously the p:data binding always encoded and wrapped the resource referred to via the href attribute (the wrapper being either a custom-specified element or the default c:data element), the modified p:data only encodes and wraps the resource when the pipeline author requests an explicit wrapper using the wrapper attribute. If no wrapper element is specified, p:data returns the resource "as is".

  The semantics of the content-type attribute of p:data remain the same: if the resource comes with a media type annotation, that one must be used, otherwise the media type specified in the content-type attribute should be assumed. If no media type information can be associated with the resource, the media type application/octet-stream is assumed.

- p:pipeline

  The p:pipeline shortcut supports input and output of any media type by default.
  It is equivalent to the following p:declare-step:

    <p:declare-step>
      <p:input port="source" primary="true" sequence="false" content-type="*"/>
      <p:input port="parameters" primary="true" kind="parameter"/>
      <p:output port="result" primary="true" sequence="false" content-type="*"/>
    </p:declare-step>

  This makes it possible to use p:pipeline to process both XML and non-XML data easily.

- p:for-each

  The p:for-each step can be used to process data of any media type. The "current" implicit input port supports data of any media type.

  If the p:for-each step contains explicit p:output declarations, then, inside of the p:for-each, these output ports accept any media type regardless of the value of the content-type attribute. On the outside of p:for-each, however, the data appearing on the output port gets converted to the appropriate media type.

  By default, the implicit output port of p:for-each supports any media type.

- p:choose

  The p:choose step can process data of any media type. The p:when branches must declare the same number of output ports with the same names - however, these output ports may specify different media types.

  By default, the implicit output ports of p:when (and p:choose) support any media type.

- p:group

  The p:group wrapper can be used to process data of any media type. By default, the implicit output port of p:group supports any media type.

- p:try

  The p:try step can be used to process data of any media type. The error input port of p:catch (the XML representation of the dynamic error) accepts data of the media type application/xml.

  The p:group and p:catch sub-pipelines of p:try must declare the same number of output ports with the same names, but they may declare different media types.

  By default, the implicit output port of p:catch supports any media type.
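The output port rule shared by p:choose and p:try above (same number of ports, same names, media types free to differ) boils down to a comparison of port names only. A minimal Python sketch follows; modelling a branch as a dictionary from port name to declared media type is my own simplification, and the name branches_compatible is hypothetical.

```python
from typing import Dict, List

# Assumption: a branch is modelled as {output port name: declared media type}.
Branch = Dict[str, str]


def branches_compatible(branches: List[Branch]) -> bool:
    """Return True when every branch declares the same set of output port names.

    Declared media types are deliberately ignored: the branches may differ there.
    """
    if not branches:
        return True
    expected = set(branches[0])
    return all(set(branch) == expected for branch in branches[1:])


# Two p:when branches with the same port name but different media types.
ok = branches_compatible([
    {"result": "application/xml"},
    {"result": "application/pdf"},
])

# A second branch that adds an extra "report" port breaks the rule.
bad = branches_compatible([
    {"result": "*"},
    {"result": "*", "report": "application/xml"},
])
```

Since dictionary keys are unique, equal name sets also imply an equal number of ports, so no separate count check is needed in this model.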
- p:data, p:document, p:inline, p:pipe

  The p:data, p:document, p:inline, and p:pipe bindings can specify an optional attribute "as-content-type" - for more details, see "Overriding media type information."

7. Modifications to the XProc standard step library

Most atomic steps from the standard XProc step library are too XML-specific (for instance p:xinclude or p:validate-with-xml-schema) to be easily applicable to non-XML data. However, a small number of steps can be - in some cases to great benefit - adapted for non-XML data processing. This requires both modifying the implementations of the steps and changing their declarations in the standard step library by adding more relaxed media type annotations to their input and output ports.

The list below summarizes the changes in more detail:

- p:count

  The p:count step can be used to process any media type. The output format of p:count (a c:result document with the count) remains unchanged from the specification.

- p:http-request

  The p:http-request step can produce output of any media type. Much like the p:data binding, the p:http-request step presents non-XML responses in their raw form, no longer base64-encoding or wrapping them. If the media type of the response data cannot be determined, application/octet-stream is assumed.

  The "detailed" response mode of p:http-request remains unchanged from the original specification, and so does the handling of multipart responses.

- p:identity

  The p:identity step can be used to process any media type.

- p:sink

  The p:sink step can be used to process any media type.

- p:split-sequence

  The p:split-sequence step can be used to process any media type. [[Note: It is possible, for instance, to split input data based on the media type information.]]

- p:store

  The p:store step can be used to store data of any media type. The XML serialization options are applied only to data that has an XML media type.
- p:exec

  The p:exec step has two new optional options: result-content-type and errors-content-type. If specified, the options determine the media type of the standard output and standard error output, respectively, of the command.

  If the data that appears on the source input port is not XML, it is passed in its raw form to the command as its standard input (the "source-is-xml" option is ignored in this case). XML input data is processed in the usual way.

  It is a dynamic error (err:XC0035) if both result-is-xml and wrap-result-lines are true.

  It is a dynamic error (err:XC????) if result-is-xml is true and result-content-type is specified and is not an XML media type, or if result-is-xml is false and result-content-type is specified and is an XML media type.

  It is a dynamic error (err:XC????) if result-is-xml is false, result-content-type is specified, and wrap-result-lines is true.

  If result-is-xml is true, the standard output of the program is assumed to be XML and will be parsed as a single document. If it is false and result-content-type is not specified, the output is assumed not to be XML and will be returned as escaped text wrapped in a c:result element. If result-is-xml is false and result-content-type is specified, the raw data will be returned "as is."

  If result-content-type is specified, the result will be annotated with the specified media type.

  If wrap-result-lines is true, a c:line element will be wrapped around each line of output.

  The same rules apply to the standard error output of the program, with the errors-is-xml, errors-content-type, and wrap-error-lines options, respectively.

  [[Note: This feels too complicated, perhaps we can simplify it.]]

- p:xquery

  If the media type of the data that appears on the query input port is application/xquery, the data is passed to the query engine "as is". (Conceptually, this can be seen as implicit wrapping in a c:data element with a content-type attribute set to "application/xquery".)
  [[Not sure if this is clean enough.]]

8. Overriding media type information

On some occasions it may be necessary to override the media type of the data: for example when the XProc processor fails to detect the media type (or detects it incorrectly), or when the pipeline author deliberately wants to use a different media type (for instance, to treat SVG data annotated as image/svg+xml as simply application/xml).

The override media type can be specified statically - on the XProc binding elements - or dynamically - using a special step.

On the binding level, the override media type is specified using the "as-content-type" attribute. The fragment below shows an example of how to ensure that the XQuery data read from an external file is annotated as application/xquery:

  ...
  <p:xquery>
    <p:input port="query">
      <p:data href="searchquery.xq" as-content-type="application/xquery"/>
    </p:input>
  </p:xquery>
  ...

Specifying the override media type on the binding level has the disadvantage that it is static; the override media type cannot be constructed dynamically. A dynamic override media type can be specified using the as-content-type step:

  <p:declare-step type="as-content-type">
    <p:input port="source" sequence="true" content-type="*"/>
    <p:output port="result" sequence="true" content-type="*"/>
    <p:option name="content-type" required="true"/>
  </p:declare-step>

The as-content-type step behaves as the standard p:identity step, except that it annotates the output data with the media type provided via the required content-type option.

Note that applying an override media type does not result in data conversion from the original media type to the override type; the override media type merely replaces the original data media type annotation. If the override media type is incompatible with the data media type (for example, an application/xml override for application/pdf), it is reasonable to expect that subsequent processing may fail.

9. XPath extension functions

The XProc XPath function library will be extended with one more function: p:content-type. The p:content-type function is declared as follows:

  p:content-type() as xs:string?
  p:content-type($arg as node()?) as xs:string?

The function returns the value of the content-type property for $arg as defined by the accessor function xpdm:content-type() for that kind of node (see "Extensions to the XPath data model"). If $arg is not specified, the behavior is identical to calling the function with the context item (.) as argument.

Regards,
Vojtech

--
Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
vojtech.toman@emc.com
http://developer.emc.com/xmltech
Received on Friday, 14 September 2012 13:11:07 UTC