- From: Toman, Vojtech <vojtech.toman@emc.com>
- Date: Fri, 14 Sep 2012 09:10:25 -0400
- To: "public-xml-processing-model-wg@w3.org" <public-xml-processing-model-wg@w3.org>
Hi all,
See below for the first proposal for non-XML data support in XProc. It is, with
some additional remarks here and there, more or less what I described in my
XML Prague 2012 paper:
http://www.xmlprague.cz/2012/files/xmlprague-2012-proceedings.pdf
(The paper also includes a couple of sample pipelines.)
---
1. Both XML and non-XML data can flow through the pipeline. XML data flows as
XML Infoset instances, and non-XML data as "raw" octet streams. I will refer
to the union of XML and non-XML data simply as "data".
2. The data that flows through the pipeline is annotated with media type
information (application/xml, image/svg+xml, text/plain, image/jpeg etc.).
3. XProc steps declare what media types they expect on their input ports and
what media types they produce on their output ports. If data of an
incompatible media type arrives on a port of a step, the XProc processor
attempts to convert ("shim") the data to the appropriate media type using
some well-defined algorithm - see below.
4. To allow for non-XML support in XPath, we will introduce a number of XDM
extensions:
- A new property on the document node: content-type, possibly empty
- A new accessor defined on all kinds of XDM nodes:
xpdm:content-type($n as node()) as xs:string?
For the document node, the xpdm:content-type accessor returns the value of
the content-type property. For the other types of nodes (element,
attribute, text, namespace, processing instruction, and comment), it
returns the value of the content-type property of the owner document.
- A new binary data node to represent non-XML data. The binary data node has
the following properties:
- base-uri, possibly empty
- content-type, possibly empty
For the binary data node, the dm:base-uri accessor returns the value of the
base-uri property, and the xpdm:content-type accessor returns the value of
the content-type property. The dm:node-kind accessor returns the value
"binary-data". All other accessors defined on XDM nodes return the empty
sequence for the binary node.
[[Note that it should be possible to expose the octet sequence of the
binary data node by introducing a special property and an accessor
(representing the octets for instance as a sequence of xs:unsignedByte or
xs:integer). Not sure if we want to go that route or not.]]
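To make the accessor rules in item 4 concrete, here is a minimal Python sketch of the extended data model. The class and function names (DocumentNode, ElementNode, BinaryDataNode, node_kind, content_type) are hypothetical stand-ins chosen for illustration; ElementNode represents any non-document XML node, and nothing here is meant as a definitive design.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class DocumentNode:
    """Document node carrying the proposed content-type property."""
    content_type: Optional[str] = None
    base_uri: Optional[str] = None


@dataclass
class ElementNode:
    """Stand-in for any non-document XML node; it only tracks its owner."""
    name: str
    owner: DocumentNode


@dataclass
class BinaryDataNode:
    """Proposed binary data node with just base-uri and content-type."""
    content_type: Optional[str] = None
    base_uri: Optional[str] = None


Node = Union[DocumentNode, ElementNode, BinaryDataNode]


def node_kind(node: Node) -> str:
    """Sketch of dm:node-kind, extended with the "binary-data" kind."""
    if isinstance(node, BinaryDataNode):
        return "binary-data"
    if isinstance(node, DocumentNode):
        return "document"
    return "element"


def content_type(node: Node) -> Optional[str]:
    """Sketch of xpdm:content-type.

    Document and binary data nodes return their own property; every other
    node returns the property of its owner document.
    """
    if isinstance(node, (DocumentNode, BinaryDataNode)):
        return node.content_type
    return node.owner.content_type
```

The p:content-type XPath function described in item 9 would simply delegate to the content_type accessor above.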
5. Shimming
While evaluating a pipeline, the XProc processor performs the following
algorithm when data appears on a port of a step:
- If the port media type is a wildcard or if the data media type is the same
as the port media type, the data appears on the port with no modifications;
otherwise
- if the XProc processor knows how to map (see the discussion below) from the
data media type to the port media type, the data is converted to the port
media type; otherwise
- the XProc processor performs one of the following fall-back actions:
- If both the data and the port media types are XML media types, the data
appears on the port with no modifications.
- If the port media type is application/xml, the data is processed as if it
was read via the standard XProc p:data binding with a c:data wrapper
element.
- If both the data and the port media types are text media types, the data
appears on the port with no modifications.
- Any other combination of the data and the port media types results in a
dynamic error.
[[Note: Some aspects of the above algorithm, especially the fall-back
behavior, may be questionable. This definitely needs some discussion.]]
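The algorithm above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names (shim, is_xml, is_text, wrap_in_c_data, DynamicError) are hypothetical, the test for "XML media type" and "text media type" is a simplification, the c:data wrapper always base64-encodes for brevity, and the sketch assumes that data passed through "with no modifications" keeps its original media type annotation.

```python
import base64
from typing import Callable, Dict, Tuple
from xml.sax.saxutils import quoteattr

Converter = Callable[[bytes], bytes]
Mappings = Dict[Tuple[str, str], Converter]


class DynamicError(Exception):
    """Raised when no conversion or fall-back applies."""


def is_xml(media_type: str) -> bool:
    # Assumption: application/xml, text/xml, and any "+xml" type count as XML.
    return media_type in ("application/xml", "text/xml") or media_type.endswith("+xml")


def is_text(media_type: str) -> bool:
    # Assumption: "text media type" means the text/* family.
    return media_type.startswith("text/")


def wrap_in_c_data(data: bytes, media_type: str) -> bytes:
    # Simplified p:data-style wrapper: always base64-encodes the payload.
    payload = base64.b64encode(data).decode("ascii")
    xml = ('<c:data xmlns:c="http://www.w3.org/ns/xproc-step" '
           f'content-type={quoteattr(media_type)} encoding="base64">'
           f'{payload}</c:data>')
    return xml.encode("utf-8")


def shim(data: bytes, data_type: str, port_type: str,
         mappings: Mappings) -> Tuple[bytes, str]:
    """Return (data, media type) as it would appear on the port."""
    # Wildcard port or exact match: pass through unchanged.
    if port_type == "*" or data_type == port_type:
        return data, data_type
    # Known mapping: convert to the port media type.
    converter = mappings.get((data_type, port_type))
    if converter is not None:
        return converter(data), port_type
    # Fall-back actions, in the order listed above.
    if is_xml(data_type) and is_xml(port_type):
        return data, data_type
    if port_type == "application/xml":
        return wrap_in_c_data(data, data_type), "application/xml"
    if is_text(data_type) and is_text(port_type):
        return data, data_type
    raise DynamicError(f"cannot shim {data_type} to {port_type}")
```

For example, with an empty mappings table, shim(b"<svg/>", "image/svg+xml", "application/xml", {}) passes the data through untouched, while image/jpeg data arriving on an application/pdf port raises DynamicError.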
An important aspect of the above algorithm is that it applies not only to the
input ports, but also to the output ports: before the data appears on an
output port, it is converted to the appropriate media type. This leads to a
number of interesting properties, especially in conjunction with compound
steps - it is, for example, possible to create a p:for-each loop whose
sub-pipeline produces data of all sorts of media types which are then
"consolidated" into one media type as specified on the p:for-each's output
port.
The media type conversion applies only to the p:input and p:output
elements. It does not take place when the XProc processor processes the
p:with-option, p:with-param, and p:variable elements, nor the
p:xpath-context, p:iteration-source, and p:viewport-source elements. It also
does not apply when the XProc processor evaluates the test expressions of
p:choose/p:when elements. In these cases, the XPath expressions use the
original data as the context item.
The kinds of mappings between different media types that the XProc processor
supports are left implementation-defined.
[[Note: I think that to make the "shimming" feature interoperable and
actually useful at all, it should not be left too implementation-defined. I
think we would have to define a bunch of mappings between common media types
that the users can rely on. The downside of this is that this might be quite
hard and it might shift the focus of this specification into a whole
different direction.
A radical approach might be not to support shimming at all and simply say
that if data of incompatible media type arrives, you get a dynamic
error. Conversion between different media types can be left to
special-purpose custom (or standardized?) steps.]]
6. Modifications to the XProc language
- p:input, p:output
Media type annotations can be added to input and output port declarations
using the "content-type" attribute. The value of the content-type attribute
is either an exact media type string (such as application/xml) or a
wildcard, represented by the "*" character. If the content-type attribute
is not specified on a port declaration, the media type application/xml is
assumed.
The declaration below declares a step that accepts XML data on the source
input port and that produces PDF output on the result output port:
<p:declare-step>
<p:input port="source" content-type="application/xml"/>
<p:output port="result" content-type="application/pdf"/>
...
</p:declare-step>
The following example declares a step that can process and produce data of
any media type:
<p:declare-step>
<p:input port="source" content-type="*"/>
<p:output port="result" content-type="*"/>
...
</p:declare-step>
The content-type attribute cannot be used on parameter input ports; the media
type of parameter input ports is always application/xml.
- p:data
For the purpose of better supporting non-XML media types, the p:data
binding can return raw, not wrapped, data. [[Note: Possibly the biggest
breaking change to the language?]]
Where previously the p:data binding always encoded and wrapped the resource
referred to via the href attribute (the wrapper being either a
custom-specified element or the default c:data element), the modified
p:data only encodes and wraps the resource when the pipeline author
requests an explicit wrapper using the wrapper attribute. If no wrapper
element is specified, p:data returns the resource "as is".
The semantics of the content-type attribute of p:data remains the same: if
the resource comes with a media type annotation, that one must be used,
otherwise the media type specified in the content-type attribute should be
assumed. If no media type information can be associated with the resource,
the media type application/octet-stream is assumed.
- p:pipeline
The p:pipeline shortcut supports input and output of any media type by
default. It is equivalent to the following p:declare-step:
<p:declare-step>
<p:input port="source" primary="true" sequence="false" content-type="*"/>
<p:input port="parameters" primary="true" kind="parameter"/>
<p:output port="result" primary="true" sequence="false" content-type="*"/>
</p:declare-step>
This makes it possible to use p:pipeline to process both XML and non-XML
data easily.
- p:for-each
The p:for-each step can be used to process data of any media type. The
"current" implicit input port supports data of any media type. If the
p:for-each step contains explicit p:output declarations, then, inside of
the p:for-each, these output ports accept any media type regardless of the
value of the content-type attribute. On the outside of p:for-each, however,
the data appearing on the output port gets converted to the appropriate
media type. By default, the implicit output port of p:for-each supports any
media type.
- p:choose
The p:choose step can process data of any media type. The p:when branches
must declare the same number of output ports with the same names -
however, these output ports may specify different media types. By default,
the implicit output ports of p:when (and p:choose) support any media type.
- p:group
The p:group wrapper can be used to process data of any media type. By
default, the implicit output port of p:group supports any media type.
- p:try
The p:try step can be used to process data of any media type. The error
input port of p:catch (the XML representation of the dynamic error) accepts
data of the media type application/xml. The p:group and p:catch
sub-pipelines of p:try must declare the same number of output ports with
the same names, but those ports may declare different media types. By
default, the implicit output port of p:catch supports any media type.
- p:data, p:document, p:inline, p:pipe
The p:data, p:document, p:inline, and p:pipe bindings can specify an
optional attribute "as-content-type" - for more details, see "Overriding
media type information."
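The modified p:data behavior described in item 6 can be sketched in Python as shown below. The function names (resolve_media_type, load_p_data) and parameters are hypothetical. The sketch always base64-encodes when a wrapper is requested, and it assumes the wrapped result is annotated as application/xml, which the proposal does not state explicitly.

```python
import base64
from typing import Optional, Tuple
from xml.sax.saxutils import quoteattr


def resolve_media_type(resource_type: Optional[str],
                       content_type_attr: Optional[str]) -> str:
    """Pick the media type annotation for a resource read via p:data."""
    if resource_type:            # the resource's own annotation wins
        return resource_type
    if content_type_attr:        # then the content-type attribute
        return content_type_attr
    return "application/octet-stream"


def load_p_data(content: bytes,
                resource_type: Optional[str] = None,
                content_type_attr: Optional[str] = None,
                wrapper: Optional[str] = None) -> Tuple[bytes, str]:
    """Return (data, media type) as produced by the modified p:data."""
    media_type = resolve_media_type(resource_type, content_type_attr)
    if wrapper is None:
        # No explicit wrapper: the resource is returned "as is".
        return content, media_type
    # Explicit wrapper: encode and wrap (simplified to base64 only).
    payload = base64.b64encode(content).decode("ascii")
    xml = (f'<{wrapper} content-type={quoteattr(media_type)} '
           f'encoding="base64">{payload}</{wrapper}>')
    # Assumption: a wrapped resource flows as application/xml.
    return xml.encode("utf-8"), "application/xml"
```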
7. Modifications to the XProc standard step library
Most atomic steps from the standard XProc step library are too XML-specific
(for instance p:xinclude or p:validate-with-xml-schema) to be easily
applicable to non-XML data. Having said that, however, there is a small
number of steps that can be - in some cases to great benefit - adapted for
non-XML data processing. This requires both modifying the implementations of
the steps and changing their declarations in the standard step library by
adding more relaxed media type annotations to their input and output
ports. The list below summarizes the changes in more detail:
- p:count
The p:count step can be used to process any media type. The output format
of p:count (a c:result document with the count) remains unchanged from the
specification.
- p:http-request
The p:http-request step can produce output of any media type. Much like
the p:data binding, the p:http-request step presents non-XML responses
in their raw form, neither base64-encoding nor wrapping them. If the
media type of the response data cannot be determined,
application/octet-stream is assumed.
The "detailed" response mode of p:http-request remains unchanged from the
original specification, and so does handling of multipart responses.
- p:identity
The p:identity step can be used to process any media type.
- p:sink
The p:sink step can be used to process any media type.
- p:split-sequence
The p:split-sequence step can be used to process any media type.
[[Note: It is possible, for instance, to split input data based on the
media type information.]]
- p:store
The p:store step can be used to store data of any media type. The XML
serialization options are applied only for data that has an XML media type.
- p:exec
The p:exec step has two new optional options: result-content-type and
errors-content-type. If specified, the options determine the media type of
the standard output and standard error output, respectively, of the
command.
If the data that appears on the source input port is not XML, it is passed
in its raw form to the command as its standard input (the "source-is-xml"
option is ignored in this case). XML input data is processed in the usual
way.
It is a dynamic error (err:XC0035) if both result-is-xml and
wrap-result-lines are true.
It is a dynamic error (err:XC????) if result-is-xml is true and
result-content-type is specified and is not an XML media type, or if
result-is-xml is false and result-content-type is specified and is an XML
media type.
It is a dynamic error (err:XC????) if result-is-xml is false,
result-content-type is specified, and wrap-result-lines is true.
If result-is-xml is true, the standard output of the program is assumed to
be XML and will be parsed as a single document. If it is false and
result-content-type is not specified, the output is assumed not to be XML
and will be returned as escaped text wrapped in a c:result element. If
result-is-xml is false and result-content-type is specified, the raw data
will be returned "as is."
If result-content-type is specified, the result will be annotated with the
specified media type.
If wrap-result-lines is true, a c:line element will be wrapped around each
line of output.
The same rules apply to the standard error output of the program, with the
errors-is-xml, errors-content-type, and wrap-error-lines options,
respectively.
[[Note: This feels too complicated, perhaps we can simplify it.]]
- p:xquery
If the media type of the data that appears on the query input port is
application/xquery, the data is passed to the query engine "as
is". (Conceptually, this can be seen as implicit wrapping in a c:data
element with a content-type attribute set to "application/xquery".)
[[Not sure if this is clean enough.]]
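Because the p:exec option rules in item 7 interact in several ways, a small Python sketch may help when reviewing them. It only decides how the standard output would be presented. The names exec_result_mode and ExecOptionError, the mode strings it returns, and the is_xml test are hypothetical; the unassigned error code is kept as written in the proposal.

```python
from typing import Optional


class ExecOptionError(Exception):
    """Dynamic error raised for an invalid option combination."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def is_xml(media_type: str) -> bool:
    # Assumption: application/xml, text/xml, and any "+xml" type count as XML.
    return media_type in ("application/xml", "text/xml") or media_type.endswith("+xml")


def exec_result_mode(result_is_xml: bool = False,
                     wrap_result_lines: bool = False,
                     result_content_type: Optional[str] = None) -> str:
    """Return how the command's standard output is presented."""
    if result_is_xml and wrap_result_lines:
        raise ExecOptionError("err:XC0035",
                              "result-is-xml and wrap-result-lines are both true")
    if result_content_type is not None:
        # result-is-xml must agree with the declared media type.
        if result_is_xml != is_xml(result_content_type):
            raise ExecOptionError("err:XC????",
                                  "result-is-xml contradicts result-content-type")
        if not result_is_xml and wrap_result_lines:
            raise ExecOptionError("err:XC????",
                                  "raw output cannot be wrapped in c:line")
    if result_is_xml:
        return "xml"                # parsed as a single document
    if result_content_type is None:
        # Escaped text in c:result, optionally with c:line per line.
        return "c:result+c:line" if wrap_result_lines else "c:result"
    return "raw"                    # returned "as is" with the given type
```

The same function would apply to the standard error output by passing the errors-is-xml, wrap-error-lines, and errors-content-type values instead.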
8. Overriding media type information
On some occasions it may be necessary to be able to override the media type
of the data: for example when the XProc processor fails to detect the media
type (or detects it incorrectly), or when the pipeline author deliberately
wants to use a different media type (for instance, to treat SVG data
annotated as image/svg+xml as simply application/xml).
The override media type can be specified statically - on the XProc binding
elements - or dynamically - using a special step.
On the binding level, the override media type is specified using the
"as-content-type" attribute. The fragment below shows an example of how to
ensure that the XQuery data read from an external file is annotated as
application/xquery:
...
<p:xquery>
<p:input port="query">
<p:data href="searchquery.xq" as-content-type="application/xquery"/>
</p:input>
</p:xquery>
...
Specifying the override media type on the binding level has the disadvantage
that it is static; the override media type cannot be constructed
dynamically. A dynamic override media type can be specified using the
as-content-type step:
<p:declare-step type="as-content-type">
<p:input port="source" sequence="true" content-type="*"/>
<p:output port="result" sequence="true" content-type="*"/>
<p:option name="content-type" required="true"/>
</p:declare-step>
The as-content-type step behaves like the standard p:identity step, except that
it annotates the output data with the media type provided via the required
content-type option.
Note that applying an override media type does not result in data conversion
from the original media type to the override type; the override media type
merely replaces the original data media type annotation. If the override
media type is incompatible with the data media type (for example, an
application/xml override for application/pdf), it is reasonable to expect
that subsequent processing may fail.
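The difference between overriding and converting can be illustrated with a short Python sketch. The Item class and the as_content_type function are hypothetical models of annotated data and of the step described above; the point is only that the payload is left untouched while the annotation changes.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    """A piece of data flowing through the pipeline, with its annotation."""
    payload: bytes
    content_type: str


def as_content_type(items: Iterable[Item], content_type: str) -> List[Item]:
    """Identity on the payload; only the media type annotation is replaced."""
    return [replace(item, content_type=content_type) for item in items]
```

For instance, as_content_type([Item(b"<svg/>", "image/svg+xml")], "application/xml") yields an item with the same bytes annotated as application/xml. Applying "application/xml" to PDF bytes would succeed just as silently, which is why later steps may fail.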
9. XPath extension functions
The XProc XPath function library will be extended with one more function:
p:content-type.
The p:content-type function is declared as follows:
p:content-type() as xs:string?
p:content-type($arg as node()?) as xs:string?
The function returns the value of the content-type property for $arg as
defined by the accessor function xpdm:content-type() for that kind of node
(see the XDM extensions in item 4). If $arg is not specified, the
behavior is identical to calling the function with the context item (.) as
argument.
Regards,
Vojtech
--
Vojtech Toman
Consultant Software Engineer
EMC | Information Intelligence Group
vojtech.toman@emc.com
http://developer.emc.com/xmltech
Received on Friday, 14 September 2012 13:11:07 UTC