Streamable schema-aware processing

In response to the discussion today, I propose to add the following paragraphs at the end of section 2.10.


Streaming can be combined with schema-aware processing: the streamed input to a transformation can be subjected to on-the-fly validation. The validator typically accepts an input stream from the XML parser and delivers an output stream of type-annotated nodes to the transformation processor. The XSD specification is designed so that validation is, with a few exceptions, a streamable process. The exceptions include:

* There may be a need to allocate memory to hold keys, in order to enforce uniqueness and referential integrity constraints (xs:unique, xs:key, xs:keyref).

* In XSD 1.1, assertions can be defined by means of XPath expressions. These are not constrained to be streamable; in the general case, any subtree of the document that is validated using an assertion may need to be buffered in memory while the assertion is processed.

Applications that need to run in bounded memory may therefore need to avoid these XSD features, or to use them with care.
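
The memory cost of the first exception can be illustrated outside XSLT and XSD. The following is only a sketch, using Python's standard SAX parser as a stand-in for a streaming validator. The element name `item`, the attribute name `id`, and the helper names are hypothetical. It shows that a uniqueness check needs no tree, but it does need to remember every key seen so far until the scope of the constraint ends.

```python
import xml.sax


class UniqueKeyChecker(xml.sax.ContentHandler):
    """Rejects a repeated `id` on `item` elements (hypothetical names)."""

    def __init__(self):
        super().__init__()
        # Keys cannot be discarded until the constraint's scope has been read.
        self.seen = set()

    def startElement(self, name, attrs):
        if name == "item":
            key = attrs.get("id")
            if key in self.seen:
                raise ValueError(f"duplicate key: {key!r}")
            self.seen.add(key)


def count_retained_keys(xml_bytes):
    """Stream through the document and report how many keys were held."""
    handler = UniqueKeyChecker()
    xml.sax.parseString(xml_bytes, handler)
    return len(handler.seen)
```

In this sketch the retained set grows with the number of distinct keys rather than with the size of the content between them, which is why such constraints can often be used with care rather than avoided outright.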

XSD is designed so that the type annotation of an element can be decided as soon as the start tag of the element is encountered. At this point it is known that the element will either be of a certain type, or it will be invalid. If it turns out to be invalid, then this can always be established by the time the element’s end tag is encountered. To ensure that the XSLT processor never sees invalid data, it is necessary that the schema processor should detect validity errors as early as possible.
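
As a rough illustration of this timing, and not of any real schema processor, the sketch below uses Python's SAX parser with a toy two-entry "schema". The `SCHEMA` table, the class name, and the `run` helper are all assumptions made for the example. The type annotation is reported when the start tag is seen, and invalid content is reported no later than the matching end tag.

```python
import re
import xml.sax

# Toy "schema" (hypothetical): element name -> (type name, check on text content)
SCHEMA = {
    "qty": ("xs:integer", lambda t: re.fullmatch(r"[+-]?[0-9]+", t.strip()) is not None),
    "note": ("xs:string", lambda t: True),
}


class AnnotatingValidator(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.annotations = []  # (element name, type name), in document order
        self._stack = []       # (content check, text parts) per open element

    def startElement(self, name, attrs):
        type_name, check = SCHEMA.get(name, (None, None))
        if type_name is not None:
            # The annotation is known here, before any content has been read.
            self.annotations.append((name, type_name))
        self._stack.append((check, []))

    def characters(self, content):
        if self._stack:
            self._stack[-1][1].append(content)

    def endElement(self, name):
        check, parts = self._stack.pop()
        # Invalidity is established no later than the end tag.
        if check is not None and not check("".join(parts)):
            raise ValueError(f"invalid content for {name}")


def run(xml_bytes):
    """Return (annotations emitted so far, error message or None)."""
    handler = AnnotatingValidator()
    try:
        xml.sax.parseString(xml_bytes, handler)
    except ValueError as exc:
        return handler.annotations, str(exc)
    return handler.annotations, None
```

For the input `<order><qty>twelve</qty></order>`, `run` returns the annotation for `qty` together with an error: the consumer has already been told the type by the time the content proves invalid, which is the window the paragraph above is concerned with.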

By default, dynamic errors occurring during streamed processing are fatal: they typically cause the transformation to fail immediately. XSLT 3.0 introduces the ability to catch dynamic errors and recover from them. A validation failure, however, represents a failure of the instruction that processes an entire input stream, so after a validation failure, no further processing of that input stream is possible.

A streamed transformation that only accesses part of the input document (for example, the metadata at the start of a document) is not required to read the entire document once the data it requires has been read. This means that XML well-formedness or validity errors occurring in the unread part of the input stream may go undetected.
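
The same effect can be observed with an ordinary pull parser. The sketch below is an analogy in Python rather than XSLT, and the document and function names are made up for the example. It reads a title from the start of a document whose later content is not well-formed. The consumer that stops early never sees the error, while the consumer that reads to the end does. Whether the underlying parser has already buffered the later bytes is an implementation detail; what matters is that the caller stops requesting events.

```python
import io
import xml.etree.ElementTree as ET

# Well-formed metadata, followed by a mismatched tag later in the document.
DOC = b"<doc><meta><title>Report</title></meta><body><p>unclosed</body></doc>"


def read_title(xml_bytes):
    """Return the first <title> text, then stop consuming parser events."""
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == "title":
            # Later events, and any later error, are never delivered to the caller.
            return elem.text
    return None


def read_everything(xml_bytes):
    """Consume every event; a well-formedness error anywhere will surface."""
    return [elem.tag for _event, elem in ET.iterparse(io.BytesIO(xml_bytes))]
```

Calling `read_title(DOC)` returns the title without complaint, whereas `read_everything(DOC)` raises `ET.ParseError` for the same input.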


Michael Kay
Saxonica

Received on Thursday, 8 October 2015 20:08:50 UTC