MTOM Inclusion Format For You (MIFFY)

2003-11-28

1. Introduction

This document specifies the MIFFY format, a means of serializing XML infosets that contain certain types of content more efficiently.

A MIFFY document is created by placing a serialization of the XML infoset inside of an extensible packaging format and then re-encoding selected portions of its content alongside it, while marking their locations in the XML with a special element that links to the packaged data using URIs.

Specifically, MIFFY optimizes those XML elements that have base64Binary encoded content, and does so by packaging them in the MIME Multipart/Related format.

1.1 Terminology

The following terms are defined and used by this document:

Target Infoset - the original XML infoset to be optimized
Optimization Candidate - a portion of the Target Infoset that has been identified as suitable for optimization
Optimized Infoset - The Target Infoset after optimization has occurred
Optimized Data - The data corresponding to Optimization Candidates which have been removed from the Target Infoset after optimization has occurred
MIFFY document - The package containing the Optimized Infoset and any Optimized Data

1.2 Notational Conventions

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC 2119].

This specification uses a number of namespace prefixes throughout; they are listed below. Note that the choice of any namespace prefix is arbitrary and not semantically significant (see XML Infoset [XML Infoset]).

xbinc - [TBD]
mime - [TBD]

1.3 Infoset and Data Model Terminology

This specification refers to XML constructs using the XML Infoset [XML Infoset] terminology, as well as that defined by the XML Query Data Model [XML Query Data Model].

Infosets capture the tree structure, the names of the elements, the character content of elements and attributes, and so on. The Infoset does not model schema data types, such as integer, and thus provides no association between character strings and values.

The purpose of this format is to optimize certain XML-based structures by relying on type information that may be available at serialization time. MIFFY does not specify any particular means by which such type information is to be determined: schema validation is one possibility, but serializers MAY determine or establish types using other means. Type information need be provided only for element nodes that are to be optimized.

Unlike the Infoset, the XQuery 1.0 and XPath 2.0 Data Model ([XML Query Data Model] ... hereinafter referred to as the "data model") provides a model that carries type and value space information for each element and attribute. Accordingly, MIFFY is expressed in terms of that data model. A precondition for use of this format is therefore availability of a data model for the structure to be serialized. Details of the correspondence between Infosets and data models are provided in A. Mapping between Infosets and Data Models. The data model introduces accessors such as dm:string-value, dm:type and dm:typed-value, which are used in this specification.

Many applications of XML do not, in general, include schema type information. Accordingly, this specification does not require that dm:type or dm:typed-value be reconstructed. This format thus implements a "lossy" model, in which type information available to the serializer may be used for purposes of optimization, but need not in general be provided to the deserializer (except insofar as necessary to perform deserialization). The data models at both ends are therefore identical in overall structure, dm:string-value, and dm:children content, but not necessarily with respect to dm:type and dm:typed-value.

2. MIFFY constructs

A MIFFY document is constructed as a MIME Multipart/Related package with an XML root part, which is a serialization of the Optimized Infoset. These constructs are specified in detail below.

2.1 MIME Multipart packaging

MIFFY Documents MUST be valid MIME Multipart/Related documents, as specified by [rfc2387]. Ordering of MIME parts MUST NOT be considered significant to MIFFY processing or to the construction of the Target Infoset.

The root MIME part MUST be an XML 1.0 serialization [xml1.0] of the Optimized Infoset, and MUST be identified with the [ TBD ] media type.

2.2 Optimized Infoset

The Optimized Infoset MAY contain any information item, and SHOULD contain xbinc:Include element information items. Information items other than those defined below MUST be ignored for the purposes of MIFFY processing.

2.2.1 xbinc:Include element information item

The xbinc:Include element node accessor values are as follows:

dm:node-kind MUST be element.
dm:node-name MUST be {http://www.w3.org/2003/06/soap/features/binary-inclusion;Include}.
There MUST NOT be element nodes among dm:children.
There MAY be more than one attribute node comprising dm:attributes. Among these MUST be the following:
- href attribute node (see 2.2.2 href attribute information item).
dm:nilled MUST be false.
Other accessors such as dm:parent MUST be set according to the context.
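
The constraints above can be illustrated with a minimal Python sketch. It assumes an ElementTree representation of the infoset; the helper names make_include and is_include and the example href value are hypothetical, and only the constraints that ElementTree can express (name, href attribute, no element children) are checked.

```python
import xml.etree.ElementTree as ET

# Namespace name taken from the dm:node-name above; the "xbinc" prefix is arbitrary.
XBINC_NS = "http://www.w3.org/2003/06/soap/features/binary-inclusion"
INCLUDE_TAG = f"{{{XBINC_NS}}}Include"


def make_include(href):
    """Build an xbinc:Include element carrying the required href attribute."""
    # href is an unqualified (no-namespace) attribute; the element has no children.
    return ET.Element(INCLUDE_TAG, {"href": href})


def is_include(elem):
    """Check the constraints that are visible through ElementTree."""
    return (
        elem.tag == INCLUDE_TAG
        and "href" in elem.attrib
        and len(elem) == 0  # no element nodes among the children
    )


ET.register_namespace("xbinc", XBINC_NS)
include = make_include("cid:photo@example.org")  # hypothetical href value
print(ET.tostring(include, encoding="unicode"))
```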

2.2.2 href attribute information item

The href attribute node has the following data model accessor values:

dm:node-kind MUST be attribute.
dm:node-name MUST be {(no-namespace-URI);href}.
dm:type MUST be xs:anyURI.
dm:typed-value MUST be a URI referencing the part of the multipart serialization comprising the data logically included by the dm:parent element (i.e., the xbinc:Include element).
Other accessors such as dm:parent MUST be set according to the context.
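
As an illustration of how an href value might be matched to a MIME part, the following Python sketch derives the Content-ID field-value that corresponds to a 'cid' URI. It assumes the 'cid' scheme is used as defined in RFC 2392, which this document does not cite; the function name content_id_for is hypothetical.

```python
from urllib.parse import unquote


def content_id_for(href):
    """Return the Content-ID field-value matching a cid: URI, or None.

    Assumes RFC 2392 semantics: the part after "cid:" is the URL-escaped
    Content-ID without its enclosing angle brackets.
    """
    if href[:4].lower() != "cid:":
        return None  # other schemes are matched against Content-Location instead
    return "<" + unquote(href[4:]) + ">"


print(content_id_for("cid:photo@example.org"))
```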

2.2.3 mime:content-type attribute information item

[ TBD ]

3. MIFFY Processing Model

Unless otherwise stated, processing MUST be semantically equivalent to performing the specified steps separately, and in the order given.

3.1 Creating MIFFY Documents

To create a MIFFY Document from an XML Infoset:

Create a MIME Multipart/Related package.
Create an Optimized Infoset by identifying Optimization Candidates and performing the following steps for each:
1. Replace it in the Optimized Infoset with an xbinc:Include element information item.
2. Transform the replaced characters into binary data by decoding them as base64.
3. Serialize the binary data into a MIME part inside the Multipart/Related package, with the following MIME metadata:
  - If the URI used in the value of the href attribute information item has a 'cid' scheme, the MIME part's Content-ID header field MUST have a corresponding field-value.
  - Otherwise, the MIME part's Content-Location header field MUST have a field-value identical to the URI in the value of the href attribute information item.
  - If the parent element information item of the Optimization Candidate has a mime:content-type attribute information item, its value SHOULD be identical to the field-value of the MIME part's Content-Type header field.
Serialize the Optimized Infoset in the Multipart/Related package as XML 1.0 and identify it as the root MIME part.
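
The steps above can be sketched in Python using the standard email and ElementTree modules. This is an illustration under several assumptions, not a reference implementation: application/xml stands in for the root media type (which is still TBD), the Content-ID values and the function name create_miffy are hypothetical, the root part is identified with the "start" parameter of RFC 2387, and the mime:content-type attribute is not handled because its namespace is not yet defined.

```python
import base64
import xml.etree.ElementTree as ET
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart

XBINC_NS = "http://www.w3.org/2003/06/soap/features/binary-inclusion"
ROOT_TYPE = "application/xml"    # placeholder: the real media type is still TBD
ROOT_CID = "root@miffy.example"  # hypothetical Content-ID for the root part


def create_miffy(target_root, candidates):
    """Build a Multipart/Related package from an ElementTree infoset.

    `candidates` lists elements whose character content is base64 text.
    The tree is modified in place and becomes the Optimized Infoset.
    """
    package = MIMEMultipart("related", type=ROOT_TYPE, start=f"<{ROOT_CID}>")
    data_parts = []
    for index, elem in enumerate(candidates):
        # Step 2: decode the characters as base64 (whitespace is dropped first).
        try:
            text = "".join((elem.text or "").split())
            data = base64.b64decode(text, validate=True)
        except ValueError:
            continue  # not decodable: leave this content unoptimized
        # Step 1: replace the characters with an xbinc:Include element.
        content_id = f"part{index}@miffy.example"  # hypothetical naming scheme
        elem.text = None
        ET.SubElement(elem, f"{{{XBINC_NS}}}Include", {"href": f"cid:{content_id}"})
        # Step 3: one MIME part per candidate; a cid: href needs a matching
        # Content-ID. MIMEApplication applies a base64 transfer encoding by
        # default; a real serializer would more likely emit raw binary.
        part = MIMEApplication(data)  # application/octet-stream
        part.add_header("Content-ID", f"<{content_id}>")
        data_parts.append(part)
    # Final step: serialize the Optimized Infoset as XML 1.0, mark it as root.
    root_part = MIMEApplication(ET.tostring(target_root), "xml")
    root_part.add_header("Content-ID", f"<{ROOT_CID}>")
    package.attach(root_part)
    for part in data_parts:
        package.attach(part)
    return package


doc = ET.fromstring("<doc><photo>aGVsbG8=</photo></doc>")
package = create_miffy(doc, [doc.find("photo")])
print(package.as_string())
```

Identifying the root part by Content-ID rather than by position keeps the sketch consistent with the rule that the ordering of MIME parts is not significant.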

3.2 Interpreting MIFFY Documents

To create an XML Infoset from a MIFFY document:

Considering the MIFFY Document as a MIME Multipart/Related package, identify the root part as the Optimized Infoset.
For each occurrence of the xbinc:Include element in the Optimized Infoset:
1. Locate the MIME part identified by the URI in the href attribute information item.
2. Replace the xbinc:Include element information item with the canonical base64 encoding of the entity body of the identified MIME part.
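
The interpretation steps can be sketched in Python as follows. The sketch assumes that the root part is named by the "start" parameter of RFC 2387 (falling back to the first part), that 'cid' URIs follow RFC 2392, and that application/xml stands in for the undefined root media type; SAMPLE, bare and interpret_miffy are hypothetical names used only for illustration.

```python
import base64
import email
import xml.etree.ElementTree as ET
from urllib.parse import unquote

INCLUDE = "{http://www.w3.org/2003/06/soap/features/binary-inclusion}Include"

# A hand-written package; the root part is deliberately not the first part.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="b1";'
    b' type="application/xml"; start="<root@miffy.example>"\r\n'
    b"\r\n--b1\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-ID: <part0@miffy.example>\r\n"
    b"\r\naGVsbG8=\r\n--b1\r\n"
    b"Content-Type: application/xml\r\n"
    b"Content-ID: <root@miffy.example>\r\n\r\n"
    b'<doc><photo><x:Include xmlns:x="http://www.w3.org/2003/06/soap/'
    b'features/binary-inclusion" href="cid:part0@miffy.example"/></photo></doc>'
    b"\r\n--b1--\r\n"
)


def bare(value):
    """Strip whitespace and angle brackets from a Content-ID style value."""
    return value.strip().strip("<>")


def interpret_miffy(raw):
    """Rebuild an ElementTree infoset from the bytes of a MIFFY document."""
    package = email.message_from_bytes(raw)
    parts = package.get_payload()
    by_cid = {bare(p["Content-ID"]): p for p in parts if p["Content-ID"]}
    by_location = {
        p["Content-Location"].strip(): p for p in parts if p["Content-Location"]
    }
    # Root part: named by the "start" parameter, else the first part.
    start = package.get_param("start")
    root_part = by_cid.get(bare(start), parts[0]) if start else parts[0]
    root = ET.fromstring(root_part.get_payload(decode=True))
    for parent in list(root.iter()):
        for child in [c for c in parent if c.tag == INCLUDE]:
            # Step 1: locate the MIME part named by href.
            href = child.get("href")
            if href[:4].lower() == "cid:":
                part = by_cid[unquote(href[4:])]
            else:
                part = by_location[href]
            # Step 2: the canonical base64 text replaces the Include element.
            text = base64.b64encode(part.get_payload(decode=True)).decode("ascii")
            text += child.tail or ""
            position = list(parent).index(child)
            if position == 0:
                parent.text = (parent.text or "") + text
            else:
                previous = parent[position - 1]
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root


restored = interpret_miffy(SAMPLE)
print(ET.tostring(restored, encoding="unicode"))
```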

4. Selecting Optimization Candidates

Optimization in MIFFY is limited to the content of those element information items which contain characters that can be interpreted as base64-encoded data. Attributes and non-base64-compatible character data cannot be successfully optimized by MIFFY.

Because Optimization Candidates are transformed to binary data, and then re-encoded as canonical base64, care should be taken in selecting them. In particular, if the lexical form of the base64 data is important to preserve (e.g., a whitespace-sensitive signature algorithm is being used), it is important to ensure that either the form in the Target Infoset is canonical, or that such content is not selected as an Optimization Candidate.
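
One way to test whether a lexical form would survive the round trip is to decode and re-encode it and compare the result with the original text. The Python sketch below does this; the function name is_canonical_base64 is hypothetical, and the check is only an illustration of the concern described above.

```python
import base64


def is_canonical_base64(text):
    """True if decoding and re-encoding `text` reproduces it exactly."""
    try:
        data = base64.b64decode(text, validate=True)
    except ValueError:
        # Whitespace, bad padding or other non-alphabet characters.
        return False
    return base64.b64encode(data).decode("ascii") == text


print(is_canonical_base64("aGVsbG8="))    # canonical form
print(is_canonical_base64("aGVs bG8="))   # embedded whitespace would be lost
```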

If an Optimization Candidate cannot be successfully encoded into the Optimized Infoset, implementations SHOULD behave as if that portion of the Target Infoset were not identified as an Optimization Candidate.
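
The fallback behavior can be sketched as a function that either returns the binary data for an element's content or signals that the content should be left as it is. The name decode_candidate is hypothetical, and skipping empty content is a choice made by this sketch rather than a requirement of this document.

```python
import base64
import xml.etree.ElementTree as ET


def decode_candidate(elem):
    """Return the binary data for elem's content, or None to leave it as is."""
    if len(elem) > 0:
        return None  # element or mixed content cannot be optimized
    text = "".join((elem.text or "").split())  # drop insignificant whitespace
    if not text:
        return None  # sketch-specific choice: do not optimize empty content
    try:
        return base64.b64decode(text, validate=True)
    except ValueError:
        return None  # behave as if it had never been selected


print(decode_candidate(ET.fromstring("<a>aGVs\n bG8=</a>")))
print(decode_candidate(ET.fromstring("<a>not base64!</a>")))
```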

5. Identifying MIFFY Documents

[ TBD, depending on media type feedback ]

6. Security Considerations

[ TBD ]

A. Mapping between Infosets and Data Models

This specification uses the XQuery 1.0 and XPath 2.0 Data Model to augment the information available in Infosets with typing information, which is used as the basis for optimization. This Appendix sets out in detail the correspondence between Infosets and data models, for purposes of implementation of this specification.

A.1 Serialization Infoset Mapping

The [XML Query Data Model] provides a normative mapping from the Post Schema Validation Infoset to a data model. Except as specified here, that mapping is used to construct data models from infosets during serialization. The differences are as follows:

This specification does not require schema validation by any party. The means by which dm:type and dm:typed-value are determined are at the discretion of the serializer, except that the dm:typed-value must be consistent with the dm:string-value for the assigned dm:type.
In the case where no type information is available, perhaps because no schema validation was performed or because no type was assigned by such validation, the conventions described at dm:type MUST be used to indicate that the type is indeterminate.

EDNOTE [NRM]: Should xdt:untypedAtomic be used for leaf nodes with only text content? Seems preferable to me, but for some reason the dm is looser.

A.2 Deserialization Infoset Mapping

The [XML Query Data Model] provides a normative mapping from a Data Model to an Infoset. That mapping is used to construct an infoset during deserialization. Note that this mapping makes use only of dm:string-value and text node dm:children: in no case is the dm:type or dm:typed-value used to construct the Infoset. Thus, this mapping enforces the goal of this feature, which is to use type information as a means of optimization, without affecting application semantics.