Comments on the XML Encryption Requirements

Greetings, encryption folk!

The XML Schema Working Group has reviewed the XML Encryption Requirements
document.  I have been asked to send you the following comments.
Please feel free to forward any questions of clarity or intent back to
me.

Thanks,
- Roß Thompson
  on behalf of the XML Schema Working Group

--------------------

When validating an XML document, the model that XML Schema uses is
that the document be in the form of an infoset, not an XML data
stream.  The purpose of an XML Schema is to express constraints on the
form that the infoset can take.  There is a feeling among some of the
Schema WG that trying to validate an infoset that contains encrypted
data amounts to a mixture of levels -- the "natural" schema you would
write for an XML document would describe the information derived from
the unencrypted data.  The act of encrypting data obscures that data
from an XML processor which does not possess the decryption keys, and
therefore changes the infoset that derives from the serialized
document.

Virtually all of the issues mentioned below arise because of this
level mixing.  The Schema working group has, from time to time,
considered the viability of co-occurrence constraints, which might be
used to alleviate some of the problems, but Schema has no immediate
plans to include such constraints.  We also discussed the possibility
of using complex type unions to address some of the concerns, but we
similarly have no immediate plans to introduce such types.

Finally, substep 1 of step 3 of the encryption processing rules
(listed in section 4.1) specifies the encryption of character strings.
Would it be better to sign or encrypt pieces of the infoset?  For
example, if ignorable whitespace is introduced into the document's
serialized form, do you want the encrypted form of the document to be
sensitive to this?  Schema does not presume to tell Encryption how to
do their business, but we felt this was an issue worth raising.
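
To make the concern concrete, here is a minimal Python sketch. It is
an illustration only: the two "order" documents are invented for the
purpose, and SHA-256 merely stands in for any operation (signing or
encrypting) that works on the serialized characters.  The two
serializations differ in attribute order, quoting, and whitespace
inside tags, yet they parse to the same element structure.

```python
import hashlib
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same information: attribute
# order, quote style, and whitespace inside the tags differ.
doc_a = b'<order id="7" status="open"><item>corn</item></order>'
doc_b = b"<order  status='open'\n   id='7' ><item >corn</item ></order >"


def shape(elem):
    """Reduce a parsed element to a comparable, order-independent form."""
    return (
        elem.tag,
        sorted(elem.attrib.items()),
        elem.text or "",
        elem.tail or "",
        [shape(child) for child in elem],
    )


# A byte-level operation sees two different inputs...
same_bytes = hashlib.sha256(doc_a).digest() == hashlib.sha256(doc_b).digest()
# ...while a processor working on the parsed form sees one document.
same_tree = shape(ET.fromstring(doc_a)) == shape(ET.fromstring(doc_b))

print(same_bytes)  # False
print(same_tree)   # True
```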

To amplify this point, consider the following two cases:

    1) Infosets may exist for which no XML serialization is ever
    created.  Consider a document created through a DOM, stored in an
    XML database that uses optimized internal representations of the
    Infoset.  Presumably, the consuming application could be provided
    a DOM or SAX interface without ever creating an "<...>" form
    serialization.  If that database is used as the backing store for
    a workflow application, it's extremely useful to be able to
    encrypt fragments of the document, but creating and storing an XML
    1.0 or XML 1.1 serialization merely to encrypt it is artificial.
    Note that Schema has taken some trouble to base itself on Infoset,
    so that such non-serialized documents can indeed be validated.

    2)  Consider this instance:

    <xx:foo xmlns:xx="namespace-uri1"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
       <xx:bar xsi:type="xx:myBoolean">0</xx:bar>
    </xx:foo>

    with a schema that includes

    <xs:simpleType name="myBoolean">
       <xs:restriction base="xs:boolean">
          <xs:pattern value="0|1"/>
       </xs:restriction>
    </xs:simpleType>
    <xs:simpleType name="myUnion">
       <xs:union memberTypes="xs:integer myBoolean"/>
    </xs:simpleType>
    <xs:element name="bar" type="myUnion"/>

    If the element bar is encrypted, rebinding the prefixes before
    decryption will cause validation to fail (among other things),
    first because the prefix on element 'bar' will be wrong, and
    second because the prefix in the value of 'xsi:type' will be
    wrong, in a way that has the potential to affect validation or
    even the interpreted value of the element.

    This says that there is a strong requirement that no application
    ever change namespace prefixes on a document with encrypted
    elements.  We find this to be worthy of a very salient warning to
    users and implementors, at the very least.  We would like to
    encourage a mechanism that was not as fragile in this area, and
    which did not introduce a non-compositionality of processing.

    We recognize that the namespace abbreviation is part of the
    infoset, and that resolving this issue will require some hard
    thinking.  It's a nasty problem, and we don't have a ready
    solution, but we think that is what makes it worth considering.
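
The fragility described in case 2 can be sketched in a few lines of
Python.  This is a hedged illustration, not a description of any
encryption processor: the string "plaintext" stands in for the result
of decrypting the bar element, no actual cryptography is performed,
and "namespace-uri2" is a made-up second namespace.  The fragment is
placed back into a parent whose prefix has been rebound from xx to yy
while the fragment was opaque.

```python
import xml.etree.ElementTree as ET

XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Stand-in for the decrypted content of the encrypted element 'bar'.
plaintext = '<xx:bar xsi:type="xx:myBoolean">0</xx:bar>'


def wrap(outer_prefix, fragment, extra=""):
    """Place the fragment inside a parent that binds outer_prefix."""
    return (
        f'<{outer_prefix}:foo xmlns:{outer_prefix}="namespace-uri1" '
        f'xmlns:xsi="{XSI}"{extra}>{fragment}</{outer_prefix}:foo>'
    )


# Original context: the prefix xx is still bound, so everything resolves.
original = ET.fromstring(wrap("xx", plaintext))
print(original[0].tag)  # {namespace-uri1}bar

# Prefix rebound to yy while 'bar' was encrypted: xx is now unbound.
try:
    ET.fromstring(wrap("yy", plaintext))
    rebound_parses = True
except ET.ParseError:
    rebound_parses = False
print(rebound_parses)  # False

# Worse: xx is reused for another namespace.  Parsing succeeds, the
# element silently changes namespace, and the xsi:type value is still
# the uninterpreted string "xx:myBoolean".
shadowed = ET.fromstring(wrap("yy", plaintext, ' xmlns:xx="namespace-uri2"'))
print(shadowed[0].tag)                   # {namespace-uri2}bar
print(shadowed[0].get(f"{{{XSI}}}type"))  # xx:myBoolean
```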

These two points together argue strongly that it is the contents of
the infoset that should be signed, and not the serialization of the
infoset.

              ------------------------------------------

Some observations that came to mind when reading through the proposed
specification are:

- If the XML processor knows the decryption keys, then the infoset for
  the document is just as if the plain text XML were in place.  In
  this case, there is no impact as regards Schema, because the fact of
  encryption has been hidden from the schema processor.  In short,
  this is not an issue.

- If the XML processor does not know the decryption keys, then the XML
  infoset will contain the elements that represented the data in its
  encrypted form.  In this case, there are severe limitations on
  schema validation, because as far as validation is concerned, the
  encryption elements have no special status.  In particular:

  - There will be no way for the schema validation to verify that the
    encrypted XML conforms with the schema.

  - Unless the schema is written with encryption in mind, the
    processor will not be able to strictly assess even the unencrypted
    portions of the document against the schema.  If lax validation is
    allowed, then certain cases will validate correctly, but most
    won't.  Obviously, skip validation will pass, but this provides no
    information about document correctness.

- Writing a schema that allows encryption will be difficult, unless
  encryption is only allowed at a few certain points in the document.
  Consider the following schema:

<xs:schema>
   <xs:element name="the_corn">
     <xs:complexType>
       <xs:sequence>
         <xs:element name="kernel" type="xs:string"/>
         <xs:element name="husk" type="xs:string"/>
         <xs:element name="cob" type="xs:string"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
</xs:schema>

To add the ability to encrypt the children of the_corn, you would have
to write:

<xs:schema>
   <xs:element name="the_corn">
     <xs:complexType>
       <xs:choice>
         <xs:sequence>
           <xs:choice>
             <xs:element name="kernel" type="xs:string"/>
             <xs:element ref="enc:EncryptedData"/>
           </xs:choice>
           <xs:choice>
             <xs:element name="husk" type="xs:string"/>
             <xs:element ref="enc:EncryptedData"/>
           </xs:choice>
           <xs:choice>
             <xs:element name="cob" type="xs:string"/>
             <xs:element ref="enc:EncryptedData"/>
           </xs:choice>
         </xs:sequence>
         <xs:element ref="enc:EncryptedData"/>
       </xs:choice>
     </xs:complexType>
   </xs:element>
</xs:schema>

And even this doesn't capture it, because you really want to be able
to encrypt "kernel" and "husk" in a single EncryptedData block, and
have "cob" be plain text.  In fact, capturing that additional
complexity requires that you violate the Unique Particle Attribution
(UPA) constraint, so
there is no legal schema that has this flexibility.  (Actually, the
UPA constraint makes even the above schema illegal if the minOccurs !=
maxOccurs on any of the children of the_corn.)
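
To give a feel for the gap, the following Python sketch enumerates
the child sequences a validator without keys might see under
the_corn if any contiguous run of children may be replaced by a
single EncryptedData element, and compares them with the sequences
the schema above accepts.  The function names are ours, and the model
(runs of adjacent children, each child encrypted at most once) is an
assumption made for illustration.

```python
from itertools import product

MARK = "EncryptedData"


def visible_sequences(children):
    """All child sequences visible when contiguous runs may be encrypted."""
    if not children:
        return {()}
    result = set()
    # Leave the first child in plain text.
    for tail in visible_sequences(children[1:]):
        result.add((children[0],) + tail)
    # Or replace the first k children by one EncryptedData element.
    for k in range(1, len(children) + 1):
        for tail in visible_sequences(children[k:]):
            result.add((MARK,) + tail)
    return result


children = ("kernel", "husk", "cob")
wanted = visible_sequences(children)

# What the schema above accepts: each child individually plain or
# encrypted, or the entire content as one EncryptedData element.
accepted = set(product(*[(name, MARK) for name in children])) | {(MARK,)}

missing = wanted - accepted
print(len(wanted), len(accepted))  # 12 9
print(sorted(missing))
# [('EncryptedData', 'EncryptedData'), ('EncryptedData', 'cob'),
#  ('kernel', 'EncryptedData')]
```

Note that a sequence such as (EncryptedData, EncryptedData) can arise
in two ways (kernel alone followed by husk and cob together, or the
reverse), which hints at why a content model covering all of these
runs into attribution problems.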

A possible approach to resolving this problem, which Schema would
encourage you to consider, is to specify not a specific element, but a
complex type of encrypted data.  This would allow the schema author to
specify element X and an encrypted form of X as alternatives.  So, the
original schema might be rewritten thus:

<xs:schema>
   <xs:element name="the_corn">
     <xs:complexType>
       <xs:sequence>
         <xs:choice>
           <xs:element name="kernel" type="xs:string"/>
           <xs:element name="kernel-enc" type="enc:EncryptedData"/>
         </xs:choice>
         <xs:choice>
           <xs:element name="husk" type="xs:string"/>
           <xs:element name="husk-enc" type="enc:EncryptedData"/>
         </xs:choice>
         <xs:choice>
           <xs:element name="cob" type="xs:string"/>
           <xs:element name="cob-enc" type="enc:EncryptedData"/>
         </xs:choice>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
</xs:schema>

This is a fine schema, in terms of the Unique Particle Attribution
constraint, and allows for arbitrary decisions on which of the
children of the_corn are encrypted.  Unfortunately, this approach
still does not allow for encrypting multiple children in the same
encrypted data segment.  Adding such complexity to the schema would
make it unwieldy.

(It was observed in discussing this proposal that developers of
encryption processors may prefer an element, which they could be
guaranteed of recognizing by its QName, to a type, which would require
them to use a schema processor upstream.  One solution to this dilemma
might be to specify a required attribute with a fixed value as part of
the complex type (so that an element such as husk-enc, say, would be
required to carry the attribute value specification
enc:EncryptedData="...").  The value of the attribute could be a
boolean, or a version number, or information about the key, or a
public key, or what have you.)

(Another observation made during the discussion: An agent in possession
of a schema for the plaintext document will be able to infer
information about which tags are encrypted.  If a schema calls for
elements A, B, and C, in order, and the instance document contains A,
B, and an EncryptedData tag, it is fairly obvious what tag has been
encrypted.  This could, perhaps, facilitate some decryption attacks,
because it gives the attacker knowledge of some of the plaintext.  In
particular, it is very likely that the text begins with "<C" and ends
with "</C>".  We recommend a note for implementors and users of XML
Encryption that warns them of this.)
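
The inference is simple enough to sketch in Python.  The function
name and the one-marker-per-element assumption are ours, chosen only
to illustrate the observation above.

```python
def infer_encrypted(expected, observed, marker="EncryptedData"):
    """Guess which expected elements sit behind each marker.

    Assumes each marker replaces exactly one expected element, so the
    two sequences line up position by position.
    """
    if len(expected) != len(observed):
        raise ValueError("sequences do not align one-to-one")
    return [name for name, seen in zip(expected, observed) if seen == marker]


hidden = infer_encrypted(["A", "B", "C"], ["A", "B", "EncryptedData"])
print(hidden)  # ['C']

# Likely known plaintext at the edges of the encrypted octets.
guesses = [(f"<{name}", f"</{name}>") for name in hidden]
print(guesses)  # [('<C', '</C>')]
```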

If the number of places where encryption can appear in the instance
document is fairly small, then doctoring the schema as above is practical,
though perhaps painful.  If it is not small, then it is really
impractical, which in turn means that validation of documents
containing encrypted content is not practical for a processor that
does not have access to the decryption keys.

Received on Wednesday, 16 October 2002 18:45:01 UTC