Re: Last call comments on XML Encryption specs from Martin Duerst on 2002-01-08 (xml-encryption@w3.org from January 2002)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 08 Jan 2002 17:45:56 +0900
To: reagle@w3.org, dee3@torque.pothole.com, "Takeshi Imamura" <IMAMU@jp.ibm.com>
Cc: w3c-i18n-ig@w3.org, xml-encryption@w3.org
Message-Id: <4.2.0.58.J.20020108165418.04a45418@localhost>

Hello Joseph,

Sorry for the delay over the holidays.

At 15:33 01/12/17 -0500, Joseph Reagle wrote:
>On Saturday 01 December 2001 21:02, Martin Duerst wrote:
> > The syntax/processing is basically right (in the sense
> > that XML is serialized using UTF-8). However, there is no
> > corresponding requirement for it, and there is none of
> > the details, 'health warnings' and security warnings that
> > we worked out for the XML Signature spec and that I would
> > have expected to be reused.
>
>Hi Martin, while I learned a lot from XML Signature I fear I haven't
>completely internalized all the i18n-goodness yet, so I appreciate
>continued patience!
>
> > For the Requirements doc at:
> > http://www.w3.org/TR/2001/WD-xml-encryption-req-20011018
> >
> > - There should be a requirement that says that encryption
> >    should work (in the sense that you get the original
> >    stuff back after decryption) under Infoset-preserving
> >    transformations of the XML that contains the encrypted
> >    pieces. [This makes sure that when encrypting XML, it
> >    has to be in a defined encoding (as it currently is).]
>
>I'm not quite sure what you mean. I prefer not to mention Infoset since
>we're still working with XML1.0 and XPath, and I'm not sure if we will
>issue another version of the requirement, but I'd like to make sure this
>concern (once understood) is addressed by the spec.

It is already addressed by the spec. But there is no corresponding
requirement. While the requirement that one gets the same stuff
back that one encrypted seems pretty straightforward, at least
for character encoding issues, it might be easy to get wrong.
You can get rid of mentioning the infoset by saying that the
following two sequences of operations must lead to the same
document:

a) 1) Encrypting part of the document
    2) Transcoding it from one character encoding to another
    3) Decrypting what was encrypted in step 1)
b) 1) Encrypting part of the document
    2) Decrypting what was encrypted in step 1)
    3) Transcoding it from one character encoding to another
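
The equivalence of sequences a) and b) can be checked mechanically. The following is only a minimal sketch: the XOR "cipher", the <enc> wrapper element, and all function names are hypothetical stand-ins, not anything from the spec. What it does take from the spec is that the plaintext is serialized as UTF-8 before encryption.

```python
import base64
import re

KEY = b"demo-key"  # toy key; XOR is NOT real encryption, only a stand-in


def xor(data: bytes) -> bytes:
    """Symmetric toy transform used for both 'encryption' and 'decryption'."""
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))


def encrypt_part(doc: bytes, encoding: str, part: str) -> bytes:
    """Replace `part` by a base64 blob of its encrypted UTF-8 octets."""
    text = doc.decode(encoding)
    blob = base64.b64encode(xor(part.encode("utf-8"))).decode("ascii")
    return text.replace(part, "<enc>" + blob + "</enc>").encode(encoding)


def decrypt_part(doc: bytes, encoding: str) -> bytes:
    """Decrypt the blob to UTF-8 octets, then re-serialize in `encoding`."""
    text = doc.decode(encoding)

    def restore(match):
        return xor(base64.b64decode(match.group(1))).decode("utf-8")

    restored = re.sub(r"<enc>([A-Za-z0-9+/=]*)</enc>", restore, text)
    return restored.encode(encoding)


def transcode(doc: bytes, src: str, dst: str) -> bytes:
    return doc.decode(src).encode(dst)


original = "<p>caf\u00e9 <s>na\u00efve</s></p>".encode("iso-8859-1")
encrypted = encrypt_part(original, "iso-8859-1", "<s>na\u00efve</s>")

# a) encrypt, transcode, decrypt
a = decrypt_part(transcode(encrypted, "iso-8859-1", "utf-16-le"), "utf-16-le")
# b) encrypt, decrypt, transcode
b = transcode(decrypt_part(encrypted, "iso-8859-1"), "iso-8859-1", "utf-16-le")
assert a == b
```

The two results agree only because the encrypted octets are always UTF-8, independent of the encoding of the surrounding document. If the part were encrypted in the document's own encoding, sequence a) would decrypt ISO-8859-1 octets into a UTF-16 document and the results would differ.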

>In the spec we say on
>encryption, "obtain the octets by serializing the data in UTF-8 as
>specified in [XML]" and on decryption if the data was an XML element or
>content it's UTF-8; if not, it's just octets. BTW: Did you have a chance to
>look at:
>   http://www.w3.org/Encryption/2001/Drafts/xmlenc-decrypt.html
>There's a lot of XML/character processing involved there.

I had a look at it. I haven't found anything about
character encoding issues. Did I miss something?


> > - There needs to be a requirement to use NFC when converting
> >    from a legacy encoding to UTF-8 when encrypting. This should
> >    be very much the same as in XML Signature, Section 6.5
> >    (http://www.w3.org/TR/xmldsig-core/#sec-c14nAlg), last two
> >    paragraphs. There should also be something like the last
> >    paragraph before section 7.1
> >    (http://www.w3.org/TR/xmldsig-core/#sec-XML-Canonicalization).
>
>I've added to the last sentence in section 3.1:
>
>http://www.w3.org/Encryption/2001/Drafts/xmlenc-core/Overview.html#sec-EncryptedType
>EncryptedType is the abstract type from which EncryptedData and
>EncryptedKey are derived. While these two latter element types are very
>similar with respect to their content models, a syntactical distinction is
>useful to processing. Implementation MUST generate laxly schema valid
>[XML-schema] EncryptedData or EncryptedKey as specified by the subsequent
>schema declarations /+and SHOULD create XML content (EncryptedType elements
>and their descendants/content) in Normalization Form C [NFC,
>NFC-Corrigendum].+/

The 'SHOULD create XML content in Normalization Form C' is for
content generated by the encryption. This is okay, but:

- In the newest version of the Character Model, we want this
   to be a MUST.
- This does not yet say anything about NFC when something being
   encoded is serialized in UTF-8. In that case, it should say
   that NFC MUST be applied when this involves conversion from
   a legacy (i.e. non-Unicode) encoding.
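
To illustrate the second point, here is a minimal sketch of normalizing to NFC while transcoding from a legacy encoding to UTF-8. The function name is hypothetical, and cp1258 (Windows Vietnamese) is chosen only because it is a legacy encoding that carries combining marks as separate characters, so a plain transcoding can yield non-normalized text.

```python
import unicodedata


def legacy_to_utf8_nfc(octets: bytes, legacy_encoding: str) -> bytes:
    """Transcode legacy-encoded octets to UTF-8, applying NFC on the way."""
    chars = octets.decode(legacy_encoding)
    return unicodedata.normalize("NFC", chars).encode("utf-8")


# "e" followed by COMBINING ACUTE ACCENT, as stored in cp1258.
legacy = "e\u0301".encode("cp1258")

plain = legacy.decode("cp1258").encode("utf-8")   # no normalization
normalized = legacy_to_utf8_nfc(legacy, "cp1258")  # NFC applied

# Without NFC the UTF-8 octets spell e + U+0301; with NFC they spell U+00E9.
assert plain != normalized
assert normalized == "\u00e9".encode("utf-8")
```

Two parties that transcode the same legacy data with and without NFC would encrypt different octets, which is the kind of mismatch the requirement is meant to rule out.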


> > - In section 4.2 Decryption, in step 4.3, the wording 'replace' ...
> >    'by the UTF-8 encoded characters' may easily be misunderstood.
> >    After decryption, there will be a byte stream with characters
> >    encoded in UTF-8, but the replacement operation has to make
> >    sure that the appropriate character encoding conversion
> >    (transcoding) is applied. As an example, if the decrypted
> >    element or element content is inserted into a DOM, there
> >    has to be a conversion from UTF-8 to UTF-16. This should be
> >    made clear.
>
>I agree. Takeshi (and others), do we want to say, "If the document into
>which the replacement is occurring is not UTF-8, the decryptor MUST
>transcode the UTF-8 encoded characters into the target encoding." ?

This would be nice. Please add this text.
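
As a minimal sketch of what that sentence asks of a decryptor (the function name is hypothetical, and the octets are assumed to be already decrypted):

```python
def transcode_for_target(decrypted_utf8: bytes, target_encoding: str) -> bytes:
    """Convert decrypted UTF-8 octets into the encoding of the target document."""
    chars = decrypted_utf8.decode("utf-8")  # raises on malformed UTF-8
    return chars.encode(target_encoding)    # raises if a character does not fit


decrypted = "<s>na\u00efve</s>".encode("utf-8")

# Target is a UTF-16 structure such as a DOM: every character becomes
# one or two 16-bit code units.
as_utf16 = transcode_for_target(decrypted, "utf-16-le")

# Target is an ISO-8859-1 document: U+00EF becomes the single octet 0xEF.
as_latin1 = transcode_for_target(decrypted, "iso-8859-1")
```

If the target encoding cannot represent a decrypted character, the conversion above fails with UnicodeEncodeError. In character data, one option is Python's `errors="xmlcharrefreplace"` handler, which emits numeric character references instead, but that is not valid inside element or attribute names.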


> > - There needs to be some text about security risks associated
> >    with UTF-8. Assume that somebody knows that the encrypted
> >    text is Old Italic (http://www.unicode.org/charts/PDF/U10300.pdf,
> >    no spaces or punctuation). In this case, UTF-8 uses four bytes per
> >    character, and three of them are always the same, and the top
> >    two (or three if there are no numbers) bits of the last byte
> >    are also always the same.
>
>This is addressed by the Nonce. Don, I've moved most of the Nonce and
>IV discussion to a new section 6.3, and give some examples (including
>Unicode) from 3.3. What should we call that section?

I think that the Nonce can help quite a bit in some situations.
But I'm not really sure at all that it will help much in the
situation I have described. Let's assume the attacker knows
that most of the encrypted text (rather than all) is in Old
Italic. What you are saying is that if the non-Old Italic
text is at the start of the data, attacks are much more
difficult than if the non-Old Italic text is in the
middle or at the end. This may indeed be true for attacks
that are based on looking at the start of the encoded sequence.
But there are most probably also attacks that can look at
any part of the data and try to find out something about it.
In other terms, the nonce doesn't really increase the entropy,
it just conceals it.
Of course, I'm not an expert here, but I'd rather be sure.
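
The amount of known plaintext in the Old Italic case can be checked directly. This is only a sketch; the code point ranges (letters U+10300 to U+1031E, numerals U+10320 to U+10323) are my reading of the chart and are an assumption.

```python
letters = [chr(cp) for cp in range(0x10300, 0x1031F)]   # U+10300..U+1031E
numerals = [chr(cp) for cp in range(0x10320, 0x10324)]  # U+10320..U+10323


def utf8(ch: str) -> bytes:
    return ch.encode("utf-8")


# Every character takes four bytes, and the first three never vary.
lengths = {len(utf8(c)) for c in letters + numerals}
prefixes = {utf8(c)[:3] for c in letters + numerals}

# Top bits of the fourth byte: three fixed bits for letters only,
# two fixed bits once numerals are included.
letter_top3 = {utf8(c)[3] >> 5 for c in letters}
all_top2 = {utf8(c)[3] >> 6 for c in letters + numerals}

assert lengths == {4}
assert prefixes == {b"\xf0\x90\x8c"}
assert letter_top3 == {0b100}
assert all_top2 == {0b10}
```

So for letters-only text an attacker knows 27 of every 32 plaintext bits (26 of 32 with numerals), at every position in the data and not just at the start.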



> > - URI -> anyURI/IRI: According to the Character Model,
> >    http://www.w3.org/TR/charmod/#sec-URIs, you have to make sure
> >    that wherever you use URIs, non-ASCII characters are allowed,
> >    and that conversion to ASCII only is done as late as possible.
> >    You already have this right in the Schema, by using anyURI,
> >    but you should make it clear in the text.
>
>I'm not sure where we would do that. In xmldsig, we had a whole URI section
>of the spec:
>   http://www.w3.org/TR/2001/PR-xmldsig-core-20010820/#sec-URI
>so it made sense to specify this as completely as possible. However, xenc
>makes more casual use of URIs with a few "this is like xmldsig Reference
>processing."  So I can't find a place where those four paragraphs on
>RFC2396+RFC2732+encoding_of_disallowed-characters, etc. would fit.

Because you use XML Schema, and XML Schema now does this when
you use the anyURI datatype, you don't have to include the
actual text.

>As you say, we're
>using anyURI, so do we still need this text in every spec that uses a URI?

No, actually not. But I guess it would be better to mention at
least once that what you mean by URI throughout the spec is
not exactly the same as the URI in RFC 2396.

Actually, I just went through the whole spec looking for the
string 'URI'. It turns up many times, but it's never defined.
There is a reference with a label 'URI' (RFC 2396), but it's
never actually referenced in the text. Also, RFC 2732 is never
referenced.
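
For readers unfamiliar with the difference, here is a minimal sketch of the late conversion to ASCII: non-ASCII characters stay in the value as written and are escaped (UTF-8 octets, then %HH) only when an RFC 2396 URI is actually needed. The function name and the example URI are hypothetical, and the sketch does not attempt anything special for non-ASCII host names.

```python
from urllib.parse import quote

# ASCII characters left untouched: URI delimiters and sub-delimiters
# (including the RFC 2732 brackets), plus "%" so existing escapes survive.
URI_SAFE = ":/?#[]@!$&'()*+,;=%"


def to_ascii_uri(uri_reference: str) -> str:
    """Percent-escape the UTF-8 octets of characters not allowed in a URI."""
    return quote(uri_reference, safe=URI_SAFE)


stored = "http://example.org/r\u00e9sum\u00e9#na\u00efve"  # kept as-is in the XML
resolved = to_ascii_uri(stored)                            # only at resolution time

assert resolved == "http://example.org/r%C3%A9sum%C3%A9#na%C3%AFve"
```

A value that is already pure ASCII passes through unchanged, so applying the conversion late costs nothing for existing URIs.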


> > - In 2.2.1, 'media type URI' is mentioned, but there is neither
> >    an explanation nor a reference. In addition, it would be good
> >    to check/explain that this can include parameters (such as
> >    charset).
>
>Ok, I'll say, now, "Other alternatives include 'content' of an element, or
>an external octet sequence that is identified by a media type URI [IANA],
>such as the example in Encrypting Arbitrary Data and XML Documents
>(section 2.1.4)." Presently, we have no provision for a charset. (The IANA
>directory does not provide URIs for these distinctions.) In xmldsig we have
>both a Type (to describe a higher level aspect, like it's a particular XML
>structure) and MIMEType/Encoding which I've tried to avoid here. However,
>if people feel we should have both here as well we could revert to the
>xmldsig approach or have some other way of describing the encoding -- are
>there URIs for the types of encoding?

I don't exactly understand why you need the type. If it's just
an arbitrary octet stream, then it's just that, isn't it?
The encryption doesn't need to know, or does it? If it's not
the encryption or decryption, but some other part at the other
end then the 'charset' parameter is about as important, or even
more important, than the mime type. There are definitely cases
where the mime type is easy to guess, but not the charset.

For the 'charset', IANA has a registry, too, but because registering
a charset is a bit more lightweight than registering a mime type,
there are no separate files for each charset. One possibility
might be to add the parameters as URI parameters, i.e.
http://www.isi.edu/in-notes/iana/assignments/media-types/text/plain;charset=iso-8859-1
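
A receiver could take such a URI apart roughly as follows. This is only a sketch of the possibility above, not an existing convention: the function name is hypothetical, and it assumes the media type always follows a "/media-types/" path segment as in the IANA directory URIs.

```python
from urllib.parse import urlsplit

MARKER = "/media-types/"  # assumed path segment preceding the media type


def parse_media_type_uri(uri: str):
    """Return (media_type, parameters) from a media type URI with parameters."""
    path = urlsplit(uri).path  # urlsplit keeps ';...' as part of the path
    _, _, tail = path.partition(MARKER)
    media_type, *raw_params = tail.split(";")
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        params[name.strip().lower()] = value.strip()
    return media_type, params


uri = ("http://www.isi.edu/in-notes/iana/assignments/media-types/"
       "text/plain;charset=iso-8859-1")
media_type, params = parse_media_type_uri(uri)

assert media_type == "text/plain"
assert params == {"charset": "iso-8859-1"}
```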

Also, the use of IANA URIs for identifying the types is in some
way definitely how things should be, but it's not really clearly
specified, it just says 'do the same thing as in the example'.


Regards,    Martin.
Received on Tuesday, 8 January 2002 03:51:33 UTC