Re: Last call comments on XML Encryption specs from Joseph Reagle on 2002-01-08 (xml-encryption@w3.org from January 2002)

From: Joseph Reagle <reagle@w3.org>
Date: Tue, 8 Jan 2002 18:18:33 -0500
To: Martin Duerst <duerst@w3.org>, dee3@torque.pothole.com, "Takeshi Imamura" <IMAMU@jp.ibm.com>
Cc: w3c-i18n-ig@w3.org, xml-encryption@w3.org
Message-Id: <200201082318.SAA17660@tux.w3.org>
[resulting document
  http://www.w3.org/Encryption/2001/Drafts/xmlenc-core/Overview.html
  new revision: 1.100;
]

On Tuesday 08 January 2002 03:45, Martin Duerst wrote:
> It is already addressed by the spec. But there is no corresponding
> requirement. While the requirement that one gets the same stuff
> back that one encrypted seems pretty straightforward, at least
> for character encoding issues, it might be easy to get wrong.

The question here is then what exactly is it that you are getting back? In 
the generic scenario above, you are getting back the same document (some 
context might've been lost) but definitely not the same octets.

> You can get rid of mentioning the infoset by saying that the
> following two sequences of operations must lead to the same
> document:
>
> a) 1) Encrypting part of the document
>     2) Transcoding it from one character encoding to another
>     3) Decrypting what was encrypted in step 1)
> b) 1) Encrypting part of the document
>     2) Decrypting what was encrypted in step 1)
>     3) Transcoding it from one character encoding to another

In this specific scenario, you are getting back the same octets.

So, how about if I add the following text to the requirements?

  If a document is (partially) encrypted, transcoded, and then
  decrypted, the resulting octets must be the same as if the document
  had been (partially) encrypted, decrypted, and then transcoded.
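
To make the proposed requirement concrete, here is a minimal Python sketch
of the two sequences. It is illustrative only: the XOR keystream is a toy
stand-in for a real cipher, <CipherValue> is used as a bare placeholder
rather than a full EncryptedData structure, and the names encrypt_part,
decrypt_part and transcode are hypothetical. It also assumes the document
carries no encoding declaration that would need rewriting.

```python
import base64
import hashlib
import re


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT secure; illustration only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_part(doc: str, secret: str, key: bytes) -> str:
    """Serialize the selected part in UTF-8, encrypt it, and embed it as base64."""
    cipher = keystream_xor(key, secret.encode("utf-8"))
    value = base64.b64encode(cipher).decode("ascii")
    return doc.replace(secret, "<CipherValue>" + value + "</CipherValue>", 1)


def decrypt_part(octets: bytes, encoding: str, key: bytes) -> bytes:
    """Decrypt to UTF-8 octets, then transcode into the document's own encoding."""
    doc = octets.decode(encoding)
    match = re.search(r"<CipherValue>([A-Za-z0-9+/=]+)</CipherValue>", doc)
    plain = keystream_xor(key, base64.b64decode(match.group(1))).decode("utf-8")
    return (doc[:match.start()] + plain + doc[match.end():]).encode(encoding)


def transcode(octets: bytes, source: str, target: str) -> bytes:
    return octets.decode(source).encode(target)


KEY = b"demo key"
DOC = "<note><to>Zo\u00eb</to><body>caf\u00e9 cr\u00e8me</body></note>"
SECRET = "<body>caf\u00e9 cr\u00e8me</body>"

encrypted = encrypt_part(DOC, SECRET, KEY).encode("iso-8859-1")

# a) encrypt, transcode, decrypt
path_a = decrypt_part(transcode(encrypted, "iso-8859-1", "utf-16-be"), "utf-16-be", KEY)
# b) encrypt, decrypt, transcode
path_b = transcode(decrypt_part(encrypted, "iso-8859-1", KEY), "iso-8859-1", "utf-16-be")

assert path_a == path_b == DOC.encode("utf-16-be")
```

The two paths only agree because the decryptor transcodes the UTF-8
plaintext into whatever encoding the surrounding document uses at the time
of decryption.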

> >In the spec we say on
> >encryption, "obtain the octets by serializing the data in UTF-8 as
> >specified in [XML]" and on decryption if the data was an XML element or
> >content it's UTF-8; if not, it's just octets. BTW: Did you have a chance to
> >look at:
> >   http://www.w3.org/Encryption/2001/Drafts/xmlenc-decrypt.html
> >There's a lot of XML/character processing involved there.
>
> I had a look at it. I haven't found anything about
> character encoding issues. Did I miss something?

No. It's just slightly tricky on decryption with: serialize the XML, wrap 
it in dummy nodes if not well formed, decrypt octets corresponding to the 
encrypted data, parse the decrypted octet stream, remove the wrapping. Just 
want to make sure we didn't miss anything!
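
As a rough illustration of the "wrap, parse, remove the wrapping" part of
that sequence, here is a small Python sketch. It assumes the decrypted
octets are UTF-8 element content, uses a hypothetical wrapper element named
"dummy", and ignores namespace context and the other details the decryption
transform draft deals with.

```python
import xml.etree.ElementTree as ET


def parse_decrypted_content(octets: bytes):
    """Parse decrypted element content that need not be a well-formed document."""
    text = octets.decode("utf-8")
    # Element content such as 'text <a/> more' has no single root element,
    # so wrap it in a dummy element before handing it to the parser.
    wrapper = ET.fromstring("<dummy>" + text + "</dummy>")
    # Remove the wrapping: keep only the leading text and the child nodes.
    return wrapper.text, list(wrapper)


leading_text, children = parse_decrypted_content(b"caf\xc3\xa9 <a>1</a> and <b/>")
print(repr(leading_text), [child.tag for child in children])
```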

> - This does not yet say anything about NFC when something being
>    encoded is serialized in UTF-8. In that case, it should say
>    that NFC MUST be applied when this involves conversion from
>    a legacy (i.e. non-Unicode) encoding.

Ok, 4.1 step 3.1 (Encrypt the Data) now says just that:

If the data is an 'element' [XML, section 3] or element 'content' [XML, 
section 3.1], obtain the octets by serializing the data in UTF-8 as 
specified in [XML]. ([NFC] MUST be applied when this involves conversion 
from a legacy (i.e. non-Unicode) encoding.) Serialization MAY be done by 
the encryptor. If the encryptor does not serialize, then the application 
MUST perform the serialization.
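
A minimal Python sketch of what that step could look like follows. The
function name serialize_utf8 and the UNICODE_ENCODINGS set are assumptions
made for the example, not anything the spec defines; cp1258 is used only
because it is a legacy encoding that carries combining marks.

```python
import unicodedata

# Assumption for this sketch: anything not listed here counts as "legacy".
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-16-le", "utf-16-be",
                     "utf-32", "utf-32-le", "utf-32-be"}


def serialize_utf8(octets: bytes, source_encoding: str) -> bytes:
    """Return UTF-8 octets, applying NFC only when converting from a legacy encoding."""
    text = octets.decode(source_encoding)
    if source_encoding.lower().replace("_", "-") not in UNICODE_ENCODINGS:
        text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


# 'a' followed by a combining acute accent, stored in the legacy cp1258 encoding.
legacy = "a\u0301".encode("cp1258")
print(serialize_utf8(legacy, "cp1258"))                     # composed form
print(serialize_utf8("a\u0301".encode("utf-8"), "utf-8"))   # left as it was
```

Data that is already in a Unicode encoding is passed through without
normalization here, which is how I read the parenthetical above.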

> >I agree. Takeshi (and others), do we want to say, "If the document into
> >which the replacement is occurring is not UTF-8, the decryptor MUST
> >transcode the UTF-8 encoded characters into the target encoding." ?
>
> This would be nice. Please add this text.

Ok, done.


> > > - There needs to be some text about security risks associated
> > >    with UTF-8. Assume that somebody knows that the encrypted
> > >    text is Old Italic (http://www.unicode.org/charts/PDF/U10300.pdf,
> > >    no spaces or punctuation). In this case, UTF-8 uses four bytes per
> > >    character, and three of them are always the same, and the top
> > >    two (or three if there are no numbers) bits of the last byte
> > >    are also always the same.
> I think that the Nonce can help quite a bit in some situations.
> But I'm not really sure at all that it will help much in the
> situation I have described. Let's assume the attacker knows
> that most of the encrypted text (rather than all) is in Old
> Italic. What you are saying is that if the non-Old Italic
> text is at the start of the data, attacks are much more
> difficult than if the non-Old Italic text is in the
> middle or at the end. This may indeed be true for attacks
> that are based on looking at the start of the encoded sequence.
> But there are most probably also attacks that can look at
> any part of the data and try to find out something about it.
> In other terms, the nonce doesn't really increase the entropy,
> it just conceals it.
> Of course, I'm not an expert here, but I'd rather be sure.

Ok, I will defer this to the crypto experts.
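
For anyone who wants to see the redundancy Martin describes, the following
Python sketch prints it. The code point ranges are my assumption of the Old
Italic letters (U+10300..U+1031E) and numerals (U+10320..U+10323) as charted
in Unicode 3.1; the sketch only inspects the UTF-8 octets and says nothing
about how exploitable the pattern is.

```python
LETTERS = range(0x10300, 0x1031F)    # U+10300..U+1031E (assumed range)
NUMERALS = range(0x10320, 0x10324)   # U+10320..U+10323 (assumed range)


def utf8(code_point: int) -> bytes:
    return chr(code_point).encode("utf-8")


encoded = [utf8(cp) for cp in list(LETTERS) + list(NUMERALS)]

lengths = {len(octets) for octets in encoded}             # octets per character
prefixes = {octets[:3] for octets in encoded}             # first three octets
top2_all = {octets[3] >> 6 for octets in encoded}         # top two bits of the last octet
top3_letters = {utf8(cp)[3] >> 5 for cp in LETTERS}       # top three bits, letters only

print(lengths, prefixes, top2_all, top3_letters)
```

The top two bits of the last octet are fixed for any UTF-8 continuation
octet; the third bit is fixed only when numerals are excluded.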

> >I'm not sure where we would do that. In xmldsig, we had a whole URI
> > section of the spec:
> >   http://www.w3.org/TR/2001/PR-xmldsig-core-20010820/#sec-URI
> >so it makes sense to specify this as completely as possible. However,
> > xenc makes more casual use of URIs with a few "this is like xmldsig
> > Reference processing."  So I can't find a place where those four
> > paragraphs on RFC2396+RFC2732+encoding_of_disallowed-characters, etc.
> > would fit.
>
> Because you use XML Schema, and XML Schema now does this when
> you use the anyURI datatype, you don't have to include the
> actual text.

Ok.

> >As you say, we're
> >using anyURI, so do we still need this text in every spec that uses a
> > URI?
>
> No, actually not. But I guess it would be better to mention at
> least once that what you mean by URI throughout the spec is
> not exactly the same as the URI in RFC 2396.
>
> Actually, I just went through the whole spec looking for the
> string 'URI'. It turns up many times, but it's never defined.
> There is a reference with a label 'URI' (RFC 2396), but it's
> never actually referenced in the text. Also, RFC 2732 is never
> referenced.

Ok. I've added to section 1.3, "Versions, Namespaces, URIs, and Identifiers"

  Finally, URIs [URI] MUST abide by the [XML-Schema] anyURI type
  definition and the [XML-DSIG, 4.3.3.1 The URI Attribute] specification
  (i.e., permitted characters, character escaping, scheme support, etc.).

I'm not inclined to reference RFC2732; I'm not sure it's needed unless 
we wanted to make sure all W3C specs treat RFC2732 as MANDATORY? 
Otherwise, I'd prefer to keep it "orthogonal."
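
As a hedged illustration of the "character escaping" part, here is a Python
sketch of the kind of escaping the anyURI definition points to: disallowed
characters are converted to UTF-8 and each octet is written as %HH. The
SAFE set below is my reading of the RFC 2396 reserved and unreserved
characters plus '#', '%' and the RFC 2732 square brackets; treat it as an
assumption rather than a normative list. The function name escape_uri is
hypothetical.

```python
from urllib.parse import quote

# Characters left untouched (letters and digits are never escaped by quote()).
SAFE = ";/?:@&=+$," + "-_.!~*'()" + "#%" + "[]"


def escape_uri(value: str) -> str:
    """Percent-escape disallowed characters using their UTF-8 octets."""
    return quote(value, safe=SAFE)


print(escape_uri("http://example.org/r\u00e9sum\u00e9 file.xml#frag"))
```

Because '%' is in the safe set, a value that is already escaped passes
through unchanged.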


> > > - In 2.2.1, 'media type URI' is mentioned, but there is neither
> > >    an explanation nor a reference. In addition, it would be good
> > >    to check/explain that this can include parameters (such as
> > >    charset).
> >
> >Ok, I'll say, now, "Other alternatives include 'content' of an element,
> > or an external octet sequence that is identified by a media type URI
> > [IANA], such as the example in Encrypting Arbitrary Data and XML
> > Documents (section 2.1.4)." Presently, we have no provision for a
> > charset. (The IANA directory does not provide URIs for these
> > distinctions.) In xmldsig we have both a Type (to describe a higher
> > level aspect, like it's a particular XML structure) and
> > MIMEType/Encoding which I've tried to avoid here. However, if people
> > feel we should have both here as well we could revert to the xmldsig
> > approach or have some other way of describing the encoding -- are there
> > URIs for the types of encoding?
>
> I don't exactly understand why you need the type. If it's just
> an arbitrary octet stream, then it's just that, or isn't it.

Presently, while we do make a distinction between an Element and its 
Content, we're not making a great deal of use of this from the processing 
point of view. However, it is useful in the prose, and I feel that this 
Type will be of use in the future.

> For the 'charset', IANA has a registry, too, but because registering
> a charset is a bit more lightweight than registering a mime type,
> there are no separate files for each charset. One possibility
> might be to add the parameters as URI parameters, i.e.
> http://www.isi.edu/in-notes/iana/assignments/media-types/text/plain;charset=iso-8859-1

Ok, we'll take this under consideration as another option over creating a 
Charset attribute. Are there other parameters that we MUST account for 
beyond the charset? Do we need some extensible mechanism?

> Also, the use of IANA URIs for identifying the types is in some
> way definitely how things should be, but it's not really clearly
> specified, it just says 'do the same thing as in the example'.

Also made more explicit in section 1.3:

  Finally, this specification identifies IANA registered media-types
  via the set of URIs that match:
  'http://www.isi.edu/in-notes/iana/assignments/media-types/*/*'.
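
Here is a small Python sketch of how an application might recognize such
URIs. The function name parse_media_type_uri is hypothetical, and the
handling of ';name=value' parameters follows Martin's charset suggestion
above, which is only under consideration and not something the spec
currently provides for.

```python
import re

PREFIX = "http://www.isi.edu/in-notes/iana/assignments/media-types/"
# type "/" subtype, optionally followed by ;name=value parameters (an assumption).
PATTERN = re.compile(re.escape(PREFIX) + r"([^/;]+)/([^/;]+)((?:;[^;=]+=[^;]*)*)")


def parse_media_type_uri(uri: str):
    """Return (type, subtype, parameters), or None if the URI does not match."""
    match = PATTERN.fullmatch(uri)
    if match is None:
        return None
    parameters = {}
    for part in match.group(3).split(";"):
        if part:
            name, _, value = part.partition("=")
            parameters[name.strip().lower()] = value.strip()
    return match.group(1), match.group(2), parameters


print(parse_media_type_uri(PREFIX + "text/plain"))
print(parse_media_type_uri(PREFIX + "text/plain;charset=iso-8859-1"))
print(parse_media_type_uri("http://example.org/text/plain"))
```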

-- 

Joseph Reagle Jr.                 http://www.w3.org/People/Reagle/
W3C Policy Analyst                mailto:reagle@w3.org
IETF/W3C XML-Signature Co-Chair   http://www.w3.org/Signature/
W3C XML Encryption Chair          http://www.w3.org/Encryption/2001/
Received on Tuesday, 8 January 2002 18:18:48 UTC