Re: Again, confusing 8.1

From: Jose Kahan <jose.kahan@w3.org>
Date: Fri, 17 Dec 2004 19:49:49 +0100
To: Ed Simon <edsimon@xmlsec.com>
Cc: www-xkms@w3.org
Message-ID: <20041217184949.GN1217@inrialpes.fr>
Hi Ed,

On Fri, Dec 17, 2004 at 10:28:03AM -0500, Ed Simon wrote:
> I generally agree with Jose and Guillermo's recommendations EXCEPT for the
> one about filtering UTF-8 characters outside the ASCII32-127.  Unless, there
> is a verifiable case to be made for disallowing non-Latin characters (eg.
> Korean pass phrases) I would not include that possibility.  Ultimately, the
> pass phrase is just '1's and '0's and all we are doing is saying how a
> human-readable/writable phrase can be consistently converted into binary;
> that MAY not always mean the end device has to understand Unicode, just
> binary.  (I say MAY because I'm not a mobile device expert, I just want
> someone who is to say non-ASCII is a problem before we try to accommodate
> it.)

I agree with you that we should try to support all characters.  In fact,
after re-reading this part of the spec, I am confused as to how this
shared string will be used and, that being so, I think that my previous
message was also confusing.

If the goal of 8.1 is giving some advice about how to make shared 
strings that can be read over the phone, I didn't understand it as such
and more text should be added to explain this, or even a different

If 8.1 is giving an algorithm for converting a shared string (regardless
of its content) into a key, using one-way functions (i.e., we won't be 
able to find the shared string from this key), then we can use any 
algorithm to generate it.  This can include removing spaces, 
punctuation, control characters, if any of these represent some
cryptographic weakness (I'll let Stephen comment on this)...  As you 
pointed out, the shared string is just a sequence of bytes, so it's not really 
important how they are converted. The result is being coded into BASE
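
A minimal sketch of the idea in this paragraph, not the algorithm from
the spec: the shared string is turned into bytes and passed through a
one-way function. HMAC-SHA256, the fixed label, the function name
derive_key and the Base64 text output are all assumptions chosen for
illustration.

```python
import base64
import hashlib
import hmac


def derive_key(shared_string: str, label: bytes = b"example-label") -> str:
    """Derive a printable key from a shared string (illustrative only)."""
    # The shared string is just a sequence of bytes once encoded.
    data = shared_string.encode("utf-8")
    # Assumption: HMAC-SHA256 stands in for the spec's one-way function.
    digest = hmac.new(label, data, hashlib.sha256).digest()
    # Assumption: Base64 is used here to make the result printable.
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    print(derive_key("correct horse battery staple"))
```

The same input always yields the same key, and the shared string cannot
be read back from the output.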

If, when converting the shared string into a key, there is a risk that
the string has to be processed by tools that may not understand
non-ASCII charsets or control characters, it does make sense to me to
remove control characters and translate the string into UTF-8 first.
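
The preprocessing described above could be sketched as follows. The
function name, the idea of passing the local charset as a parameter, and
the choice of Unicode category "Cc" as the definition of "control
character" are assumptions of this sketch, not text from the spec.

```python
import unicodedata


def prepare_shared_string(raw: bytes, local_charset: str) -> bytes:
    """Decode from a local charset, drop control characters, emit UTF-8."""
    text = raw.decode(local_charset)
    # Assumption: "control characters" means Unicode general category Cc
    # (for example tab, newline, NUL).
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cc")
    return cleaned.encode("utf-8")


if __name__ == "__main__":
    # "caf\xe9" is "café" in ISO 8859-1; the tab is removed.
    print(prepare_shared_string(b"caf\xe9\t", "latin-1"))
```

Non-Latin characters pass through unchanged; only control characters are
removed.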

I'll ask my colleagues to comment on tools having problems handling
non-UTF-8 characters (which I think can happen).

I'll also ask my colleagues to comment on handling of UTF-8 in mobile
devices, just to be on the safe side. If we can assume that those
devices always work with UTF-8 or know how to translate their local
charset into UTF-8, I think we would be all set.

> I would drop mention of "XML Encoding" and call it "UTF-8" encoding; not
> only do I think this is sensible from the outset but it also gets rid of
> trying to process XMLese like entities etc.  I confess that I have one
> question which is I am not absolutely sure (eg. due to combining sequences)
> there is always one and only one binary representation for every unique
> UTF-8-encoded pass phrase; Jose, can you verify that with a W3C UTF-8
> expert.  A follow-up question would be whether we could use rules to
> canonicalise the UTF-8 (eg. do not use combining characters) if there is
> more than one binary representation.

I'm not sure if I understand what you mean by combining sequences in
this context.

Let me first quote a slightly modified man 7 utf-8:

The Unicode 3.0 character set occupies a 16-bit code space.  UTF-8 is
an encoding that allows characters to be represented as a sequence of
octets, which can then be, e.g., transmitted by mail without problems.
Unicode characters from 0 to 127 are encoded directly, without any
changes.

Unicode characters beyond 127 are encoded using a multibyte sequence
consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte
(0x00 to 0x7f) can appear as part of another character and there are
no problems with, e.g., '\0' or '/'.

Every character in Unicode space can be represented by one or more
bytes in UTF-8. There can be no collision between different characters,
provided you keep all the bits. And you can combine characters that come
from different charsets. If you code a pass-phrase into UTF-8,
you should be able to decode each character back into its original charset.
(I'm not sure if using charset here is the best term... maybe glyph
would be better).
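
The properties quoted above can be checked with a short sketch. The
helper names and the sample characters are illustrative choices, not
something taken from the spec or the man page.

```python
def utf8_octets(text: str) -> bytes:
    """Return the UTF-8 octets for a string."""
    return text.encode("utf-8")


def has_ascii_byte(octets: bytes) -> bool:
    """True if any octet falls in the ASCII range 0x00 to 0x7f."""
    return any(b < 0x80 for b in octets)


if __name__ == "__main__":
    for ch in ["A", "/", "\u00e9", "\ud55c"]:
        octets = utf8_octets(ch)
        # Decoding the octets gives back the original character.
        assert octets.decode("utf-8") == ch
        print(repr(ch), octets.hex(), has_ascii_byte(octets))
```

ASCII characters come out as a single unchanged byte, while characters
beyond 127 produce only bytes of 0x80 and above, and every sequence
decodes back to the character it came from.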

If this is what you refer to as combining sequences (of different
characters), there's no problem doing so in UTF-8.
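
If "combining sequences" refers instead to Unicode combining characters
(an assumption, since the thread does not settle the meaning), the case
can be checked with the sketch below. It compares a precomposed
character with its base-plus-combining-mark form, and uses NFC
normalization as one possible canonicalisation rule; the helper name
utf8_nfc is hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def utf8_nfc(text: str) -> bytes:
    """Normalize to NFC before encoding (one possible canonical form)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


if __name__ == "__main__":
    # The two forms look the same but encode to different octets.
    print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
    # After normalization both yield the same octets.
    print(utf8_nfc(precomposed).hex(), utf8_nfc(decomposed).hex())
```

Under this reading, two visually identical pass phrases can have
different UTF-8 octets unless both sides normalize before encoding.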


