RE: Again, confusing 8.1

Hello,

I had a conversation about entering and encoding passwords with a
co-worker, Addison Phillips.  He sent me the following . . .

-- Tim

Tim Bond
Security Architect
webMethods, Inc.



Hi Tim,

Of course passwords have I18N implications. Historically, most systems have
limited the use of non-ASCII characters in passwords.

Basically, languages like Japanese that use an IME (input method editor) to
compose characters from input won't work well for passwords. That's because
IMEs expose the characters in the clear (otherwise you can't choose the
right character from a list). This is bad, so most systems disable IMEs in
their password fields by default.

However, any direct input characters (which may be nearly anything in
Unicode) can be input directly from some keyboard somewhere. These can go
into a password field and require special considerations to support
reliably.

The problem is the relationship of the character string the user typed to
the byte array that is the security token. In a distributed system where the
login is presented on a device remote from the authenticating system, the
conversion from keycodes to bytes is usually done using the local default
encoding. If this is different from the character encoding used by the
authenticating system the login fails (mystifying the user). Browser
Basic-Auth is an egregious example of this.
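A minimal Python sketch of the mismatch just described (the password and
the two encodings are illustrative assumptions, not taken from any spec):

```python
# The client converts keystrokes to bytes with its local default
# encoding; the authenticating system assumes a different one.
password = "café"

# A client whose local default is Latin-1 sends these bytes:
client_bytes = password.encode("latin-1")

# A server expecting UTF-8 either fails to decode them or
# reconstructs a different string -- either way, the login fails.
try:
    seen = client_bytes.decode("utf-8")
except UnicodeDecodeError:
    seen = None

assert seen != password                        # the two ends disagree
assert client_bytes != password.encode("utf-8")
```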

Net result: it makes sense to define a single universal character encoding
for use on both ends. UTF-8 is a good choice for this because it is compact
for ASCII passwords (the majority of passwords everywhere), yet can
represent any non-ASCII value and does so in a predictable manner. It is a
variable-width encoding (up to 4 bytes per character), which may be a
factor for systems that limit the buffer length of the token.
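As a rough illustration of the buffer-length point (the pass phrases below
are invented examples):

```python
# UTF-8 is one byte per character for ASCII, but 2-4 bytes per
# character otherwise -- relevant when the token buffer is limited.
ascii_pw = "correct horse"   # 13 characters
jp_pw = "パスワード"           # 5 characters ("password" in Japanese)

assert len(ascii_pw.encode("utf-8")) == 13   # 1 byte per character
assert len(jp_pw.encode("utf-8")) == 15      # 3 bytes per character here
```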

The resulting token can be stored in a Transfer Encoding such as Base64. If
you wish to store passwords in the clear (why?), then XML provides
mechanisms for encoding characters either directly or as entity values. The
Character Model for the World Wide Web discusses this ad nauseam
(http://www.w3.org/TR/charmod). Transfer Encodings encode bytes; the
problem is knowing how to interpret bytes outside the ASCII range, or on
non-ASCII systems (EBCDIC). Using a single, defined character encoding is
the way to go for this.
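A minimal sketch of that token path, assuming UTF-8 as the agreed encoding
(the phrase is an invented example):

```python
import base64

phrase = "motörhead"

# Both ends agree on UTF-8, so the byte array is unambiguous:
token_bytes = phrase.encode("utf-8")

# Base64 is a transfer encoding over bytes -- it says nothing about
# what the bytes mean, which is why the character encoding must be
# fixed separately.
token_b64 = base64.b64encode(token_bytes).decode("ascii")

# The receiver reverses both layers and gets the original phrase back:
assert base64.b64decode(token_b64).decode("utf-8") == phrase
```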

You might want to refer the folks below to the W3C I18N WG and the
www-international mailing list. The unicode mailing list
(unicode@unicode.org) is another good resource.

Hope this helps. Please feel free to forward this message if it helps.

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

-----Original Message-----
From: www-xkms-request@w3.org [mailto:www-xkms-request@w3.org] On Behalf Of
Jose Kahan
Sent: Friday, December 17, 2004 1:50 PM
To: Ed Simon
Cc: www-xkms@w3.org
Subject: Re: Again, confusing 8.1

Hi Ed,

On Fri, Dec 17, 2004 at 10:28:03AM -0500, Ed Simon wrote:
> 
> I generally agree with Jose and Guillermo's recommendations EXCEPT for
> the one about filtering UTF-8 characters outside the ASCII 32-127 range.
> Unless there is a verifiable case to be made for disallowing non-Latin
> characters (eg. Korean pass phrases) I would not include that
> possibility.  Ultimately, the pass phrase is just '1's and '0's and all
> we are doing is saying how a human-readable/writable phrase can be
> consistently converted into binary; that MAY not always mean the end
> device has to understand Unicode, just binary.  (I say MAY because I'm
> not a mobile device expert, I just want someone who is to say non-ASCII
> is a problem before we try to accommodate it.)

I agree with you that we should try to support all characters.  In fact,
after re-reading this part of the spec, I am confused about how this
shared string will be used and, in that light, I think my previous
message was also confusing.

If the goal of 8.1 is to give advice on how to make shared strings that
can be read over the phone, I didn't understand it as such; more text
should be added to explain this, or even a separate subsection.

If 8.1 is giving an algorithm for converting a shared string (regardless
of its content) into a key using one-way functions (i.e., we won't be
able to find the shared string from this key), then we can use any
algorithm to generate it.  This could include removing spaces,
punctuation, or control characters, if any of these represent some
cryptographic weakness (I'll let Stephen comment on this)...  As you
pointed out, the shared string is just a sequence of bytes, so it's not
really important how they are converted. The result is then encoded in
Base64.
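A Python sketch of that shape, using SHA-256 as a stand-in one-way
function (the actual derivation is whatever the spec defines, not
necessarily this):

```python
import base64
import hashlib

def shared_string_to_key(shared: str) -> str:
    """One-way derivation: the key cannot be turned back into the string."""
    data = shared.encode("utf-8")            # fix the encoding first
    digest = hashlib.sha256(data).digest()   # stand-in one-way function
    return base64.b64encode(digest).decode("ascii")

# Deterministic: both ends derive the same key from the same string.
assert shared_string_to_key("open sesame") == shared_string_to_key("open sesame")
```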

If, when converting the shared string into a key, there is a risk that
the string has to be processed by tools that may not understand
non-ASCII charsets or control characters, it does make sense to me to
remove control characters and translate the string into UTF-8 first.

I'll ask my colleagues to comment on tools having problems handling
non-UTF-8 characters (which I think can happen).

I'll also ask my colleagues to comment on handling of UTF-8 in mobile
devices, just to be on the safe side. If we can assume that those
devices always work with UTF-8 or know how to translate their local
charset into UTF-8, I think we would be all set.
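On the device side, the required step is just a transcode before key
derivation; a Python sketch (EUC-KR is an invented example of a device's
local charset, and the phrase is made up):

```python
phrase = "비밀"                          # an example Korean pass phrase

local_bytes = phrase.encode("euc-kr")   # what a legacy device might hold
utf8_bytes = phrase.encode("utf-8")     # what the protocol should use

# The byte arrays differ, so hashing the local bytes directly would
# produce a different key on each end:
assert local_bytes != utf8_bytes

# Translating local charset -> UTF-8 first makes both ends agree:
assert local_bytes.decode("euc-kr").encode("utf-8") == utf8_bytes
```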

> I would drop mention of "XML Encoding" and call it "UTF-8" encoding; not
> only do I think this is sensible from the outset but it also gets rid of
> trying to process XMLese like entities etc.  I confess that I have one
> question which is I am not absolutely sure (eg. due to combining
> sequences) there is always one and only one binary representation for
> every unique UTF-8-encoded pass phrase; Jose, can you verify that with a
> W3C UTF-8 expert.  A follow-up question would be whether we could use
> rules to canonicalise the UTF-8 (eg. do not use combining characters) if
> there is more than one binary representation.

I'm not sure if I understand what you mean by combining sequences in
this context.

Let me quote first a slightly modified man 7 utf-8:

<quote>
The Unicode 3.0 character set occupies a 16-bit code space. UTF-8 is an
encoding that represents characters as a sequence of octets, which can
then be, e.g., transmitted in mail without problems. Unicode characters
from 0 to 127 are encoded directly, without changes.

Unicode characters beyond 127 are encoded using a multibyte sequence
consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte
can appear as part of another character and there are no problems with,
e.g., '\0' or '/'.
</quote>

Every character in the Unicode space can be represented by one or more
bytes in UTF-8. There can be no collision between different characters,
provided you keep all the bits. And you can combine characters that come
from different charsets. If you encode a pass phrase into UTF-8, you
should be able to decode each character back into its original charset.
(I'm not sure "charset" is the best term here... maybe "glyph" would be
better.)

If this is what you mean by combining sequences (of different
characters), there's no problem doing so in UTF-8.
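A small Python check of both halves of this (the example characters are
mine, not from the thread): UTF-8 encoding round-trips exactly, but
combining sequences do allow more than one byte representation of the
same visible text, which is why Ed's canonicalisation idea is relevant:

```python
import unicodedata

# UTF-8 itself round-trips exactly...
assert "é".encode("utf-8").decode("utf-8") == "é"

# ...but the same visible text can be spelled two ways in Unicode:
# precomposed U+00E9 vs. 'e' plus the combining acute accent U+0301.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

# Canonicalising (e.g. to Unicode NFC) before encoding removes the
# ambiguity, as Ed's follow-up question suggests:
assert unicodedata.normalize("NFC", decomposed) == precomposed
```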

-jose

Received on Tuesday, 21 December 2004 03:21:27 UTC