RE: RFC 2617 Authentication and character sets revisited

Yngve Nysaeter Pettersen [mailto:yngve@opera.com] wrote:

> I think a clear specification is needed, and I also think we need to 
> define the input values of both authentications methods such that the 
> process is unambiguous. That means that either the client must be
> able to tell the server which character set and encoding it is using 
> (RFC 2047 or a charset attribute), or the character set and encoding
> have to be fixed by the protocol.

In this case, Unicode is the character set, and UTF-8 is the encoding.

But your earlier comments reminded me of something: it can be
more complicated than that.

For example, let's consider a username like "Åke". If you simply
specify UTF-8 as the encoding, you can still run into problems.
If the client represents the initial character as U+00C5, but the
server has it stored as U+0041 U+030A (both valid Unicode
representations of "Å"), then you'll end up hashing differently.
The same, of course, applies to passwords.
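
To make the mismatch concrete, here is a minimal Python sketch. It
assumes MD5 as the hash (as in RFC 2617 Digest); the helper name
md5_hex is purely illustrative and not part of any specification.

```python
import hashlib

# Two valid Unicode spellings of the same username.
precomposed = "\u00c5ke"   # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030ake"   # U+0041 followed by U+030A COMBINING RING ABOVE


def md5_hex(text: str) -> str:
    """Hypothetical helper: MD5 over the UTF-8 encoding of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# The UTF-8 byte sequences differ, so the hashes differ as well.
utf8_precomposed = precomposed.encode("utf-8")   # 4 bytes
utf8_decomposed = decomposed.encode("utf-8")     # 5 bytes
hashes_match = md5_hex(precomposed) == md5_hex(decomposed)   # False
```

Both strings render identically, yet fixing the encoding to UTF-8
alone does not make the hash inputs agree.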

Fortunately, Unicode also defines normalization forms that
allow one to ensure a consistent representation; see Unicode
Standard Annex #15 (http://www.unicode.org/reports/tr15/). I think
it's pretty clear that, for the purposes of calculating
authentication, we'll want to use one of the compatibility
normalizations (KC or KD). I believe that KD requires less
processing (KC is KD followed by a recomposition step), so I
would tend to favor it over KC.

So, in the spirit of sending text:

   The passwd value SHOULD be normalized according to Unicode
   Normalization Form KD [ref], and encoded using UTF-8 [ref]
   for input to the hash. (Note that characters in the range
   of U+0000 to U+007F are left unaffected by Unicode
   normalization.)
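
As a rough illustration of the text above, the following Python
sketch normalizes to Form KD and encodes as UTF-8 before hashing.
The function names hash_input and digest_ha1 are hypothetical, and
two things here are my assumptions rather than anything settled:
applying the same treatment to the username, and encoding the realm
as UTF-8 without normalizing it. The A1 layout
(username ":" realm ":" passwd) and MD5 follow RFC 2617.

```python
import hashlib
import unicodedata


def hash_input(value: str) -> bytes:
    """Normalize to Unicode Normalization Form KD, then encode as UTF-8."""
    return unicodedata.normalize("NFKD", value).encode("utf-8")


def digest_ha1(username: str, realm: str, passwd: str) -> str:
    """Hypothetical helper: H(A1) for RFC 2617 Digest with MD5."""
    # Assumption: username gets the same handling as passwd;
    # realm is only UTF-8 encoded, not normalized.
    a1 = b":".join(
        [hash_input(username), realm.encode("utf-8"), hash_input(passwd)]
    )
    return hashlib.md5(a1).hexdigest()


# Precomposed and decomposed spellings now produce the same hash input.
same_bytes = hash_input("\u00c5ke") == hash_input("A\u030ake")   # True

# Pure ASCII values pass through unchanged.
ascii_bytes = hash_input("secret")

# Both spellings of the username give the same H(A1).
ha1_a = digest_ha1("\u00c5ke", "example", "secret")
ha1_b = digest_ha1("A\u030ake", "example", "secret")
```

Note that Form KD also folds compatibility characters, so a
ligature such as U+FB01 hashes the same as the two letters "fi".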

Presumably, the same text (with a tweak or two) can be used to
specify username handling.

/a

Received on Monday, 1 December 2003 09:34:40 UTC