RE: RFC 2617 Authentication and character sets revisited

Yngve Nysaeter Pettersen [mailto:yngve@opera.com] wrote:

> I think a clear specification is needed, and I also think we need to 
> define the input values of both authentications methods such that the 
> process is unambiguous. That means that either the client must be
> able to tell the server which character set and encoding it is using 
> (RFC 2047 or a charset attribute), or the character set and encoding
> have to be fixed by the protocol.

In this case, Unicode is the character set, and UTF-8 is the encoding.

But your earlier comments reminded me of something: it can be
more complicated than that.

For example, let's consider a username like "Åke". If you simply
specify UTF-8 as the encoding, you can still run into problems.
If the client represents the initial character as U+00C5, but the
server has it stored as U+0041 U+030A (both valid unicode
representations of "Å"), then you'll end up hashing differently.
The same, of course, applies to passwords.

Fortunately, Unicode also defines normalization techniques that
allow one to ensure a consisitant representation; see annex 15
(http://www.unicode.org/reports/tr15/). I think it's pretty clear
that, for the purposes of calculating authentication, we'll want
to use one of the compatibility normalizations (KC or KD). I
beleive that KD requires less processing, so I would tend to
favor it over KC.

So, in the spirit of sending text:

   The passwd value SHOULD be normalized according to Unicode
   Normalization Form KD [ref], and encoded using UTF-8 [ref]
   for input to the hash. (Note that characters in the range
   of U+0000 to U+007F are left unaffected by Unicode
   normalization.)

Presumably, the same text (with a tweak or two) can be used to
specify username handling.

/a

Received on Monday, 1 December 2003 09:34:40 UTC