Re: RFC 2617 Authentication and character sets revisited

On Wed, 26 Nov 2003 15:23:17 -0500, Scott Lawrence <scott-http@skrb.org> 
wrote:
> I don't think I understand your example.  If my server gets a
> user="foo" and 'foo' does not appear in my database of valid users,
> then authentication has failed.

I'm not a character set expert, so I don't have any Japanese or Chinese 
examples handy, but I know that Japanese systems are using several 
different character sets.

But let us use an extreme (and unrealistic) example: Let's assume that the 
client is using US-ASCII as the default character set, while the server is 
using EBCDIC.

The username "foo" and the associated password are entered on the console 
of the server machine when the account is registered. This means that the 
username and password are represented to the server using EBCDIC character 
codes, not US-ASCII.

When the client is creating the credentials it will be using US-ASCII as 
the character set, instead of EBCDIC.

The binary representation (in C-style hex) of "foo" in US-ASCII is <0x66 
0x6F 0x6F>, while it is <0x86 0x96 0x96> in EBCDIC.

Unless the server explicitly converts the received username from US-ASCII 
to EBCDIC (or converts its stored EBCDIC version to US-ASCII) before 
comparing them, the server will not be able to get a match, despite the 
fact that the user entered "foo" both when registering and when 
authenticating.
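
To make the mismatch concrete, here is a small Python sketch of the 
scenario above. It assumes code page 037 (Python's "cp037" codec) as the 
EBCDIC variant, and the lookup() helper and the one-entry database are 
made up for illustration; no real server works exactly like this.

```python
# Sketch: the same username yields different bytes under different charsets.
# Assumption: cp037 stands in for "EBCDIC" (there are many EBCDIC variants).

def lookup(received: bytes, database: set) -> bool:
    """Naive byte-for-byte match, with no charset conversion (hypothetical)."""
    return received in database

username = "foo"
ascii_bytes = username.encode("ascii")    # what the client sends
ebcdic_bytes = username.encode("cp037")   # what the server stored at registration

database = {ebcdic_bytes}

# Comparing the raw bytes fails even though the user typed "foo" both times.
naive_match = lookup(ascii_bytes, database)

# Converting the received bytes to the server's charset first gives a match.
converted_match = lookup(ascii_bytes.decode("ascii").encode("cp037"), database)
```

The point of the explicit conversion step is that both sides must know 
which character set the other one used; the bytes alone do not say.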

That was phase 1; now for phase 2: Replace "US-ASCII" with one of the 
Japanese character sets, e.g. Shift-JIS, and "EBCDIC" with one of the other 
Japanese character sets, e.g. EUC-JP, then use a Japanese username and 
repeat the above procedure.
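
Phase 2 can be sketched the same way. The username below is a made-up 
two-character Japanese name (written with escapes so this mail stays 
ASCII), and I am assuming Python's "shift_jis" and "euc_jp" codecs as 
stand-ins for the two systems' character sets.

```python
# Sketch: one Japanese username, three different byte sequences.
username = "\u5c71\u7530"  # hypothetical username, two kanji

sjis = username.encode("shift_jis")   # e.g. what the client produces
eucjp = username.encode("euc_jp")     # e.g. what the server has stored
utf8 = username.encode("utf-8")       # the proposed common representation

# The bytes differ, although both decode to the same characters.
same_bytes = sjis == eucjp
same_text = sjis.decode("shift_jis") == eucjp.decode("euc_jp")
```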

My point is that you cannot guarantee that all steps of the authentication 
process, including the registration process, on both the client and server 
side result in the *same* binary representation of a national character, 
unless the specification clearly states which binary representation is 
to be used. And in an international environment like the one HTTP is used 
in, the best binary representation of a string of national characters is 
the 8-bit encoding of Unicode, UTF-8.
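
The same ambiguity shows up in Basic authentication, where the credentials 
are just base64("user:password"). A rough sketch, assuming a hypothetical 
username with one non-ASCII letter and a helper name of my own invention:

```python
import base64

def basic_credentials(user: str, password: str, charset: str) -> str:
    """Build a Basic Authorization value from bytes in the given charset."""
    token = ("%s:%s" % (user, password)).encode(charset)
    return "Basic " + base64.b64encode(token).decode("ascii")

user, password = "J\u00f8rgen", "secret"  # hypothetical credentials

latin1 = basic_credentials(user, password, "iso-8859-1")
utf8 = basic_credentials(user, password, "utf-8")

# Same user, same password, two different header values. The server can
# only decode reliably if the specification fixes one charset.
headers_differ = latin1 != utf8
```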

> The username value is already covered by the existing rule for TEXT:

AFAICT (from a quick look) Apache 2.0 is not able to parse an RFC 2047 
encoded parameter. (Oh, and BTW: the RFC 2047 encoding does not have a very 
good syntax for parameters, e.g. name==?a?Q?value?= ; it is not without 
reason that it has been updated by RFC 2231.)

AFAIK nobody is using the RFC 2047 encoding, especially not for 
authentication. Feel free to correct me if I am wrong.

Assuming that UTF-8 is not mandated for the Basic username and password and 
the Digest username, I would suggest that RFC 2231 encoding be recommended 
for the Digest username instead of RFC 2047, as RFC 2231 is better suited 
for encoding parameters, and that this be clearly stated in the RFC.
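
For comparison, this is roughly what the two encodings look like for the 
same value, using the helpers in Python's standard "email" package. The 
username is a made-up example, and the "username*=" parameter form is only 
my illustration of the RFC 2231 extended syntax, not something the current 
Digest specification defines.

```python
from email.header import Header, decode_header
from email.utils import encode_rfc2231, decode_rfc2231
from urllib.parse import unquote

username = "\u5c71\u7530"  # hypothetical non-ASCII username

# RFC 2047 encoded-word: =?charset?encoding?text?=
rfc2047 = Header(username, "utf-8").encode()
# RFC 2231 extended value: charset'language'percent-encoded-text
rfc2231 = encode_rfc2231(username, charset="utf-8")

# How each would sit inside a parameter (illustrative only).
param_2047 = 'username="%s"' % rfc2047
param_2231 = "username*=%s" % rfc2231

# Both forms carry the charset, so the receiver can decode them.
raw, charset_2047 = decode_header(rfc2047)[0]
back_2047 = raw.decode(charset_2047)

charset_2231, _language, quoted = decode_rfc2231(rfc2231)
back_2231 = unquote(quoted, encoding=charset_2231)
```

Note that either encoding only tells the receiver how to decode the value; 
it does not by itself say which bytes go into the digest calculation.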

However, the problem of which binary representation is used in the 
calculations MUST also be addressed (should the encoded or the decoded 
version of the credentials be used, and should they be converted to a 
common character set, if possible?). Not mandating UTF-8 will just move 
the problem around.
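
To show why this matters for Digest, here is a minimal sketch of the 
H(A1) = MD5(username ":" realm ":" password) step from RFC 2617, computed 
over the same credentials in three character sets. The realm, password and 
username are invented, and h_a1() is just a name I picked for the helper.

```python
import hashlib

def h_a1(username: str, realm: str, password: str, charset: str) -> str:
    """MD5 over "username:realm:password", with bytes chosen by charset."""
    a1 = "%s:%s:%s" % (username, realm, password)
    return hashlib.md5(a1.encode(charset)).hexdigest()

# Hypothetical credentials; only the username is non-ASCII.
user, realm, pw = "\u5c71\u7530", "example", "secret"

digests = {
    cs: h_a1(user, realm, pw, cs)
    for cs in ("shift_jis", "euc_jp", "utf-8")
}

# Three charsets, three different hashes for the "same" credentials, so a
# client and server that disagree on the charset can never match.
distinct = len(set(digests.values()))
```

For a pure US-ASCII username all three would agree, which is probably why 
the problem has gone unnoticed for so long.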

Come to think of it: Perhaps the *TEXT rule in RFC 2616 sec 2.2 should be 
updated to mandate UTF-8 instead of iso-8859-1? But that is probably too 
big a change to do at this time.

-- 
Sincerely,
Yngve N. Pettersen
 
********************************************************************
Senior Developer                     Email: yngve@opera.com
Opera Software ASA                   http://www.opera.com/
Phone:  +47 24 16 42 60              Fax:    +47 24 16 40 01
******************************************************************** 

Received on Saturday, 29 November 2003 11:27:32 UTC