Re: RFC 2617: Which character set should be used?

Hi Scott,

[note: I corrected the subject]

On 16 Apr 2003 08:20:15 -0400, Scott Lawrence <scott-http@skrb.org> wrote:
>
> Yngve Nysaeter Pettersen <yngve@opera.com> writes:
>
>> My suggestion is that UTF-8 is selected as the character set used to
>> encode the username and password values when creating the "user-pass"
>> string (sec. 2) and the "username-value" and "passwd" strings in
>> sec. 3.2.2. It might also be an idea to specify the same for other
>> text attributes as well.
>
> I just took a look at the spec to try to come up with specific
> language for this.
>
> Section 3.2.2.2 A1 add:
>
> The passwd value used should be encoded using UTF-8.
>
> I don't think it's an issue for the user-pass string or
> username-value, since these are just literals that are passed in the
> clear to the server anyway.  Can't the server just use them as is?

I'm afraid not.

Remember, the server must not just be able to perform calculations using
the password, it must also be able to look up the appropriate username
entry in its database. If the client and the server use different
character sets at any stage of creating, updating or referencing this
database, there will be no match.
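
To illustrate the lookup problem, here is a minimal Python sketch. The
username "Bjørn" and the dictionary standing in for the server's database
are made up for the example, and it assumes the account was stored as UTF-8
while the client sends iso-8859-1:

```python
# Hypothetical username containing one non-ASCII character (U+00F8).
username = "Bj\u00f8rn"

# Assumption: the server stored the account keyed by the UTF-8 octets.
accounts = {username.encode("utf-8"): "account record"}

# The same name as sent by an iso-8859-1 client and by a UTF-8 client.
latin1_bytes = username.encode("iso-8859-1")
utf8_bytes = username.encode("utf-8")

print(latin1_bytes)              # b'Bj\xf8rn'
print(utf8_bytes)                # b'Bj\xc3\xb8rn'
print(latin1_bytes in accounts)  # False: the octets differ, no match
print(utf8_bytes in accounts)    # True
```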

However, I've just noticed that RFC 2616 actually does comment on this in
section 2.2, and requires RFC 2047 encoding for any TEXT not using
iso-8859-1 encoding.

The question then becomes: Should the errata of RFC 2617 override that 
requirement and mandate UTF-8, or should an extension to the current header 
methods be formulated, or should completely new authentication methods be 
formulated that will handle UTF-8 usernames/passwords?

One way of overriding that section would be to change the definitions of
usernames and passwords to use OCTET (minus control characters and
special characters) instead of TEXT. Something similar was proposed in the
thread referenced by Larry Masinter.

Personally I'd prefer to override RFC 2616 sec 2.2 for RFC 2617
credentials, as I think BCP 18 should be the guideline. If that is not
possible I'd like to avoid RFC 2047 syntax (e.g., which charset should be
used for the password in digest authentication, and how do we tell the
server?).

I can think of several alternatives if UTF-8 cannot be made mandatory:

Alternative 1: Specify that if all characters in both the username and
password are in the iso-8859-1 charset, then iso-8859-1 can be used; in all
other cases utf-8 is used. This will probably lead to some
username/password collisions; I do not know how serious this will be.
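
A rough Python sketch of the selection rule in Alternative 1, with one way
a collision can arise. The function name encode_credentials and the sample
credentials are hypothetical, chosen only to show the effect:

```python
from typing import Tuple

def encode_credentials(username: str, password: str) -> Tuple[bytes, bytes, str]:
    """Use iso-8859-1 if BOTH values fit in it, otherwise utf-8 for both."""
    try:
        return (username.encode("iso-8859-1"),
                password.encode("iso-8859-1"),
                "iso-8859-1")
    except UnicodeEncodeError:
        # At least one character is outside iso-8859-1.
        return username.encode("utf-8"), password.encode("utf-8"), "utf-8"

# User A: username U+00F8, password U+20AC (euro sign, not in iso-8859-1),
# so both values are sent as utf-8.
user_a = encode_credentials("\u00f8", "\u20ac")

# User B: username U+00C3 U+00B8, all characters within iso-8859-1.
user_b = encode_credentials("\u00c3\u00b8", "pw")

print(user_a[2], user_b[2])    # utf-8 iso-8859-1
print(user_a[0] == user_b[0])  # True: both usernames are the octets C3 B8
```

Without an explicit signal of which charset was used, the server cannot
tell these two usernames apart from the octets alone.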

Alternative 2: Extend RFC 2617 with two new methods (e.g.) "Basic8" and
"Digest8", with mostly the same syntax as the present "Basic" and "Digest"
methods, but with the requirement that username and password are encoded in
UTF-8. However, this will require the server to send one extra header for
each method it supports, and would probably need to be specified separately
as an RFC.

Alternative 3: Extend the syntax of "Basic" and "Digest" authentication
headers with a "utf-8" parameter which, when included in the server's
challenge, indicates that the server understands UTF-8 encoded usernames
and passwords. When a UTF-8 enabled client sees this parameter it can then
encode the username and password in UTF-8 and add a utf-8 parameter to the
authorization header it sends to the server to indicate that the
authorization is in UTF-8.

Examples:

  WWW-Authenticate: Basic realm="realm", utf-8
  WWW-Authenticate: Digest realm="realm", <digest parameters>, utf-8

  Authorization: Basic <basic-credentials>, utf-8
  Authorization: Digest <digest-response>, utf-8

By using a utf-8 parameter instead of a charset parameter, it's possible to
limit the charset permutations the client and the server have to be able
to handle.
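
As a sketch of how a client might act on the proposed parameter for Basic
(Python; the function names are hypothetical, the "utf-8" flag is only the
proposal above and not part of RFC 2617, and the comma splitting is
deliberately naive, so it would misread a realm containing ", utf-8,"):

```python
import base64

def server_accepts_utf8(challenge: str) -> bool:
    """Look for a bare utf-8 token among the challenge parameters."""
    _scheme, _, params = challenge.partition(" ")
    # Naive split: does not handle commas inside quoted strings.
    return any(p.strip().lower() == "utf-8" for p in params.split(","))

def basic_authorization(username: str, password: str, challenge: str) -> str:
    """Build the Authorization value, adding the flag only if offered."""
    use_utf8 = server_accepts_utf8(challenge)
    charset = "utf-8" if use_utf8 else "iso-8859-1"
    # Raises UnicodeEncodeError if iso-8859-1 cannot represent the input.
    user_pass = f"{username}:{password}".encode(charset)
    value = "Basic " + base64.b64encode(user_pass).decode("ascii")
    return (value + ", utf-8") if use_utf8 else value

print(basic_authorization("Bj\u00f8rn", "pw", 'Basic realm="realm", utf-8'))
print(basic_authorization("Bj\u00f8rn", "pw", 'Basic realm="realm"'))
```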

However, given that the A1 value for Digest authentication may be
calculated in advance and by a third-party server, that means that two A1
values must be prepared, and distributed, when non-US-ASCII
usernames/passwords are used (come to think of it, this will also be the
case if utf-8 is mandated, at least in a transition phase).
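
To make the "two A1 values" point concrete, a small Python sketch. It
assumes the precomputed value is the MD5 hash of
username ":" realm ":" password as in RFC 2617 sec. 3.2.2.2; the helper
name h_a1 and the sample credentials are hypothetical:

```python
import hashlib

def h_a1(username: str, realm: str, password: str, charset: str) -> str:
    """MD5 hex digest of A1, with the text encoded in the given charset."""
    a1 = f"{username}:{realm}:{password}".encode(charset)
    return hashlib.md5(a1).hexdigest()

# Non-US-ASCII username: the two charsets give two different hashes, so
# both would have to be stored if both kinds of client must be served.
latin1_hash = h_a1("Bj\u00f8rn", "realm", "pw", "iso-8859-1")
utf8_hash = h_a1("Bj\u00f8rn", "realm", "pw", "utf-8")
print(latin1_hash != utf8_hash)  # True

# Pure US-ASCII credentials: the octets are identical, one hash suffices.
print(h_a1("scott", "realm", "pw", "iso-8859-1")
      == h_a1("scott", "realm", "pw", "utf-8"))  # True
```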

Personally, as mentioned, I prefer making utf-8 mandatory, but if 
alternative specifications are needed I think alternatives 1 and 3 are the 
most acceptable of the alternatives above.

-- 
Sincerely,
Yngve N. Pettersen
 
********************************************************************
Senior Developer                     Email: yngve@opera.com
Opera Software ASA                   http://www.opera.com/
Phone:  +47 24 16 42 51              Fax:    +47 24 16 40 01
******************************************************************** 

Received on Sunday, 20 April 2003 23:03:52 UTC