Re: RFC 2617: Which character should be used? from Alexey Melnikov on 2003-04-16 (ietf-http-wg@w3.org from April to June 2003)

From: Alexey Melnikov <mel@messagingdirect.com>
Date: Tue, 15 Apr 2003 18:11:45 -0600
To: yngve@opera.com
CC: ietf-http-wg@w3.org
Message-ID: <3E9C9FC1.31121343@messagingdirect.com>
Yngve Nysaeter Pettersen wrote:

> Hi,

Hi,

> My name is Yngve N. Pettersen, I am a developer at Opera Software ASA, the
> company producing the Opera browser. One of my areas of responsibility is
> our HTTP protocol support.
>
> Some time ago, while implementing Opera's support for international
> character sets I discovered that RFC 2617 did not specify the character set
> to be used when encoding the username and password arguments for Basic and
> Digest authentication.
>
> Given that BCP 18/RFC 2277 strongly encouraged UTF-8 support in protocols,
> and that it may be impossible to determine the server's preferred
> character set, among other reasons, I decided to use UTF-8 as the
> character set when encoding the username and password before generating the
> authentication strings.
>
> Recently we received a report concerning problems with this way of
> generating authentication strings (apparently other clients do not
> convert national characters in Western European languages, at least, I
> don't know how they treat Asian languages), and while researching the
> current state of the protocol, I noticed that the current errata does not
> address this point.
>
> I would therefore like to suggest that an item specifying which character
> set should be used when generating Basic and Digest authentication strings
> is added to the errata.
>
> My suggestion is that UTF-8 is selected as the character set used to encode
> the username and password values when creating the "user-pass" string (sec.
> 2) and the "username-value" and "passwd" strings in sec. 3.2.2. It might
> also be an idea to specify the same for other text attributes as well.
>
> As mentioned above BCP 18 indicates UTF-8 is the preferred charset for
> protocols.
>
> Additionally, I believe it would be very difficult to create a foolproof
> guessing method that would decide the charset based on such things as the
> charset of the authentication challenge response body, top-level domain of
> the server, or the same from the referrer (if any), or the character set
> used on the client's computer (which may not match what is used on the
> server). As an example, the challenge may use a default message in English,
> while passwords and documents are encoded in a Japanese character set.
>
> I think the best way of avoiding (any further) ambiguities is to specify a
> single character set that MUST be used, and UTF-8 is the character set
> recommended by BCP 18.

I am not a big expert on this, but here is some information for you.

You are right, RFC 2617 doesn't specify any character set. RFC 2617 is a
revision of RFC 2069, which predates RFC 2277. So I suspect that when RFC
2069 was revised, nobody noticed this issue.
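
To make the ambiguity concrete, here is a minimal Python sketch of how the
Basic "user-pass" string comes out differently depending on the character set
the client picks. The helper name basic_credentials and the sample
username/password are made up for illustration; they are not taken from any
RFC or from Opera's code.

```python
import base64


def basic_credentials(username: str, password: str, charset: str) -> str:
    """Build the base64 part of a Basic Authorization header.

    Hypothetical helper: RFC 2617 does not say which charset to use,
    so the caller has to choose one.
    """
    user_pass = f"{username}:{password}".encode(charset)
    return base64.b64encode(user_pass).decode("ascii")


# Made-up credentials containing non-ASCII Western European characters.
latin1_value = basic_credentials("yngve", "blåbær", "iso-8859-1")
utf8_value = basic_credentials("yngve", "blåbær", "utf-8")

# The two clients send different octets for the same typed password.
print("ISO 8859-1:", latin1_value)
print("UTF-8:     ", utf8_value)
```

A server that compares the decoded octets against a stored password will accept
only the variant that matches the charset it assumed.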

RFC 2831 (Using Digest Authentication as a SASL Mechanism) which is based on
RFC 2617 had to deal with this as well. RFC 2831 has a phrase "The directive is
needed for backwards compatibility with HTTP Digest, which only supports ISO
8859-1." which suggests that ISO 8859-1 is the default for HTTP. RFC 2831 had
to add a new "charset" directive and a complex rule to convert UTF-8
usernames/passwords [that can be fully expressed as ISO 8859-1] to ISO 8859-1.
This is a mess :-(.
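
As a rough illustration of that conversion rule, a simplified Python sketch
follows. It only shows the idea "use ISO 8859-1 when every character fits,
otherwise keep UTF-8"; the function name digest_md5_octets is hypothetical,
and the exact wording in RFC 2831 should be consulted before relying on it.

```python
def digest_md5_octets(text: str) -> bytes:
    """Pick the octets for a username or password (simplified sketch).

    Assumption: if the whole string can be expressed in ISO 8859-1,
    it is converted to ISO 8859-1; otherwise it is left as UTF-8.
    """
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character is outside ISO 8859-1: keep UTF-8.
        return text.encode("utf-8")


# Made-up sample values.
print(digest_md5_octets("blåbær"))   # fits in ISO 8859-1
print(digest_md5_octets("日本語"))    # does not fit, stays UTF-8
```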

So, although I tend to agree with your choice to use UTF-8, it seems
that the reality is a bit more complicated than that.

Regards,
Alexey Melnikov
__________________________________________
R & D, ACI Worldwide/MessagingDirect
Watford, UK

Work Phone: +44 1923 81 2877
Home Page: http://orthanc.ab.ca/mel
IETF standard
related pages: http://orthanc.ab.ca/mel/devel/Links.html

I speak for myself only, not for my employer.
__________________________________________
Received on Tuesday, 15 April 2003 20:08:26 UTC