RE: RFC 2617: Which character should be used? from Larry Masinter on 2003-04-17 (ietf-http-wg@w3.org from April to June 2003)

From: Larry Masinter <LMM@acm.org>
Date: Wed, 16 Apr 2003 21:56:13 -0700
To: "'Alexey Melnikov'" <mel@messagingdirect.com>, <yngve@opera.com>
Cc: <ietf-http-wg@w3.org>
Message-ID: <001601c3049d$ad20cb50$6ace8642@MASINTERPAD>
The topic was discussed in the HTTP working group in September 1998.
http://lists.w3.org/Archives/Public/ietf-http-wg-old/1998SepDec/0040.html
and the end of the dialog was:


   How about we restrict user-ids and typed-in passwords to US-ASCII for
   now, declare encoding of non-ASCII characters in those fields undefined
   but explicitly forbid use of localized charsets (e.g., ISO-8859-1)?  Then
   we can amend it to use UTF-8 later with a spec that progresses separately
   on the standards track.


However, this apparently didn't make it into the spec. I think that,
as far as 'errata' go, what's missing is a pointer to the discussion and
this particular 'resolution'.

I admit that it's not a great resolution of the issue (we basically just
avoided it), but that's at least a baseline: HTTP isn't as functional
as one would like in this area.

I'm less sanguine about mandating new behavior at this point, for
the same reasons given in 1998. So I'm not sure there is a "fix",
unfortunately.  Clients can do things that servers won't understand,
and servers can expect clients to do things that they can't.
Non-ASCII user names will be unreliable until new behavior
is proposed and accepted by the community.
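
To make the interoperability problem concrete, here is a minimal Python
sketch (not taken from any spec or implementation) of how the Basic
"user-pass" string diverges depending on the charset a client picks. The
helper name basic_credentials and the sample non-ASCII user name and
password are hypothetical.

```python
import base64


def basic_credentials(user: str, password: str, charset: str) -> str:
    """Return base64(user ":" password) with the text encoded in `charset`.

    Illustrative helper only; RFC 2617 does not say which charset to use.
    """
    user_pass = f"{user}:{password}".encode(charset)
    return base64.b64encode(user_pass).decode("ascii")


# Hypothetical credentials containing Western European characters.
latin1_token = basic_credentials("bjørn", "blåbær", "iso-8859-1")
utf8_token = basic_credentials("bjørn", "blåbær", "utf-8")

# The two clients send different tokens for the same typed-in text,
# so a server that assumes one charset rejects the other client.
print(latin1_token != utf8_token)

# Pure US-ASCII credentials are byte-identical under both charsets.
print(basic_credentials("alice", "secret", "iso-8859-1")
      == basic_credentials("alice", "secret", "utf-8"))
```

Only the US-ASCII case is safe regardless of what either side assumes,
which is why the 1998 proposal restricted the fields to US-ASCII.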

The notion (in 1998) was that someone would write a spec that would
progress independently on the standards track (or at least through Proposed
and Draft) so that we could update HTTP to full standard some day.

Larry
-- 
http://larry.masinter.net


-----Original Message-----
From: ietf-http-wg-request@w3.org [mailto:ietf-http-wg-request@w3.org] On Behalf Of Alexey Melnikov
Sent: Tuesday, April 15, 2003 5:12 PM
To: yngve@opera.com
Cc: ietf-http-wg@w3.org
Subject: Re: RFC 2617: Which character should be used?



Yngve Nysaeter Pettersen wrote:

> Hi,

Hi,

> My name is Yngve N. Pettersen, I am a developer at Opera Software ASA, the
> company producing the Opera browser. One of my areas of responsibility is
> our HTTP protocol support.
>
> Some time ago, while implementing Opera's support for international
> character sets I discovered that RFC 2617 did not specify the character set
> to be used when encoding the username and password arguments for Basic and
> Digest authentication.
>
> Given that BCP 18/RFC 2277 strongly encouraged UTF-8 support in protocols,
> and that it may be impossible to determine the server's preferred
> character set, among other reasons, I decided to use UTF-8 as the
> character set when encoding the username and password before generating the
> authentication strings.
>
> Recently we received a report concerning problems with this way of
> generating authentication strings (apparently other clients do not
> convert national characters in Western European languages, at least; I
> don't know how they treat Asian languages), and while researching the
> current state of the protocol, I noticed that the current errata does not
> address this point.
>
> I would therefore like to suggest that an item specifying which character
> set should be used when generating Basic and Digest authentication strings
> is added to the errata.
>
> My suggestion is that UTF-8 is selected as the character set used to encode
> the username and password values when creating the "user-pass" string (sec.
> 2) and the "username-value" and "passwd" strings in sec. 3.2.2. It might
> also be an idea to specify the same for other text attributes as well.
>
> As mentioned above BCP 18 indicates UTF-8 is the preferred charset for
> protocols.
>
> Additionally, I believe it would be very difficult to create a foolproof
> guessing method that would decide the charset based on such things as the
> charset of the authentication challenge response body, toplevel domain of
> the server, or the same from the referrer (if any), or the character set
> used on the client's computer (which may not match what is used on the
> server). As an example, the challenge may use a default message in English,
> while passwords and documents are encoded in a Japanese character set.
>
> I think the best way of avoiding (any further) ambiguities is to specify a
> single character set that MUST be used, and UTF-8 is the character set
> recommended by BCP 18.
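
The same ambiguity affects Digest, where the user name and password feed
into a hash rather than being sent directly. As a rough Python sketch,
assuming the non-session A1 construction from RFC 2617 sec. 3.2.2
(username ":" realm ":" passwd, hashed with MD5), the result depends on
the charset used before hashing. The helper name digest_ha1, the realm,
and the credentials are hypothetical.

```python
import hashlib


def digest_ha1(username: str, realm: str, password: str, charset: str) -> str:
    """Return hex MD5 of username ":" realm ":" password encoded in `charset`.

    Illustrative helper only; the charset choice is exactly what
    RFC 2617 leaves unspecified.
    """
    a1 = f"{username}:{realm}:{password}".encode(charset)
    return hashlib.md5(a1).hexdigest()


# Hypothetical realm and non-ASCII credentials.
ha1_latin1 = digest_ha1("bjørn", "example", "blåbær", "iso-8859-1")
ha1_utf8 = digest_ha1("bjørn", "example", "blåbær", "utf-8")

# Different input bytes give different hashes, so client and server
# must agree on the charset or authentication fails.
print(ha1_latin1 != ha1_utf8)
```

Unlike Basic, the server cannot even inspect the bytes to guess the
charset here, since it only ever sees the hash-derived response.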

Although I am not a big expert on this, here is some information for
you.

You are right, RFC 2617 doesn't specify any character set. RFC 2617 is a
revision of RFC 2069 which predates RFC 2277. So I suspect that when
RFC 2069 was revised, nobody noticed this issue.

RFC 2831 (Using Digest Authentication as a SASL Mechanism), which is based on
RFC 2617, had to deal with this as well. RFC 2831 has a phrase "The directive is
needed for backwards compatibility with HTTP Digest, which only supports ISO
8859-1." which suggests that ISO 8859-1 is the default for HTTP. RFC 2831 had
to add a new "charset" directive and a complex rule to convert UTF-8
usernames/passwords [that can be fully expressed as ISO 8859-1] to ISO 8859-1.
This is a mess :-(.
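
As a rough illustration of that kind of rule (a simplified Python sketch,
not the exact text of RFC 2831): a value is converted to ISO 8859-1 only
when every character fits in that charset, and is otherwise left as UTF-8.
The function name downgrade_if_latin1 and the sample strings are
hypothetical.

```python
def downgrade_if_latin1(value: str) -> bytes:
    """Encode as ISO 8859-1 when fully expressible, otherwise as UTF-8.

    Simplified sketch of the conversion described above; the real rule
    in RFC 2831 has further conditions not modelled here.
    """
    try:
        return value.encode("iso-8859-1")
    except UnicodeEncodeError:
        # At least one character lies outside ISO 8859-1.
        return value.encode("utf-8")


# Western European text fits in ISO 8859-1: one byte per character.
print(downgrade_if_latin1("blåbær"))

# Japanese text does not fit, so the UTF-8 bytes are kept.
print(downgrade_if_latin1("パスワード"))
```

The awkward part is that the byte representation now depends on the
content of the string, not just on an agreed charset.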

So, although I tend to agree with your choice to use UTF-8, it seems
that the reality is a bit more complicated than that.

Regards,
Alexey Melnikov
__________________________________________
R & D, ACI Worldwide/MessagingDirect
Watford, UK

Work Phone: +44 1923 81 2877
Home Page: http://orthanc.ab.ca/mel
IETF standard related pages: http://orthanc.ab.ca/mel/devel/Links.html

I speak for myself only, not for my employer.
__________________________________________
Received on Thursday, 17 April 2003 00:56:28 UTC