RE: [www-international] <none> from Yoshito_Umaoka@lotus.co.jp on 2001-10-25 (www-international@w3.org from October to December 2001)

From: <Yoshito_Umaoka@lotus.co.jp>
Date: Thu, 25 Oct 2001 12:17:41 -0400
To: www-international@w3.org
Message-ID: <OFC5422079.2812E23A-ON85256AF0.00548B67-85256AF0.00596F2E@lotus.com>
>I hope this is an appropriate question for www-international. It concerns
>the behavior of HTTP authentication in a multilingual environment.

I struggled with the same issue before.

HTTP Basic authentication is defined in RFC2617.  The definitions for
"userid" and "password" are below.

> credentials = "Basic" basic-credentials
> basic-credentials = base64-user-pass
> base64-user-pass  = <base64 [4] encoding of user-pass, except not
>                      limited to 76 char/line>
> user-pass   = userid ":" password
> userid      = *<TEXT excluding ":">
> password    = *TEXT
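
To make the grammar above concrete, here is a minimal Python sketch of how
a client might build the credentials.  The function name
`basic_credentials` and its `charset` parameter are hypothetical, and the
ISO-8859-1 default is an assumption for illustration; RFC2617 itself does
not name an encoding at this point.

```python
import base64


def basic_credentials(userid: str, password: str,
                      charset: str = "iso-8859-1") -> str:
    """Build an Authorization header value for Basic authentication.

    Hypothetical helper; the charset default is an assumption.
    """
    # userid = *<TEXT excluding ":">, so a colon would make the
    # user-pass string ambiguous when the server splits it.
    if ":" in userid:
        raise ValueError('userid must not contain ":"')
    # user-pass = userid ":" password
    user_pass = f"{userid}:{password}".encode(charset)
    # b64encode emits no line breaks, which matches
    # "except not limited to 76 char/line".
    return "Basic " + base64.b64encode(user_pass).decode("ascii")
```

With the example from RFC2617, `basic_credentials("Aladdin", "open sesame")`
gives `Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==`.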

Based on these definitions, "userid" can be any TEXT excluding ":" and
"password" can be any TEXT.  RFC2617 inherits its rules from RFC2616, and
the TEXT rule in RFC2616 is defined as below.

> OCTET          = <any 8-bit sequence of data>
> CHAR           = <any US-ASCII character (octets 0 - 127)>
> UPALPHA        = <any US-ASCII uppercase letter "A".."Z">
> LOALPHA        = <any US-ASCII lowercase letter "a".."z">
> ALPHA          = UPALPHA | LOALPHA
> DIGIT          = <any US-ASCII digit "0".."9">
> CTL            = <any US-ASCII control character
>                  (octets 0 - 31) and DEL (127)>
> CR             = <US-ASCII CR, carriage return (13)>
> LF             = <US-ASCII LF, linefeed (10)>
> SP             = <US-ASCII SP, space (32)>
> HT             = <US-ASCII HT, horizontal-tab (9)>
> <">            = <US-ASCII double-quote mark (34)>

> HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
> protocol elements except the entity-body (see appendix 19.3 for
> tolerant applications). The end-of-line marker within an entity-body
> is defined by its associated media type, as described in section 3.7.
>
>     CRLF           = CR LF
>
> HTTP/1.1 header field values can be folded onto multiple lines if the
> continuation line begins with a space or horizontal tab. All linear
> white space, including folding, has the same semantics as SP. A
> recipient MAY replace any linear white space with a single SP before
> interpreting the field value or forwarding the message downstream.
>
>      LWS            = [CRLF] 1*( SP | HT )
>
>  The TEXT rule is only used for descriptive field contents and values
>  that are not intended to be interpreted by the message parser. Words
>  of *TEXT MAY contain characters from character sets other than ISO-
>  8859-1 [22] only when encoded according to the rules of RFC 2047
>  [14].
>
>     TEXT           = <any OCTET except CTLs,
>                       but including LWS>

So the standard says -

1. You can use any characters defined in ISO-8859-1 for "userid" and
   "password".
2. If you want to use a charset other than ISO-8859-1 for "userid" and
   "password", you must encode the string data according to the rules
   defined in RFC2047 (MIME encoded-word).

However, no browser supports the standard as far as I know.
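
For illustration only, the two rules could be sketched in Python as
follows.  The function name `encode_credential_text` is hypothetical, and
the choice of UTF-8 with the "B" (base64) form of the RFC2047 encoded-word
is an assumption; RFC2047 also allows the "Q" form and other charsets.
Since no browser appears to do this, a server is unlikely to receive such
values in practice.

```python
import base64


def encode_credential_text(value: str) -> str:
    """Return value as-is if it fits ISO-8859-1, else an RFC2047 encoded-word.

    Hypothetical helper; does not check for control characters.
    """
    try:
        # Rule 1: anything representable in ISO-8859-1 can be used directly.
        value.encode("iso-8859-1")
        return value
    except UnicodeEncodeError:
        # Rule 2: other characters go into an encoded-word of the form
        # =?charset?encoding?encoded-text?=
        # Assumption: UTF-8 charset with "B" (base64) encoding.
        payload = base64.b64encode(value.encode("utf-8")).decode("ascii")
        # RFC2047 limits one encoded-word to 75 characters; a long value
        # would have to be split, which this sketch does not attempt.
        return "=?UTF-8?B?" + payload + "?="
```

A Latin-1 string such as "café" passes through unchanged, while a Japanese
string comes back wrapped as `=?UTF-8?B?...?=`.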

>It seems that the authentication data sent from the browser in response to a
>server request is supplied in the platform codepage of the system that the
>browser is running on.
>
>In other words, on Japanese windows, the username comes back in cp932, on a
>french windows machine, the username comes back in cp1252, on a Solaris
>machine, it comes back in whatever the platform encoding is set to..

Yes.  I found the same thing.  I was also considering a solution similar to
your idea: detecting user agent information such as OS, browser software,
and Accept-Language.  But I finally decided not to support "userid" and
"password" values outside ISO-8859-1, for the following two reasons.

1. I did not want to introduce that kind of ambiguity into the
   authentication logic.
2. If the standard is revised someday, the hack may cause harder
   problems: backward compatibility vs. the standard.

So my conclusion was that the authentication code should treat any
non-ASCII bytes (byte > 0x7f) as ISO-8859-1.  For now, it can support only
Latin-1 "userid" and "password", but that was the best I could do.

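A rough Python sketch of that server-side approach is below.  It is not
the actual code I used; the function name `parse_basic_credentials` is
hypothetical, and rejecting a value without ":" is an assumption about how
a server might handle malformed input.

```python
import base64
from typing import Tuple


def parse_basic_credentials(header_value: str) -> Tuple[str, str]:
    """Split a 'Basic ...' Authorization value into (userid, password).

    Hypothetical helper: all bytes are interpreted as ISO-8859-1.
    """
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    raw = base64.b64decode(encoded.strip())
    # Every byte value 0x00-0xff maps to a character in ISO-8859-1, so
    # this decode never fails, whatever codepage the browser used.
    user_pass = raw.decode("iso-8859-1")
    # userid cannot contain ":", so the first colon is the separator;
    # the password may contain further colons.
    userid, sep, password = user_pass.partition(":")
    if not sep:
        raise ValueError('missing ":" between userid and password')
    return userid, password
```

Bytes sent in cp932 or another platform codepage are accepted without
error but come out as different Latin-1 characters, so only Latin-1
"userid" and "password" values match what the user typed.
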
- Yoshito Umaoka
Received on Thursday, 25 October 2001 12:18:19 UTC