- From: Tim Bond <tim.bond@webmethods.com>
- Date: Mon, 20 Dec 2004 11:24:04 -0500
- To: www-xkms@w3.org
Hello,

I had a conversation regarding the entering and encoding of passwords with a co-worker, Addison Phillips. He sent me the following . . .

-- Tim

Tim Bond
Security Architect
webMethods, Inc.


Hi Tim,

Of course passwords have I18N implications.

Historically, most systems have limited the use of non-ASCII characters in passwords. Basically, languages like Japanese that use an IME (input method editor) to compose characters from input won't work well for passwords, because the IME exposes the characters in the clear (or you can't choose the right character from a list). This is bad, so most systems disable IMEs in their password fields by default. However, any direct-input character (which may be nearly anything in Unicode) can be typed directly on some keyboard somewhere. These characters can go into a password field and require special consideration to support reliably.

The problem is the relationship between the character string the user typed and the byte array that is the security token. In a distributed system, where the login is presented on a device remote from the authenticating system, the conversion from keycodes to bytes is usually done using the local default encoding. If this is different from the character encoding used by the authenticating system, the login fails (mystifying the user). Browser Basic-Auth is an egregious example of this.

Net result: it makes sense to define a single universal character encoding for use on both ends. Unicode UTF-8 is a good choice because it is compact for ASCII passwords (the majority of passwords everywhere), yet can support any non-ASCII value, and does so in a predictable manner. It is a variable-width encoding (up to 4 bytes per character), which may be a factor for systems that limit the buffer length of the token. The resulting token can be stored in a transfer encoding such as Base64.
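(A minimal Python sketch of the mismatch and the fix described above; the pass phrase, and the choice of Latin-1 as the stray local default encoding, are illustrative assumptions:)

    import base64

    # The same character string yields different byte arrays under
    # different local default encodings, so the authenticating system
    # may receive a token it cannot match.
    phrase = "caf\u00e9"             # illustrative pass phrase, one non-ASCII character
    print(phrase.encode("latin-1"))  # b'caf\xe9'      (one local default encoding)
    print(phrase.encode("utf-8"))    # b'caf\xc3\xa9'  (another)

    # Agreeing on UTF-8 at both ends makes the byte array predictable;
    # the token can then be carried in a transfer encoding such as Base64.
    token = base64.b64encode(phrase.encode("utf-8"))
    print(token.decode("ascii"))     # Y2Fmw6k=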
If you wish to store passwords in the clear (why?), then XML provides mechanisms for encoding characters either directly or as entity values. The Character Model for the World Wide Web discusses this ad nauseam (http://www.w3.org/TR/charmod).

Transfer encodings encode bytes. The problem is knowing how to interpret bytes outside the ASCII range, or on non-ASCII systems (EBCDIC). Using a single, defined character encoding is the way to go.

You might want to refer the folks below to the W3C I18N WG and the www-international mailing list. The unicode mailing list (unicode@unicode.org) is another good resource.

Hope this helps. Please feel free to forward this message if it helps.

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: www-xkms-request@w3.org [mailto:www-xkms-request@w3.org] On Behalf Of Jose Kahan
Sent: Friday, December 17, 2004 1:50 PM
To: Ed Simon
Cc: www-xkms@w3.org
Subject: Re: Again, confusing 8.1

Hi Ed,

On Fri, Dec 17, 2004 at 10:28:03AM -0500, Ed Simon wrote:
> I generally agree with Jose and Guillermo's recommendations EXCEPT for the
> one about filtering out UTF-8 characters outside the ASCII 32-127 range.
> Unless there is a verifiable case to be made for disallowing non-Latin
> characters (eg. Korean pass phrases), I would not include that
> possibility. Ultimately, the pass phrase is just '1's and '0's, and all we
> are doing is saying how a human-readable/writable phrase can be
> consistently converted into binary; that MAY not always mean the end
> device has to understand Unicode, just binary. (I say MAY because I'm not
> a mobile device expert; I just want someone who is to say non-ASCII is a
> problem before we try to accommodate it.)

I agree with you that we should try to support all characters. In fact, after re-reading this part of the spec, I am confused as to how this shared string will be used and, in that light, I think my previous message was also confusing.

If the goal of 8.1 is to give advice on how to make shared strings that can be read over the phone, I didn't understand it as such; more text should be added to explain this, or even a separate subsection.

If 8.1 is giving an algorithm for converting a shared string (regardless of its content) into a key, using one-way functions (i.e., we won't be able to find the shared string from this key), then we can use any algorithm to generate it. This can include removing spaces, punctuation, and control characters, if any of these represent some cryptographic weakness (I'll let Stephen comment on this)... As you pointed out, the shared string is just a sequence of bytes, so it's not really important how they are converted. The result is then coded into Base64.

If, when converting the shared string into a key, there is a risk that the string has to be processed by tools that may not understand non-ASCII charsets or control characters, it does make sense to me to remove control characters and translate the string into UTF-8 first. I'll ask my colleagues to comment on tools having problems handling non-UTF-8 characters (which I think can happen). I'll also ask them to comment on the handling of UTF-8 in mobile devices, just to be on the safe side. If we can assume that those devices always work with UTF-8, or know how to translate their local charset into UTF-8, I think we would be all set.
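(A minimal Python sketch of the kind of conversion described above; the control-character filter, the use of SHA-1 as the one-way function, and the function name are illustrative assumptions, not what 8.1 specifies:)

    import base64
    import hashlib
    import unicodedata

    def shared_string_to_key(shared: str) -> str:
        # Remove control characters (Unicode category "Cc"), then translate
        # the string into UTF-8 before applying the one-way function.
        cleaned = "".join(ch for ch in shared
                          if unicodedata.category(ch) != "Cc")
        digest = hashlib.sha1(cleaned.encode("utf-8")).digest()  # one-way function
        return base64.b64encode(digest).decode("ascii")          # coded into Base64

    print(shared_string_to_key("correct horse battery staple"))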
> I would drop mention of "XML Encoding" and call it "UTF-8" encoding; not
> only do I think this is sensible from the outset, but it also gets rid of
> trying to process XMLese like entities etc. I confess that I have one
> question: I am not absolutely sure (eg. due to combining sequences) that
> there is always one and only one binary representation for every unique
> UTF-8-encoded pass phrase; Jose, can you verify that with a W3C UTF-8
> expert? A follow-up question would be whether we could use rules to
> canonicalise the UTF-8 (eg. do not use combining characters) if there is
> more than one binary representation.

I'm not sure I understand what you mean by combining sequences in this context. Let me first quote a slightly modified man 7 utf-8:

<quote>
The Unicode 3.0 character set occupies a 16-bit code space. UTF-8 is an encoding that represents characters as a sequence of octets, which can then be, e.g., transmitted in mail without problem. Unicode characters from 0 to 127 (0x00 to 0x7f) are encoded directly, without changes. Unicode characters beyond 127 are encoded using a multibyte sequence consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can appear as part of another character and there are no problems with, e.g., '\0' or '/'.
</quote>

Every character in the Unicode space can be represented by one or more bytes in UTF-8. There can be no collision between different characters, provided you keep all the bits. And you can combine characters that come from different charsets. If you code a pass phrase into UTF-8, you should be able to decode each character back into its original charset. (I'm not sure "charset" is the best term here... maybe "glyph" would be better.) If this is what you refer to as combining sequences (of different characters), there is no problem doing so in UTF-8.
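(A minimal Python sketch of the combining-sequence case Ed raises above: the same visible string has two UTF-8 byte forms, and Unicode NFC normalization is shown as one possible canonicalisation rule; nothing in 8.1 currently requires it:)

    import unicodedata

    # "e with acute accent" has two Unicode representations: precomposed
    # U+00E9, and the combining sequence U+0065 U+0301. They look identical
    # but encode to different UTF-8 byte sequences.
    precomposed = "caf\u00e9"
    combining = "cafe\u0301"
    print(precomposed.encode("utf-8"))  # b'caf\xc3\xa9'
    print(combining.encode("utf-8"))    # b'cafe\xcc\x81'

    # Normalizing to NFC (precomposed form) before encoding gives each
    # pass phrase a single canonical byte representation.
    print(unicodedata.normalize("NFC", combining).encode("utf-8"))  # b'caf\xc3\xa9'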
-jose

Received on Tuesday, 21 December 2004 03:21:27 UTC