Proposed stringprep algorithm from Section 8.1

Hello folks,

After some discussions and feedback, here's a revised proposal for the
string2key algorithm in Section 8.1. This proposal takes into account
Ed Simon's authentication device remarks and the I18N string concerns.
I checked with Martin Duerst from the W3C I18N WG and he says the I18N 
part is fine with him.

See my notes also as to why I argue against adding case, space, and
punctuation folding into this algorithm.




[329a] The symmetric key data MAY be binary data (as from an
authentication device) or a human-readable value (numeric,
alphabetic, or both).  When it is binary data, no transformation is
needed; the data can be used directly as input to the MAC function.

[329b] When the symmetric key data is human-readable, it may be issued to
a human user in the form of a text string, which may in some circumstances
be read over a telephone line. It may be randomly generated and represent
an underlying numeric value, or it may be a password or phrase. In either
case, it is often convenient to present the value to the human user as a
string of characters in a character set that particular user
understands.  To limit the possibility of human error in processing the
symmetric key data, and to provide a canonical binary representation,
the text string MUST be compliant with the SASLprep stringprep profile
for user names and passwords [1].

[329c] The algorithm for canonicalizing a text string before feeding it
to the MAC function is the following:

  1. Convert the input string to a Unicode encoding.
     This removes the US-ASCII and ISO-LATIN-1 limitations! It lets
     a user type a password phrase that s/he can remember with ease
     or that is easy to type with his/her keyboard configuration.

  2. Verify that the input string is compliant with the SASLprep
     stringprep profile for user names and passwords [1]. Refuse
     the string otherwise.

   This operation consists of mapping and normalizing the characters in 
   the string, and checking that it doesn't have any forbidden characters.
   In particular, there's no folding of multiple spaces or of
   case. Punctuation symbols are not removed either. Tabs are 
   control characters and thus are considered to be forbidden.

  3. Encode the result into UTF-8.
  4. Apply the MAC function.
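The steps above can be sketched with Python's standard library, whose
stringprep module exposes the RFC 3454 tables that SASLprep builds on.
This is a simplified illustration, not a complete SASLprep
implementation: it omits the bidirectional-character and
unassigned-code-point checks, and the function name is mine.

```python
import stringprep
import unicodedata

def canonicalize(s: str) -> bytes:
    """Sketch of SASLprep canonicalization; refuses prohibited input."""
    # Mapping: non-ASCII spaces (table C.1.2) become SPACE; characters
    # "commonly mapped to nothing" (table B.1) are dropped.
    out = []
    for ch in s:
        if stringprep.in_table_c12(ch):
            out.append(' ')
        elif not stringprep.in_table_b1(ch):
            out.append(ch)
    s = ''.join(out)

    # Normalization: Unicode NFKC.
    s = unicodedata.normalize('NFKC', s)

    # Prohibited output: refuse the string rather than stripping it.
    # Tabs and other control characters fall into tables C.2.1/C.2.2.
    for ch in s:
        if (stringprep.in_table_c12(ch)
                or stringprep.in_table_c21_c22(ch)
                or stringprep.in_table_c3(ch) or stringprep.in_table_c4(ch)
                or stringprep.in_table_c5(ch) or stringprep.in_table_c6(ch)
                or stringprep.in_table_c7(ch) or stringprep.in_table_c8(ch)
                or stringprep.in_table_c9(ch)):
            raise ValueError("prohibited character: %r" % ch)

    # Encode the canonical form as UTF-8, ready for the MAC function.
    return s.encode('utf-8')
```

Note that case and inner spaces survive untouched: canonicalize("Correct
Horse") yields b"Correct Horse", while a string containing a tab is
refused.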


Adopting this algorithm means that we have to regenerate some of the
XKRSS examples and the related converted strings given in Appendix C.

For developers: if you stick to the US-ASCII range 32-126, you don't
have to change much, except to remove the upper/lower case conversion
and the space suppression. US-ASCII maps one-to-one into UTF-8. Your
application should make sure it doesn't use characters outside the
32-126 range.
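A sketch of the only check such a US-ASCII-only application needs (the
function name is mine):

```python
def is_printable_ascii(s: str) -> bool:
    # Printable US-ASCII (32-126) passes SASLprep unchanged, and its
    # UTF-8 encoding is byte-for-byte identical to its ASCII encoding.
    return all(32 <= ord(ch) <= 126 for ch in s)
```

Anything outside that range (accented letters, tabs, other control
characters) should be rejected before key derivation.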

If you do want to support Unicode, there are system libraries in C,
Java, Perl, and Win32 that do so already and can help you. Once you
encode your strings in Unicode, applying the SASLprep stringprep profile
means going through some tables and checking whether a character belongs
to them or not. There is also a library that already provides this
checking. See my notes.



1. My case against case folding

Summary: I don't see a reason why we should do case or space folding.

In 2), SASLprep doesn't propose case folding. I still don't agree with
imposing case folding (and reducing the password space) unless there
is a good reason. It seems to be a tradition for Internet applications
to work with lowercase strings. This makes sense to me if we are
talking about DNS domain names. Some cases where this tradition has
never been applied:

 - HTTP Authentication protocol [2]
 - Web server passwords in Apache are not caseless.

A case where case folding is not enforced, but has caused problems:

 - URLs of pages. For example, we have to run a special Apache module
   on our servers to take care of this. However, this is not done
   transparently; if the user types a URL using the wrong case, the
   server returns a redirection to the correctly spelled page.

I've asked Thomas Roessler, who recently joined W3C as a security
consultant, to give additional feedback on why case folding is not
appropriate here. He told me he'll do so once he has finished reading
the spec.

If the WG still wants to push for folding, it should give a valid reason
why it has to be done. Note that some languages are caseless, too.

2. About how to apply the SASLprep stringprep profile

Summary: I think it is better to have the application refuse
strings that contain forbidden characters rather than
silently removing them.

Some parts of the profile, like mapping and normalizing characters,
can be done transparently to the user. However, we then reach a step
called "prohibited output" (section 2.3). There are two ways to
approach this:

1. Have the application complain that the string has prohibited
   characters (mentioning the relevant ones) and ask the user to try
   again.

2. Have the application silently discard the prohibited characters.

In my personal experience, having the application silently discard
characters is wrong in this context. The user may think s/he typed
something when that wasn't the case.  If the application instead tells
the user that the string has invalid characters, the user knows that
s/he did something wrong and can try again.
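The contrast between the two approaches can be made concrete with a
small sketch (using Python's stdlib stringprep tables; only ASCII
control characters, table C.2.1, are checked here for brevity, and both
function names are mine):

```python
import stringprep

def refuse_prohibited(s: str) -> str:
    """Option 1: complain about prohibited characters."""
    bad = [ch for ch in s if stringprep.in_table_c21(ch)]
    if bad:
        raise ValueError("prohibited characters: %r" % bad)
    return s

def strip_prohibited(s: str) -> str:
    """Option 2: silently discard them (the behaviour argued against)."""
    return ''.join(ch for ch in s if not stringprep.in_table_c21(ch))
```

With silent stripping, "pass<TAB>word" and "password" collapse to the
same key material and the user never learns that the tab was dropped;
refusal surfaces the problem immediately.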

The only case I know of that works this way is standard Unix passwords,
which only use the first 8 characters and drop the rest. They do,
however, support control characters. It is recommended not to use
control characters because they may be hard to type over a telnet
connection, but this is not enforced. The user is free to type what
s/he wants.

What I would propose is that we stick to the SASLprep stringprep
profile and just say that the application MUST NOT accept strings
that have illegal characters, and recommend that the application return
an error message in such cases. How this is implemented is out of the
scope of the XKMS spec.

From the architecture point of view, we assume we have a string that
conforms to the SASLprep profile. From the implementation point of
view, by refusing strings that don't conform to this profile, we
remove the risk of a user being able to type a password on some
devices but not on others, without knowing why it only works
sometimes.

3. Some programming tools

   libiconv [3]. This library provides an iconv() implementation, for
   use on systems which don't have one, or whose implementation cannot
   convert from/to Unicode.

   libidn [4]. GNU Libidn is an implementation of the Stringprep,
   Punycode and IDNA specifications defined by the IETF
   Internationalized Domain Names (IDN) working group, used for
   internationalized domain names.

Both of these libraries also provide command line programs (at least on
my Debian sarge box), called iconv and idn. They can be used to get
familiar with stringprep and Unicode.

4. Some examples

- Converting a file from ISO-8859-1 to UTF-8:
  iconv -c -f ISO-8859-1 -t UTF-8 filename
  (substitute your source charset for the -f parameter).

Canonicalizing a string according to the SASLprep stringprep profile
(assuming a file contains the string in UTF-8):

  cat filename | CHARSET=UTF-8 idn --quiet -s -p SASLprep

idn returns the canonicalized string (in UTF-8) or returns an error if
there are forbidden characters in it.





(same string, same characters under UTF-8)

Likewise, converting:


Returns the same string:


Note that there was no folding or discarding of characters in both
of these examples.


Received on Wednesday, 22 December 2004 16:34:57 UTC