- From: Jose Kahan <jose.kahan@w3.org>
- Date: Fri, 17 Dec 2004 19:49:49 +0100
- To: Ed Simon <edsimon@xmlsec.com>
- Cc: www-xkms@w3.org
- Message-ID: <20041217184949.GN1217@inrialpes.fr>
Hi Ed,

On Fri, Dec 17, 2004 at 10:28:03AM -0500, Ed Simon wrote:
>
> I generally agree with Jose and Guillermo's recommendations EXCEPT for the
> one about filtering UTF-8 characters outside the ASCII32-127. Unless, there
> is a verifiable case to be made for disallowing non-Latin characters (eg.
> Korean pass phrases) I would not include that possibility. Ultimately, the
> pass phrase is just '1's and '0's and all we are doing is saying how a
> human-readable/writable phrase can be consistently converted into binary;
> that MAY not always mean the end device has to understand Unicode, just
> binary. (I say MAY because I'm not a mobile device expert, I just want
> someone who is to say non-ASCII is a problem before we try to accommodate
> it.)

I agree with you that we should try to support all characters.

In fact, after re-reading this part of the spec, I am confused as to how
this shared string will be used and, that being so, I think that my
previous message was also confusing.

If the goal of 8.1 is to give some advice about how to make shared
strings that can be read over the phone, I didn't understand it as such,
and more text should be added to explain this, or even a different
subsection.

If 8.1 is giving an algorithm for converting a shared string (regardless
of its content) into a key, using one-way functions (i.e., we won't be
able to find the shared string from this key), then we can use any
algorithm to generate it. This can include removing spaces, punctuation,
and control characters, if any of these represent some cryptographic
weakness (I'll let Stephen comment on this)... As you pointed out, the
shared string is just a sequence of bytes, so it's not really important
how they are converted. The result is then encoded in Base64.
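To make that second reading concrete, here is a minimal sketch of "shared string in, Base64 key out". It is an illustration only and not the algorithm in 8.1: the function name shared_string_to_key is hypothetical, SHA-256 stands in for whatever one-way function the spec prescribes, and stripping only control characters (Unicode category "Cc") is an assumption about what the filtering step might look like.

```python
import base64
import hashlib
import unicodedata


def shared_string_to_key(shared: str) -> str:
    """Hypothetical helper: derive a Base64 key from a shared string."""
    # Assumption: drop only control characters (general category "Cc").
    # Spaces, punctuation and non-Latin letters are all kept.
    cleaned = "".join(ch for ch in shared if unicodedata.category(ch) != "Cc")
    # Fix one byte representation of the text by encoding it as UTF-8.
    octets = cleaned.encode("utf-8")
    # One-way step; SHA-256 is a stand-in, not taken from the spec.
    digest = hashlib.sha256(octets).digest()
    # The result is transported as Base64 text.
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    print(shared_string_to_key("correct horse"))
    print(shared_string_to_key("암호 문구"))  # a Korean pass phrase works the same way
```

Nothing in the sketch needs to understand the characters themselves: once the string is UTF-8 octets, the remaining steps operate on bytes only.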
If, when converting the shared string into a key, there is a risk that
the string has to be processed by tools that may not understand
non-ASCII charsets or control characters, it does make sense to me to
remove control characters and translate the string into UTF-8 first.

I'll ask my colleagues to comment on tools having problems handling
non-UTF-8 characters (which I think can happen). I'll also ask my
colleagues to comment on the handling of UTF-8 in mobile devices, just
to be on the safe side. If we can assume that those devices always work
with UTF-8 or know how to translate their local charset into UTF-8, I
think we would be all set.

> I would drop mention of "XML Encoding" and call it "UTF-8" encoding; not
> only do I think this is sensible from the outset but it also gets rid of
> trying to process XMLese like entities etc. I confess that I have one
> question which is I am not absolutely sure (eg. due to combining sequences)
> there is always one and only one binary representation for every unique
> UTF-8-encoded pass phrase; Jose, can you verify that with a W3C UTF-8
> expert. A follow-up question would be whether we could use rules to
> canonicalise the UTF-8 (eg. do not use combining characters) if there is
> more than one binary representation.

I'm not sure I understand what you mean by combining sequences in this
context. Let me first quote a slightly modified man 7 utf-8:

<quote>
The Unicode 3.0 character set occupies a 16-bit code space. UTF-8 is an
encoding that allows characters to be represented as a sequence of
octets, which can then be, e.g., transmitted by mail without problems.

Unicode characters from 0 to 127 are encoded directly, without changes.

Unicode characters beyond 127 are encoded using a multibyte sequence
consisting only of bytes in the range 0x80 to 0xfd, so no ASCII byte can
appear as part of another character and there are no problems with e.g.
'\0' or '/'.
0x00 to 0x7f
</quote>

Every character in Unicode space can be represented by one or more bytes
in UTF-8. There can be no collision between different characters,
provided you keep all the bits. And you can combine characters that come
from different charsets. If you encode a pass phrase in UTF-8, you
should be able to decode each character back into its original charset.
(I'm not sure if charset is the best term here... maybe glyph would be
better.)

If this is what you refer to as combining sequences (of different
characters), there's no problem doing so in UTF-8.

-jose
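For reference on Ed's question about combining sequences, a small sketch using Python's standard unicodedata module. It shows the case he may have in mind: the same visible character written either as one precomposed code point or as a base letter plus a combining mark gives two different UTF-8 byte strings, unless the text is normalized before encoding. The helper name canonical_octets is hypothetical, and picking NFC is an assumption for illustration, not something the spec requires.

```python
import unicodedata

precomposed = "\u00e9"   # "é" as a single code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def canonical_octets(text: str) -> bytes:
    """Hypothetical helper: normalize to NFC, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


if __name__ == "__main__":
    # Same rendered character, different octets when encoded as-is.
    print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
    print(combining.encode("utf-8"))    # b'e\xcc\x81'
    # After normalization both spellings give the same octets.
    print(canonical_octets(precomposed) == canonical_octets(combining))  # True
```

UTF-8 itself maps each code point to exactly one byte sequence; the ambiguity, where it exists, sits one level up in the choice of code points.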
Received on Friday, 17 December 2004 18:50:26 UTC