- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 18 Apr 2002 13:42:26 +0900
- To: Dan Oscarsson <Dan.Oscarsson@trab.se>, ietf-charsets@iana.org, FYergeau@alis.com
While I'm definitely an advocate of NFC, this isn't and should not be part of the definition of UTF-8. Maybe Francois can find a good place to put in a pointer to NFC and UAX #15, but it definitely shouldn't be part of the normative definition.

Regards,   Martin.

At 13:44 02/04/17 +0200, Dan Oscarsson wrote:
>I would also very much like UTF-8 to require that Unicode
>normalisation form C has been applied to the encoded UCS text.
>Otherwise, the same character sequence can have different
>UTF-8 encodings.
>While it is technically possible to create overlong UTF-8
>sequences, they are forbidden in the document. This makes it
>impossible to encode the same ASCII character sequence in
>several ways. The same should apply to all characters in
>UCS: only one form should be allowed.
>As form C does not destroy any data and is the most compact,
>it is the best choice.
>So UTF-8 should REQUIRE the characters to be normalised
>using form C. (Note: text normalised using form KC will also
>work; if it is then normalised using form C it will result
>in the same text.)
>
>Having both the BOM removed and form C required will make
>handling of UTF-8 in software much simpler, as well as less
>prone to errors and security problems.
>
>        Dan
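A minimal sketch in Python (not part of the original exchange; it assumes the standard-library unicodedata module) of the ambiguity Dan describes: the same visible text can have two different UTF-8 byte sequences unless it is first normalised to form C.

    import unicodedata

    composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

    # Visually identical text, but the UTF-8 byte sequences differ.
    print(composed.encode("utf-8"))    # b'\xc3\xa9'
    print(decomposed.encode("utf-8"))  # b'e\xcc\x81'

    # Normalising both to form C (NFC) yields a single canonical encoding.
    nfc_a = unicodedata.normalize("NFC", composed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    assert nfc_a == nfc_b
    print(nfc_a.encode("utf-8"))       # b'\xc3\xa9' in both cases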
Received on Thursday, 18 April 2002 01:44:35 UTC