- From: Keld Jørn Simonsen <keld@dkuug.dk>
- Date: Fri, 10 Jan 2003 17:55:17 +0100
- To: Francois Yergeau <FYergeau@alis.com>
- Cc: ietf-charsets@iana.org
On Fri, Jan 10, 2003 at 11:23:46AM -0500, Francois Yergeau wrote:
> I wish to propose 2 changes to the UTF-8 draft:
>
> (1) restrict to 4-byte sequences, i.e. remove the 5- and 6-byte sequences
>
> (2) refer normatively to Unicode 3.2
>
> The rationale for (1) is that Unicode is restricted to the 0-10FFFF range of
> code points and therefore 5- and 6-byte sequences cannot occur. 10646 is
> not officially so restricted but has a policy to not encode anything past
> 10FFFF and has actually removed Private Use Areas beyond 10FFFF to
> accommodate Unicode. Another reason is that there is much Fear, Uncertainty
> and Doubt regarding this issue; an example is this mail excerpt received
> this morning on the ietf-822@imc.org list:

I think you should keep the specification aligned with 10646, also in the
interest of being liberal in what you accept, an old and good IETF practice.

> The rationale for (2) is that Unicode 3.2 now has a better, stricter
> definition of UTF-8 than 10646. Specifically, the difference concerns the
> encoding of surrogate code points, in the range D800-DFFF. 10646 only has a
> Note (presumably non-normative) pointing out that the mapping of those code
> points to UTF-8 is undefined; it doesn't make it an error to decode UTF-8 to
> those code points, although it discusses other error cases, and therefore
> opens the door to the dangerous practice of decoding double-surrogate 6-byte
> sequences into a single non-BMP character. The recent Unicode 3.2 spec of
> UTF-8 clearly and squarely forbids this practice and is therefore, IMHO,
> what the Internet spec of UTF-8 needs. Using Unicode is also more
> consistent with (1). 10646 could remain as the normative reference for the
> characters themselves.
>
> Opinions?

I think we should keep ourselves to open standards whenever possible,
and avoid industry standards like Unicode if we can. 10646 is pretty
explicit about not using surrogates in UTF-8, as far as I know. Always was.

Kind regards
keld
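
For readers following the thread, the two restrictions under discussion (no 5- or 6-byte sequences, and no decoding to surrogate code points D800-DFFF) can be illustrated with a small decoder. This is only a sketch of the stricter reading proposed in the quoted text, not text from either specification; the function name `decode_strict_utf8` and the error messages are hypothetical.

```python
from typing import List


def decode_strict_utf8(data: bytes) -> List[int]:
    """Decode UTF-8 bytes to code points, rejecting the forms the proposal would forbid."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Pick payload bits, continuation-byte count, and the smallest
        # code point that legitimately needs this sequence length.
        if b < 0x80:
            cp, need, low = b, 0, 0x0
        elif 0xC0 <= b <= 0xDF:
            cp, need, low = b & 0x1F, 1, 0x80
        elif 0xE0 <= b <= 0xEF:
            cp, need, low = b & 0x0F, 2, 0x800
        elif 0xF0 <= b <= 0xF7:
            cp, need, low = b & 0x07, 3, 0x10000
        else:
            # 0x80-0xBF is a stray continuation byte; 0xF8-0xFD would start
            # a 5- or 6-byte sequence; 0xFE and 0xFF are never valid.
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")

        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)

        if cp < low:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            # Point (2): surrogate code points must not appear in UTF-8.
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp > 0x10FFFF:
            # Point (1): nothing beyond U+10FFFF is accepted.
            raise ValueError(f"code point beyond 10FFFF at offset {i}")

        out.append(cp)
        i += 1 + need
    return out
```

Under these assumptions, U+10000 is accepted in its 4-byte form `F0 90 80 80`, while the "double-surrogate" 6-byte form `ED A0 80 ED B0 80` is rejected at its first three bytes, which decode to the surrogate D800.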
Received on Friday, 10 January 2003 11:55:59 UTC