Re: UTF-8 and BOM from tgindin@us.ibm.com on 2000-08-23 (w3c-ietf-xmldsig@w3.org from July to September 2000)

From: <tgindin@us.ibm.com>
Date: Wed, 23 Aug 2000 13:39:21 -0400
To: "Joseph M. Reagle Jr." <reagle@w3.org>
cc: w3c-ietf-xmldsig@w3.org, duerst@w3.org
Message-ID: <85256944.0060FDBD.00@D51MTA04.pok.ibm.com>

     If we retain wording excluding BOM's from UTF-8, as we currently have
it, I think that we should exclude surrogates as well.
     The current text in section 6.5.1 reads "converts the character
encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
text in section 7 reads "that coded character set is UTF-8 (without a byte
order mark (BOM))"  The new text should probably read "... UTF-8 (without a
byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
conversion to UTF-8)" in both of these places.  I realize that RFC 2279
(not 2379) explicitly requires surrogate conversion while it fails to
mention BOM's for some reason, but the two issues are similar and many
implementors do not understand the surrogate issue.  The wording about
surrogates in versions 2.0 of the Unicode standard is actually somewhat
similar to the wording about the "reversed byte order mark" U+FFFE.

          Tom Gindin

Received on Wednesday, 23 August 2000 13:40:10 UTC