New draft-yergeau-rfc2279bis-02.txt from Francois Yergeau on 2002-10-10 (ietf-charsets@w3.org from October to December 2002)

From: Francois Yergeau <FYergeau@alis.com>
Date: Thu, 10 Oct 2002 00:09:14 -0400
To: ietf-charsets@iana.org
Message-id: <F7D4BDA0E5A1D14B99D32C022AEB73660EB2FA@alis-2k.alis.domain>
Just submitted.  Apart from date and filename, the only changes are in
section 6 "Byte Order Mark".  They are extensive, in an attempt to
accommodate all the comments on the BOM.  Here's the new section 6:

6. Byte order mark (BOM)
<36>
   The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
   informally as "BYTE ORDER MARK" (abbreviated "BOM").  This character
   can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
   the BOM name hints at a second possible usage of the character:  to
   prepend a U+FEFF character to a stream of UCS characters as a
   "signature".  A receiver of such a serialized stream may then use the
   initial character as a hint that the stream consists of UCS
   characters and also to recognize which UCS encoding is involved and,
   with encodings having a multi-octet encoding unit, as a way to
   recognize the serialization order of the octets.  Because UTF-8
   has a single-octet encoding unit, this last function is useless and
   the BOM always appears as the octet sequence EF BB BF.
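
   As an illustration only, the following Python sketch shows the
   point above: U+FEFF serializes to a single fixed octet sequence in
   UTF-8, whereas in an encoding with a two-octet unit the octet order
   depends on the serialization.  The helper name hex_octets is
   hypothetical and exists only for this example.

```python
# Serialize U+FEFF in UTF-8 and in both UTF-16 byte orders.
BOM = "\ufeff"


def hex_octets(data: bytes) -> str:
    """Format octets as space-separated uppercase hex (illustrative helper)."""
    return " ".join(f"{octet:02X}" for octet in data)


utf8_signature = BOM.encode("utf-8")   # single-octet unit: no byte order
utf16_be = BOM.encode("utf-16-be")     # two-octet unit, big-endian
utf16_le = BOM.encode("utf-16-le")     # two-octet unit, little-endian

print("UTF-8:    ", hex_octets(utf8_signature))
print("UTF-16BE: ", hex_octets(utf16_be))
print("UTF-16LE: ", hex_octets(utf16_le))
```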
<37>
   It is important to understand that the character U+FEFF appearing at
   any position other than the beginning of a stream MUST be interpreted
   with the semantics for the zero-width non-breaking space, and MUST
   NOT be interpreted as a signature.  The Unicode standard suggests
   that an initial U+FEFF character interpreted as a signature may be
   stripped before processing the text.  Such stripping is necessary in
   some cases (e.g., when concatenating two strings, because otherwise
   the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
   SPACE" at the connection point), but might affect an external process
   at a different layer (such as a digital signature or a count of the
   characters) that is relying on the presence of all characters in the
   stream.  It is therefore RECOMMENDED to avoid stripping an initial
   U+FEFF interpreted as a signature without a good reason, to ignore it
   instead of stripping it when appropriate (such as for display) and to
   strip it only when really necessary.
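
   The difference between ignoring and stripping can be sketched in
   Python as follows.  This is a minimal illustration, assuming the
   text has already been decoded and that any initial U+FEFF is being
   treated as a signature; the function names are hypothetical.

```python
BOM = "\ufeff"


def has_signature(text: str) -> bool:
    """Only a U+FEFF in the first position can be a signature."""
    return text.startswith(BOM)


def text_for_display(text: str) -> str:
    """Ignore an initial U+FEFF for display; the caller's stored text is untouched.

    A U+FEFF anywhere else is left alone: it is a ZERO WIDTH NO-BREAK SPACE.
    """
    return text[1:] if has_signature(text) else text


def concatenate(first: str, second: str) -> str:
    """Strip the second string's initial U+FEFF, a case where stripping is needed.

    Otherwise the result would contain an unintended ZERO WIDTH NO-BREAK
    SPACE at the connection point.
    """
    return first + (second[1:] if has_signature(second) else second)


stored = BOM + "a" + BOM + "b"       # signature, then an interior U+FEFF
shown = text_for_display(stored)     # interior U+FEFF is preserved
joined = concatenate(BOM + "ab", BOM + "cd")
```

   Here the stored text keeps all of its characters, so a character
   count or digital signature computed over it at another layer is
   unaffected by what is shown to the user.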
<38>
   U+FEFF in the first position of a stream MAY be interpreted as a
   zero-width non-breaking space, and is not always a signature.  To
   reduce this uncertainty, Unicode 3.2 adds a new
   character, U+2060 "WORD JOINER", with exactly the same semantics and
   usage as U+FEFF except for the signature function, and strongly
   recommends its exclusive use for expressing word-joining semantics.
   Eventually, following this recommendation will make it all but
   certain that any initial U+FEFF is a signature, not an intended "ZERO
   WIDTH NO-BREAK SPACE".
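
   As a small illustration of that recommendation, a text producer can
   express word-joining semantics with U+2060 and leave U+FEFF to the
   signature function.  The sketch below assumes a Python runtime whose
   Unicode database is at version 3.2 or later; join_words is a
   hypothetical helper.

```python
import unicodedata

ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE, also used as the BOM
WORD_JOINER = "\u2060"  # added in Unicode 3.2, no signature function


def join_words(left: str, right: str) -> str:
    """Join two strings with U+2060 so no line break may occur between them."""
    return left + WORD_JOINER + right


joined = join_words("foo", "bar")

print(unicodedata.name(ZWNBSP))
print(unicodedata.name(WORD_JOINER))
print(joined.encode("utf-8"))
```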
<39>
   In the meantime, the uncertainty unfortunately remains and may affect
   Internet protocols.  Protocol specifications MAY restrict usage of
   U+FEFF as a signature in order to reduce or eliminate the potential
   ill effects of this uncertainty.  In the interest of striking a
   balance between the advantages (reduction of uncertainty) and
   drawbacks (loss of the signature function) of such restrictions, it
   is useful to distinguish a few cases:
<40>
   o  A protocol SHOULD forbid use of U+FEFF as a signature for those
      textual protocol elements that the protocol mandates to be always
      UTF-8, the signature function being totally useless in those
      cases.
<41>
   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labelled) transmission.
<42>
   o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol does not
      provide character encoding identification mechanisms, when a ban
      would be unenforceable, or when it is expected that
      implementations of the protocol will not be in a position to
      always use the mechanisms properly.  The latter two cases are
      likely to occur with larger protocol elements such as MIME
      entities, especially when implementations of the protocol will
      obtain such entities from file systems, from protocols that do not
      have encoding identification mechanisms for payloads (such as FTP)
      or from other protocols that do not guarantee proper
      identification of character encoding (such as HTTP).

<43>
   When a protocol forbids use of U+FEFF as a signature for a certain
   protocol element, then any initial U+FEFF in that protocol element
   MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE".  When a
   protocol does NOT forbid use of U+FEFF as a signature for a certain
   protocol element, then implementations SHOULD be prepared to handle a
   signature in that element and react appropriately: using the
   signature to identify the character encoding as necessary and
   stripping or ignoring the signature as appropriate.
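
   The two rules in the paragraph above can be sketched in Python as
   follows.  This is an illustration under stated assumptions, not a
   definitive implementation: the element is assumed to be UTF-8, the
   function name and the signature_forbidden flag are hypothetical,
   and stripping (rather than ignoring) the signature is an arbitrary
   choice made for the example.

```python
import codecs


def decode_element(data: bytes, signature_forbidden: bool):
    """Decode a UTF-8 protocol element; return (text, had_signature)."""
    # The plain "utf-8" codec keeps an initial U+FEFF in the decoded text.
    text = data.decode("utf-8")
    if signature_forbidden:
        # The protocol forbids signatures for this element, so any initial
        # U+FEFF is a ZERO WIDTH NO-BREAK SPACE and stays part of the text.
        return text, False
    if data.startswith(codecs.BOM_UTF8):
        # Signatures are not forbidden: treat the initial EF BB BF as one.
        # Assumption: this caller wants it stripped rather than ignored.
        return text[1:], True
    return text, False


forbidden = decode_element(b"\xef\xbb\xbfabc", signature_forbidden=True)
allowed = decode_element(b"\xef\xbb\xbfabc", signature_forbidden=False)
plain = decode_element(b"abc", signature_forbidden=False)
```

   Returning had_signature alongside the text lets a caller decide
   later whether the signature needs to be restored, for example
   before verifying a digital signature over the original octets.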

-- 
François Yergeau
Alis Technologies inc.
+1 514 747 2547
Received on Thursday, 10 October 2002 00:10:04 UTC