Re: draft-yergeau-rfc2279bis-04.txt... from Patrik Fältström on 2003-03-06 (ietf-charsets@w3.org from January to March 2003)

From: Patrik Fältström <paf@cisco.com>
Date: Thu, 06 Mar 2003 10:37:12 +0100
To: Francois Yergeau <FYergeau@alis.com>
Cc: ietf-charsets@iana.org
Message-id: <354D669A-4FB7-11D7-82B4-0003934B2128@cisco.com>
Thanks!

If no one has any issues with this, I hereby declare this done, and I 
will take over from here.

Francois, are your findings from the earlier interoperability tests 
available on a web page somewhere?

   paf

On Monday, Feb 17, 2003, at 21:43 Europe/Stockholm, Francois Yergeau 
wrote:

> ...was just submitted to I-D and follows.
>
> Changes are editorial only and designed to meet all the nits required
> for RFC publication (cf. http://www.ietf.org/ID-nits.html).
>
> - Added Intellectual Property Statement near the end.
>
> - Added a few missing people in Acknowledgements.
>
> - Shortened the Changes section to list only significant changes from
> RFC 2279.
>
> - Used compact mode to save trees.  Gone from 22 down to 15 pages.
>
> -- 
> François Yergeau
>
>
>
>
> Network Working Group                                         F. Yergeau
> Internet-Draft                                         Alis Technologies
> Expires: August 18, 2003                               February 17, 2003
>
>
>               UTF-8, a transformation format of ISO 10646
>                       draft-yergeau-rfc2279bis-04
>
> Status of this Memo
>
>    This document is an Internet-Draft and is in full conformance with
>    all provisions of Section 10 of RFC2026.
>
>    Internet-Drafts are working documents of the Internet Engineering
>    Task Force (IETF), its areas, and its working groups. Note that other
>    groups may also distribute working documents as Internet-Drafts.
>
>    Internet-Drafts are draft documents valid for a maximum of six months
>    and may be updated, replaced, or obsoleted by other documents at any
>    time. It is inappropriate to use Internet-Drafts as reference
>    material or to cite them other than as "work in progress."
>
>    The list of current Internet-Drafts can be accessed at
>    http://www.ietf.org/ietf/1id-abstracts.txt.
>
>    The list of Internet-Draft Shadow Directories can be accessed at
>    http://www.ietf.org/shadow.html.
>
>    This Internet-Draft will expire on August 18, 2003.
>
> Copyright Notice
>
>    Copyright (C) The Internet Society (2003). All Rights Reserved.
>
> Abstract
>
>    ISO/IEC 10646-1 defines a large character set called the Universal
>    Character Set (UCS) which encompasses most of the world's writing
>    systems. The originally proposed encodings of the UCS, however, were
>    not compatible with many current applications and protocols, and this
>    has led to the development of UTF-8, the object of this memo. UTF-8
>    has the characteristic of preserving the full US-ASCII range,
>    providing compatibility with file systems, parsers and other software
>    that rely on US-ASCII values but are transparent to other values.
>    This memo obsoletes and replaces RFC 2279.
>
>
>
>
>
>
>
> Yergeau                 Expires August 18, 2003                 [Page 1]
>
> Internet-Draft                   UTF-8                     February 2003
>
>
> Table of Contents
>
>    1.  Introduction
>    2.  Notational conventions
>    3.  UTF-8 definition
>    4.  Syntax of UTF-8 Byte Sequences
>    5.  Versions of the standards
>    6.  Byte order mark (BOM)
>    7.  Examples
>    8.  MIME registration
>    9.  IANA Considerations
>    10. Security Considerations
>    11. Acknowledgements
>    12. Changes from RFC 2279
>        Normative references
>        Informative references
>        Author's Address
>        Intellectual Property and Copyright Statements
>
> 1. Introduction
>
>    ISO/IEC 10646 [ISO.10646] defines a large character set called the
>    Universal Character Set (UCS), which encompasses most of the world's
>    writing systems. The same set of characters is defined by the Unicode
>    standard [UNICODE], which further defines additional character
>    properties and other application details of great interest to
>    implementers.  Up to the present time, changes in Unicode and
>    amendments and additions to ISO/IEC 10646 have tracked each other, so
>    that the character repertoires and code point assignments have
>    remained in sync.  The relevant standardization committees have
>    committed to maintain this very useful synchronism.
>
>    ISO/IEC 10646 and Unicode define several encoding forms of their
>    common repertoire: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32. In an
>    encoding form, each character is represented as one or more encoding
>    units. All standard UCS encoding forms except UTF-8 have an encoding
>    unit larger than one octet, making them hard to use in many current
>    applications and protocols that assume 8 or even 7 bit characters.
>
>    UTF-8, the object of this memo, has a one-octet encoding unit. It
>    uses all bits of an octet, but has the quality of preserving the full
>    US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one
>    octet having the normal US-ASCII value, and any octet with such a
>    value can only stand for a US-ASCII character, and nothing else.
>
>    UTF-8 encodes UCS characters as a varying number of octets, where the
>    number of octets, and the value of each, depend on the integer value
>    assigned to the character in ISO/IEC 10646 (the character number,
>    a.k.a. code point or Unicode scalar value). This encoding form has
>    the following characteristics (all values are in hexadecimal):
>
>    o  Character numbers from U+0000 to U+007F (US-ASCII repertoire)
>       correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
>       consequence is that a plain ASCII string is also a valid UTF-8
>       string.
>
>    o  US-ASCII octet values do not appear otherwise in a UTF-8 encoded
>       character stream.  This provides compatibility with file systems
>       or other software (e.g. the printf() function in C libraries) that
>       parse based on US-ASCII values but are transparent to other
>       values.
>
>    o  Round-trip conversion is easy between UTF-8 and other encoding
>       forms.
>
>    o  The first octet of a multi-octet sequence indicates the number of
>       octets in the sequence.
>
>    o  The octet values C0, C1, FE and FF never appear. If the range of
>       character numbers is restricted to U+0000..U+10FFFF (the UTF-16
>       accessible range), then the octet values F5..FD also never appear.
>
>    o  Character boundaries are easily found from anywhere in an octet
>       stream.
>
>    o  The lexicographic sorting order of UTF-8 strings is the same as if
>       ordered by character numbers.  Of course this is of limited
>       interest since a sort order based on character numbers is not
>       culturally valid.
>
>    o  The Boyer-Moore fast search algorithm can be used with UTF-8 data.
>
>    o  UTF-8 strings can be fairly reliably recognized as such by a
>       simple algorithm, i.e. the probability that a string of characters
>       in any other encoding appears as valid UTF-8 is low, diminishing
>       with increasing string length.
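
To illustrate two of the properties listed above (the first octet announces the sequence length, and character boundaries can be found from anywhere in the stream), here is a minimal Python sketch. The function names char_start and sequence_length are my own and do not come from the draft, and char_start assumes the input is already well-formed UTF-8.

```python
def sequence_length(first: int) -> int:
    """Return the number of octets announced by the first octet of a sequence."""
    if first <= 0x7F:
        return 1                      # 0xxxxxxx: US-ASCII
    if 0xC2 <= first <= 0xDF:
        return 2                      # 110xxxxx (C0 and C1 never appear)
    if 0xE0 <= first <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= first <= 0xF4:
        return 4                      # 11110xxx, limited to U+10FFFF
    raise ValueError(f"octet {first:02X} cannot start a UTF-8 sequence")


def char_start(data: bytes, i: int) -> int:
    """Return the index of the first octet of the character that covers data[i].

    Assumes data is well-formed UTF-8: continuation octets look like
    10xxxxxx, so we step backwards until we leave that range.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i
```

Because no continuation octet can be mistaken for a first octet, at most three backward steps are ever needed for well-formed data.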
>
>    UTF-8 was originally a project of the X/Open Joint
>    Internationalization Group XOJIG with the objective to specify a File
>    System Safe UCS Transformation Format [FSS_UTF] that is compatible
>    with UNIX systems, supporting multilingual text in a single encoding.
>    The original authors were Gary Miller, Greger Leijonhufvud and John
>    Entenmann.  Later, Ken Thompson and Rob Pike did significant work for
>    the formal definition of UTF-8.
>
> 2. Notational conventions
>
>    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
>    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
>    document are to be interpreted as described in [RFC2119].
>
>    UCS characters are designated by the U+HHHH notation, where HHHH is a
>    string of from 4 to 6 hexadecimal digits representing the character
>    number in ISO/IEC 10646.
>
> 3. UTF-8 definition
>
>    UTF-8 is defined by the Unicode Standard [UNICODE]. Descriptions and
>    formulae can also be found in Annex D of ISO/IEC 10646-1 [ISO.10646].
>
>    In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
>    accessible range) are encoded using sequences of 1 to 4 octets. The
>    only octet of a "sequence" of one has the higher-order bit set to 0,
>    the remaining 7 bits being used to encode the character number. In a
>    sequence of n octets, n>1, the initial octet has the n higher-order
>    bits set to 1, followed by a bit set to 0.  The remaining bit(s) of
>    that octet contain bits from the number of the character to be
>    encoded.  The following octet(s) all have the higher-order bit set to
>    1 and the following bit set to 0, leaving 6 bits in each to contain
>    bits from the character to be encoded.
>
>    The table below summarizes the format of these different octet types.
>    The letter x indicates bits available for encoding bits of the
>    character number.
>
>    Char. number range  |        UTF-8 octet sequence
>       (hexadecimal)    |              (binary)
>    --------------------+---------------------------------------------
>    0000 0000-0000 007F | 0xxxxxxx
>    0000 0080-0000 07FF | 110xxxxx 10xxxxxx
>    0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
>    0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>
>    Encoding a character to UTF-8 proceeds as follows:
>
>    1.  Determine the number of octets required from the character number
>        and the first column of the table above.  It is important to note
>        that the rows of the table are mutually exclusive, i.e. there is
>        only one valid way to encode a given character.
>
>    2.  Prepare the high-order bits of the octets as per the second
>        column of the table.
>
>    3.  Fill in the bits marked x from the bits of the character number,
>        expressed in binary. Start by putting the lowest-order bit of the
>        character number in the lowest-order position of the last octet
>        of the sequence, then put the next higher-order bit of the
>        character number in the next higher-order position of that octet,
>        etc.  When the x bits of the last octet are filled in, move on to
>        the next to last octet, then to the preceding one, etc. until all
>        x bits are filled in.
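
The three encoding steps can be sketched in Python as follows. This is an illustration only, not text from the draft: the name utf8_encode is an assumption, and the range and surrogate checks reflect the U+0000..U+10FFFF limit and the prohibition on U+D800..U+DFFF that the draft states elsewhere in this section.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one character number as UTF-8, following the table row by row."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("character number outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points must not be encoded")

    # Step 1: pick the row of the table, i.e. the number of octets.
    if cp <= 0x7F:
        return bytes([cp])            # single octet, 0xxxxxxx
    if cp <= 0x7FF:
        n, lead = 2, 0xC0             # 110xxxxx
    elif cp <= 0xFFFF:
        n, lead = 3, 0xE0             # 1110xxxx
    else:
        n, lead = 4, 0xF0             # 11110xxx

    out = bytearray(n)
    # Steps 2 and 3: start with the last octet and work towards the first,
    # giving each trailing octet the marker bits 10 and six payload bits.
    for i in range(n - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    # Whatever bits remain go into the x positions of the first octet.
    out[0] = lead | cp
    return bytes(out)
```

For instance, utf8_encode(0x2262) yields the octets E2 89 A2, matching the first example in section 7 of the draft.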
>
>    The definition of UTF-8 prohibits encoding character numbers between
>    U+D800 and U+DFFF, which are reserved for use with the UTF-16
>    encoding form (as surrogate pairs) and do not directly represent
>    characters. When encoding in UTF-8 from UTF-16 data, it is necessary
>    to first decode the UTF-16 data to obtain character numbers, which
>    are then encoded in UTF-8 as described above. This contrasts with
>    CESU-8 [CESU-8], which is a UTF-8-like encoding that is not meant for
>    use on the Internet. CESU-8 operates similarly to UTF-8 but encodes
>    the UTF-16 code values (16-bit quantities) instead of the character
>    number (code point). This leads to different results for character
>    numbers above 0xFFFF; the CESU-8 encoding of those characters is NOT
>    valid UTF-8.
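
To make the contrast with CESU-8 concrete, the sketch below encodes a supplementary character the CESU-8 way, by splitting it into UTF-16 surrogates and encoding each 16-bit value with the three-octet pattern. The helper name cesu8_style is hypothetical, and the code is meant only to show why the result is not valid UTF-8; it is not a full CESU-8 implementation.

```python
def cesu8_style(cp: int) -> bytes:
    """Encode a character number above U+FFFF via its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary characters differ from UTF-8")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)         # high (leading) surrogate
    low = 0xDC00 | (v & 0x3FF)        # low (trailing) surrogate
    out = bytearray()
    for unit in (high, low):
        # Three-octet pattern 1110xxxx 10xxxxxx 10xxxxxx applied to a
        # 16-bit UTF-16 code value instead of a character number.
        out += bytes([0xE0 | (unit >> 12),
                      0x80 | ((unit >> 6) & 0x3F),
                      0x80 | (unit & 0x3F)])
    return bytes(out)
```

For U+233B4 this produces ED A1 8C ED BE B4, six octets, where UTF-8 proper uses the four octets F0 A3 8E B4.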
>
>    Decoding a UTF-8 character proceeds as follows:
>
>    1.  Initialize a binary number with all bits set to 0. Up to 21 bits
>        may be needed.
>
>    2.  Determine which bits encode the character number from the number
>        of octets in the sequence and the second column of the table
>        above (the bits marked x).
>
>    3.  Distribute the bits from the sequence to the binary number, first
>        the lower-order bits from the last octet of the sequence and
>        proceeding to the left until no x bits are left. The binary
>        number is now equal to the character number.
>
>    Implementations of the decoding algorithm above MUST protect against
>    decoding invalid sequences.  For instance, a naive implementation may
>    decode the overlong UTF-8 sequence C0 80 into the character U+0000,
>    or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding
>    invalid sequences may have security consequences or cause other
>    problems.  See Security Considerations (Section 10) below.
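
A decoder that guards against the invalid sequences mentioned above might look like the following Python sketch. The name utf8_decode_char and the choice of ValueError are assumptions of mine. For brevity it accumulates bits from the first octet onwards rather than from the last octet as the draft describes; the resulting character number is the same.

```python
from typing import Tuple


def utf8_decode_char(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one character at data[pos]; return (character number, next pos)."""
    first = data[pos]
    if first <= 0x7F:
        return first, pos + 1
    # n = sequence length, cp = payload bits of the first octet,
    # lowest = smallest character number that legitimately needs n octets.
    if 0xC2 <= first <= 0xDF:
        n, cp, lowest = 2, first & 0x1F, 0x80
    elif 0xE0 <= first <= 0xEF:
        n, cp, lowest = 3, first & 0x0F, 0x800
    elif 0xF0 <= first <= 0xF4:
        n, cp, lowest = 4, first & 0x07, 0x10000
    else:
        raise ValueError("invalid first octet")     # covers C0, C1, F5..FF

    tail = data[pos + 1:pos + n]
    if len(tail) != n - 1:
        raise ValueError("truncated sequence")
    for octet in tail:
        if octet & 0xC0 != 0x80:
            raise ValueError("expected a 10xxxxxx continuation octet")
        cp = (cp << 6) | (octet & 0x3F)

    if cp < lowest:
        raise ValueError("overlong sequence")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate encoded in UTF-8")
    if cp > 0x10FFFF:
        raise ValueError("character number above U+10FFFF")
    return cp, pos + n
```

With these checks, both C0 80 and ED A1 8C ED BE B4 are rejected instead of being decoded to U+0000 and U+233B4.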
>
> 4. Syntax of UTF-8 Byte Sequences
>
>    A UTF-8 string is a sequence of octets representing a sequence of UCS
>    characters. An octet sequence is valid UTF-8 only if it matches the
>    following syntax, which is derived from the rules for encoding UTF-8
>    and is expressed in the ABNF of [RFC2234].
>
>    UTF8-octets = *( UTF8-char )
>    UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
>    UTF8-1      = %x00-7F
>    UTF8-2      = %xC2-DF UTF8-tail
>    UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
>                  %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
>    UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
>                  %xF4 %x80-8F 2( UTF8-tail )
>    UTF8-tail   = %x80-BF
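
The ABNF above translates almost mechanically into a regular expression over octets. The following Python sketch does so with the standard re module; the names UTF8_CHAR, UTF8_OCTETS and is_valid_utf8 are my own, chosen to mirror the rule names.

```python
import re

# One alternative per branch of the UTF8-char rule; [\x80-\xBF] is UTF8-tail.
UTF8_CHAR = (
    rb"[\x00-\x7F]"                      # UTF8-1
    rb"|[\xC2-\xDF][\x80-\xBF]"          # UTF8-2
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"      # UTF8-3, excludes overlong forms
    rb"|[\xE1-\xEC][\x80-\xBF]{2}"
    rb"|\xED[\x80-\x9F][\x80-\xBF]"      # excludes U+D800..U+DFFF
    rb"|[\xEE-\xEF][\x80-\xBF]{2}"
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"   # UTF8-4, excludes overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"   # stops at U+10FFFF
)

# UTF8-octets = *( UTF8-char )
UTF8_OCTETS = re.compile(rb"(?:" + UTF8_CHAR + rb")*")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data matches the UTF8-octets rule in its entirety."""
    return UTF8_OCTETS.fullmatch(data) is not None
```

The alternatives are distinguished by their first octet, so the pattern never needs to backtrack between branches.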
>
> 5. Versions of the standards
>
>    ISO/IEC 10646 is updated from time to time by publication of
>    amendments and additional parts; similarly, new versions of the
>    Unicode standard are published over time. Each new version obsoletes
>    and replaces the previous one, but implementations, and more
>    significantly data, are not updated instantly.
>
>    In general, the changes amount to adding new characters, which does
>    not pose particular problems with old data.  In 1996, Amendment 5 to
>    the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
>    the Korean Hangul block, thereby making any previous data containing
>    Hangul characters invalid under the new version.  Unicode 2.0 has the
>    same difference from Unicode 1.1. The justification for allowing such
>    an incompatible change was that there were no major implementations
>    and no significant amounts of data containing Hangul.  The incident
>    has been dubbed the "Korean mess", and the relevant committees have
>    pledged to never, ever again make such an incompatible change (see
>    Unicode Consortium Policies [1]).
>
>    New versions, and in particular any incompatible changes, have
>    consequences regarding MIME charset labels, to be discussed in MIME
>    registration (Section 8).
>
> 6. Byte order mark (BOM)
>
>    The UCS character U+FEFF "ZERO WIDTH NO-BREAK SPACE" is also known
>    informally as "BYTE ORDER MARK" (abbreviated "BOM"). This character
>    can be used as a genuine "ZERO WIDTH NO-BREAK SPACE" within text, but
>    the BOM name hints at a second possible usage of the character:  to
>    prepend a U+FEFF character to a stream of UCS characters as a
>    "signature". A receiver of such a serialized stream may then use the
>    initial character as a hint that the stream consists of UCS
>    characters and also to recognize which UCS encoding is involved and,
>    with encodings having a multi-octet encoding unit, as a way to
>    recognize the serialization order of the octets.  UTF-8 having a
>    single-octet encoding unit, this last function is useless and the BOM
>    will always appear as the octet sequence EF BB BF.
>
>    It is important to understand that the character U+FEFF appearing at
>    any position other than the beginning of a stream MUST be interpreted
>    with the semantics for the zero-width non-breaking space, and MUST
>    NOT be interpreted as a signature. When interpreted as a signature,
>    the Unicode standard suggests that an initial U+FEFF character may be
>    stripped before processing the text. Such stripping is necessary in
>    some cases (e.g. when concatenating two strings, because otherwise
>    the resulting string may contain an unintended "ZERO WIDTH NO-BREAK
>    SPACE" at the connection point), but might affect an external process
>    at a different layer (such as a digital signature or a count of the
>    characters) that is relying on the presence of all characters in the
>    stream.  It is therefore RECOMMENDED to avoid stripping an initial
>    U+FEFF interpreted as a signature without a good reason, to ignore it
>    instead of stripping it when appropriate (such as for display) and to
>    strip it only when really necessary.
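
One way to "ignore instead of strip" is to compute where the text proper starts and leave the stored octets untouched. The Python sketch below does this; UTF8_SIGNATURE and text_offset are hypothetical names, and the sketch assumes the surrounding protocol allows an initial U+FEFF to be treated as a signature.

```python
UTF8_SIGNATURE = b"\xEF\xBB\xBF"      # U+FEFF encoded in UTF-8


def text_offset(data: bytes) -> int:
    """Return the index at which the text starts, skipping an initial signature.

    Only the very first position is examined; a U+FEFF anywhere else is a
    ZERO WIDTH NO-BREAK SPACE and is left alone.
    """
    return len(UTF8_SIGNATURE) if data.startswith(UTF8_SIGNATURE) else 0
```

A caller can then display data[text_offset(data):].decode("utf-8") while any digital signature or octet count is still computed over the unmodified data.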
>
>    U+FEFF in the first position of a stream MAY be interpreted as a
>    zero-width non-breaking space, and is not always a signature. In an
>    attempt at diminishing this uncertainty, Unicode 3.2 adds a new
>    character, U+2060 "WORD JOINER", with exactly the same semantics and
>    usage as U+FEFF except for the signature function, and strongly
>    recommends its exclusive use for expressing word-joining semantics.
>    Eventually, following this recommendation will make it all but
>    certain that any initial U+FEFF is a signature, not an intended "ZERO
>    WIDTH NO-BREAK SPACE".
>
>    In the meantime, the uncertainty unfortunately remains and may affect
>    Internet protocols. Protocol specifications MAY restrict usage of
>    U+FEFF as a signature in order to reduce or eliminate the potential
>    ill effects of this uncertainty. In the interest of striking a
>    balance between the advantages (reduction of uncertainty) and
>    drawbacks (loss of the signature function) of such restrictions, it
>    is useful to distinguish a few cases:
>
>    o  A protocol SHOULD forbid use of U+FEFF as a signature for those
>       textual protocol elements that the protocol mandates to be always
>       UTF-8, the signature function being totally useless in those
>       cases.
>
>    o  A protocol SHOULD also forbid use of U+FEFF as a signature for
>       those textual protocol elements for which the protocol provides
>       character encoding identification mechanisms, when it is expected
>       that implementations of the protocol will be in a position to
>       always use the mechanisms properly.  This will be the case when
>       the protocol elements are maintained tightly under the control of
>       the implementation from the time of their creation to the time of
>       their (properly labeled) transmission.
>
>    o  A protocol SHOULD NOT forbid use of U+FEFF as a signature for
>       those textual protocol elements for which the protocol does not
>       provide character encoding identification mechanisms, when a ban
>       would be unenforceable, or when it is expected that
>       implementations of the protocol will not be in a position to
>       always use the mechanisms properly.  The latter two cases are
>       likely to occur with larger protocol elements such as MIME
>       entities, especially when implementations of the protocol will
>       obtain such entities from file systems, from protocols that do not
>       have encoding identification mechanisms for payloads (such as FTP)
>       or from other protocols that do not guarantee proper
>       identification of character encoding (such as HTTP).
>
>    When a protocol forbids use of U+FEFF as a signature for a certain
>    protocol element, then any initial U+FEFF in that protocol element
>    MUST be interpreted as a "ZERO WIDTH NO-BREAK SPACE". When a protocol
>    does NOT forbid use of U+FEFF as a signature for a certain protocol
>    element, then implementations SHOULD be prepared to handle a
>    signature in that element and react appropriately: using the
>    signature to identify the character encoding as necessary and
>    stripping or ignoring the signature as appropriate.
>
> 7. Examples
>
>    The character sequence U+0041 U+2262 U+0391 U+002E "A<NOT IDENTICAL
>    TO><ALPHA>." is encoded in UTF-8 as follows:
>
>        --+--------+-----+--
>        41 E2 89 A2 CE 91 2E
>        --+--------+-----+--
>
>    The character sequence U+D55C U+AD6D U+C5B4 (Korean "hangugeo",
>    meaning "the Korean language") is encoded in UTF-8 as follows:
>
>        --------+--------+--------
>        ED 95 9C EA B5 AD EC 96 B4
>        --------+--------+--------
>
>    The character sequence U+65E5 U+672C U+8A9E (Japanese "nihongo",
>    meaning "the Japanese language") is encoded in UTF-8 as follows:
>
>        --------+--------+--------
>        E6 97 A5 E6 9C AC E8 AA 9E
>        --------+--------+--------
>
>    The character U+233B4 (a Chinese character meaning 'stump of tree'),
>    prepended with a UTF-8 BOM, is encoded in UTF-8 as follows:
>
>        --------+-----------
>        EF BB BF F0 A3 8E B4
>        --------+-----------
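
The four examples above can be checked against an existing implementation. The short Python sketch below uses the built-in "utf-8" codec; the names EXAMPLES and to_hex are mine, and the expected octets are copied from the draft.

```python
# (character sequence, octets as given in the draft)
EXAMPLES = [
    ("A\u2262\u0391.", "41 E2 89 A2 CE 91 2E"),
    ("\uD55C\uAD6D\uC5B4", "ED 95 9C EA B5 AD EC 96 B4"),
    ("\u65E5\u672C\u8A9E", "E6 97 A5 E6 9C AC E8 AA 9E"),
    ("\uFEFF\U000233B4", "EF BB BF F0 A3 8E B4"),
]


def to_hex(text: str) -> str:
    """Encode text as UTF-8 and format the octets as upper-case hex pairs."""
    return " ".join(f"{octet:02X}" for octet in text.encode("utf-8"))


for text, expected in EXAMPLES:
    assert to_hex(text) == expected, (text, to_hex(text), expected)
```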
>
> 8. MIME registration
>
>    This memo serves as the basis for registration of the MIME charset
>    parameter for UTF-8, according to [RFC2978].  The charset parameter
>    value is "UTF-8".  This string labels media types containing text
>    consisting of characters from the repertoire of ISO/IEC 10646
>    including all amendments at least up to amendment 5 of the 1993
>    edition (Korean block), encoded to a sequence of octets using the
>    encoding scheme outlined above.  UTF-8 is suitable for use in MIME
>    content types under the "text" top-level type.
>
>    It is noteworthy that the label "UTF-8" does not contain a version
>    identification, referring generically to ISO/IEC 10646.  This is
>    intentional, the rationale being as follows:
>
>    A MIME charset label is designed to give just the information needed
>    to interpret a sequence of bytes received on the wire into a sequence
>    of characters, nothing more (see [RFC2045], section 2.2). As long as
>    a character set standard does not change incompatibly, version
>    numbers serve no purpose, because one gains nothing by learning from
>    the tag that newly assigned characters may be received that one
>    doesn't know about.  The tag itself doesn't teach anything about the
>    new characters, which are going to be received anyway.
>
>    Hence, as long as the standards evolve compatibly, the apparent
>    advantage of having labels that identify the versions is only that,
>    apparent.  But there is a disadvantage to such version-dependent
>    labels: when an older application receives data accompanied by a
>    newer, unknown label, it may fail to recognize the label and be
>    completely unable to deal with the data, whereas a generic, known
>    label would have triggered mostly correct processing of the data,
>    which may well not contain any new characters.
>
>    Now the "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible
>    change, in principle contradicting the appropriateness of a version
>    independent MIME charset label as described above.  But the
>    compatibility problem can only appear with data containing Korean
>    Hangul characters encoded according to Unicode 1.1 (or equivalently
>    ISO/IEC 10646 before amendment 5), and there is arguably no such data
>    to worry about, this being the very reason the incompatible change
>    was deemed acceptable.
>
>    In practice, then, a version-independent label is warranted, provided
>    the label is understood to refer to all versions after Amendment 5,
>    and provided no incompatible change actually occurs.  Should
>    incompatible changes occur in a later version of ISO/IEC 10646, the
>    MIME charset label defined here will stay aligned with the previous
>    version until and unless the IETF specifically decides otherwise.
>
> 9. IANA Considerations
>
>    The entry for UTF-8 in the IANA charset registry should be updated to
>    point to this memo.
>
> 10. Security Considerations
>
>    Implementers of UTF-8 need to consider the security aspects of how
>    they handle illegal UTF-8 sequences.  It is conceivable that in some
>    circumstances an attacker would be able to exploit an incautious
>    UTF-8 parser by sending it an octet sequence that is not permitted by
>    the UTF-8 syntax.
>
>    A particularly subtle form of this attack can be carried out against
>    a parser which performs security-critical validity checks against the
>    UTF-8 encoded form of its input, but interprets certain illegal octet
>    sequences as characters.  For example, a parser might prohibit the
>    NUL character when encoded as the single-octet sequence 00, but
>    erroneously allow the illegal two-octet sequence C0 80 and interpret
>    it as a NUL character.  Another example might be a parser which
>    prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
>    illegal octet sequence 2F C0 AE 2E 2F. This last exploit has actually
>    been used in a widespread virus attacking Web servers in 2001; the
>    security threat is thus very real.
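
The "/../" case can be reproduced in a few lines. In the Python sketch below, the function name is_safe_path is hypothetical; the point is that the check runs on text obtained from a strict decoder, so an illegal sequence is refused before any comparison takes place. It relies on Python's built-in "utf-8" codec, which raises UnicodeDecodeError on octets such as C0.

```python
def is_safe_path(raw: bytes) -> bool:
    """Return True only if raw is valid UTF-8 and contains no "/../" segment."""
    try:
        text = raw.decode("utf-8")    # strict: illegal sequences raise
    except UnicodeDecodeError:
        return False                  # never interpret illegal octets
    return "/../" not in text


attack = bytes.fromhex("2fc0ae2e2f")  # 2F C0 AE 2E 2F from the draft
# An octet-level filter alone is fooled: the forbidden octets are absent.
octet_filter_passes = b"/../" not in attack
```

Here octet_filter_passes is True, which is exactly the weakness described above, while is_safe_path(attack) returns False.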
>
>    Another security issue occurs when encoding to UTF-8: the ISO/IEC
>    10646 description of UTF-8 allows encoding character numbers up to
>    U+7FFFFFFF, yielding sequences of up to 6 bytes.  There is therefore
>    a risk of buffer overflow if the range of character numbers is not
>    explicitly limited to U+10FFFF or if buffer sizing doesn't take into
>    account the possibility of 5- and 6-byte sequences.
>
> 11. Acknowledgements
>
>    The following have participated in the drafting and discussion of
>    this memo: James E. Agenbroad, Harald Alvestrand, Andries Brouwer,
>    Mark Davis, Martin J. Duerst, Patrik Faltstrom, Ned Freed, David
>    Goldsmith, Tony Hansen, Edwin F. Hart, Paul Hoffman, David Hopwood,
>    Simon Josefsson, Kent Karlsson, Dan Kohn, Markus Kuhn, Michael Kung,
>    Alain LaBonte, Ira McDonald, Alexey Melnikov, MURATA Makoto, John
>    Gardiner Myers, Dan Oscarsson, Roozbeh Pournader, Murray Sargent,
>    Markus Scherer, Keld Simonsen, Arnold Winkler, Kenneth Whistler and
>    Misha Wolf.
>
> 12. Changes from RFC 2279
>
>    o  Restricted the range of characters to 0000-10FFFF (the UTF-16
>       accessible range).
>
>    o  Made Unicode the source of the normative definition of UTF-8,
>       keeping ISO/IEC 10646 as the reference for characters.
>
>    o  Straightened out terminology. UTF-8 now described in terms of an
>       encoding form of the character number. UCS-2 and UCS-4 almost
>       disappeared.
>
>    o  Turned the note warning against decoding of invalid sequences into
>       a normative MUST NOT.
>
>    o  Added a new section about the UTF-8 BOM, with advice for
>       protocols.
>
>    o  Removed suggested UNICODE-1-1-UTF-8 MIME charset registration.
>
>    o  Added an ABNF syntax for valid UTF-8 octet sequences.
>
> Normative references
>
>    [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
>               Requirement Levels", BCP 14, RFC 2119, March 1997.
>
>    [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
>               Specifications: ABNF", RFC 2234, November 1997.
>
>    [ISO.10646]
>               International Organization for Standardization,
>               "Information Technology - Universal Multiple-octet coded
>               Character Set (UCS)", ISO/IEC Standard 10646,  comprised
>               of ISO/IEC 10646-1:2000, "Information technology --
>               Universal Multiple-Octet Coded Character Set (UCS) -- Part
>               1: Architecture and Basic Multilingual Plane", ISO/IEC
>               10646-2:2001, "Information technology -- Universal
>               Multiple-Octet Coded Character Set (UCS) -- Part 2:
>               Supplementary Planes" and ISO/IEC 10646-1:2000/Amd 1:2002,
>               "Mathematical symbols and other characters".
>
>    [UNICODE]  The Unicode Consortium, "The Unicode Standard -- Version
>               3.2",  defined by The Unicode Standard, Version 3.0
>               (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
>               as amended by the Unicode Standard Annex #27: Unicode 3.1
>               (see http://www.unicode.org/reports/tr27) and by the
>               Unicode Standard Annex #28: Unicode 3.2 (see
>               http://www.unicode.org/reports/tr28), March 2002,
>               <http://www.unicode.org/unicode/standard/versions/
>               enumeratedversions.html#Unicode_3_2_0>.
>
> Informative references
>
>    [CESU-8]   Phipps, T., "Compatibility Encoding Scheme for UTF-16:
>               8-Bit (CESU-8)", UTR 26, April 2002,
>               <http://www.unicode.org/unicode/reports/tr26/>.
>
>    [FSS_UTF]  X/Open Company Ltd., "X/Open CAE Specification C501 --
>               File System Safe UCS Transformation Format (FSS_UTF)",
>               ISBN 1-85912-082-2, April 1995.
>
>    [RFC2045]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
>               Extensions (MIME) Part One: Format of Internet Message
>               Bodies", RFC 2045, November 1996.
>
>    [RFC2978]  Freed, N. and J. Postel, "IANA Charset Registration
>               Procedures", BCP 19, RFC 2978, October 2000.
>
>    [US-ASCII]
>               American National Standards Institute, "Coded Character
>               Set - 7-bit American Standard Code for Information
>               Interchange", ANSI X3.4, 1986.
>
> URIs
>
>    [1]  <http://www.unicode.org/unicode/standard/policies.html>
>
>
> Author's Address
>
>    Francois Yergeau
>    Alis Technologies
>    100, boul. Alexis-Nihon, bureau 600
>    Montreal, QC  H4M 2P2
>    Canada
>
>    Phone: +1 514 747 2547
>    Fax:   +1 514 747 2561
>    EMail: fyergeau@alis.com
>
>
> Intellectual Property Statement
>
>    The IETF takes no position regarding the validity or scope of any
>    intellectual property or other rights that might be claimed to
>    pertain to the implementation or use of the technology described in
>    this document or the extent to which any license under such rights
>    might or might not be available; neither does it represent that it
>    has made any effort to identify any such rights. Information on the
>    IETF's procedures with respect to rights in standards-track and
>    standards-related documentation can be found in BCP-11. Copies of
>    claims of rights made available for publication and any assurances of
>    licenses to be made available, or the result of an attempt made to
>    obtain a general license or permission for the use of such
>    proprietary rights by implementors or users of this specification can
>    be obtained from the IETF Secretariat.
>
>    The IETF invites any interested party to bring to its attention any
>    copyrights, patents or patent applications, or other proprietary
>    rights which may cover technology that may be required to practice
>    this standard. Please address the information to the IETF Executive
>    Director.
>
>
> Full Copyright Statement
>
>    Copyright (C) The Internet Society (2003). All Rights Reserved.
>
>    This document and translations of it may be copied and furnished to
>    others, and derivative works that comment on or otherwise explain it
>    or assist in its implementation may be prepared, copied, published
>    and distributed, in whole or in part, without restriction of any
>    kind, provided that the above copyright notice and this paragraph are
>    included on all such copies and derivative works. However, this
>    document itself may not be modified in any way, such as by removing
>    the copyright notice or references to the Internet Society or other
>    Internet organizations, except as needed for the purpose of
>    developing Internet standards in which case the procedures for
>    copyrights defined in the Internet Standards process must be
>    followed, or as required to translate it into languages other than
>    English.
>
>    The limited permissions granted above are perpetual and will not be
>    revoked by the Internet Society or its successors or assignees.
>
>    This document and the information contained herein is provided on an
>    "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
>    TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
>    BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
>    HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
>    MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
>
>
> Acknowledgement
>
>    Funding for the RFC Editor function is currently provided by the
>    Internet Society.
>
Received on Thursday, 6 March 2003 12:17:37 UTC