[Fwd: Corrigendum #9 clarifies noncharacter usage in Unicode]

FYI, though it so far hasn't made a ripple in XML circles...

---------------------------- Original Message ----------------------------
Subject: Corrigendum #9 clarifies noncharacter usage in Unicode
From:    announcements@unicode.org
Date:    Wed, February 20, 2013 8:49 pm
To:      announcements@unicode.org
--------------------------------------------------------------------------

There has been confusion about whether noncharacters were permitted in
Unicode text. The new Corrigendum #9: Clarification About Noncharacters
<http://www.unicode.org/versions/corrigendum9.html> makes it clear that
noncharacters are permissible even in open interchange, although their
intended semantics may not be interpretable in such contexts. The UTF-8,
UTF-16, UTF-32 & BOM FAQ <http://www.unicode.org/faq/utf_bom.html> has
also been updated for clarity, and other informative text about
noncharacters will be revised over time, including the Core Specification.

Background. There are 66 noncharacters permanently reserved for internal
use, typically used for some sort of control function or sentinel value.
They should be supported by APIs, components, and applications that
handle (i.e., either process or pass through) all Unicode strings, such
as a text editor or string class. Where an application does make
internal use of a noncharacter, it should take some measures to sanitize
input text from unknown sources. The best practice is to replace that
particular noncharacter on input by U+FFFD. (The noncharacter should not
be simply deleted, since that has security problems. For more
information, see Section 3.5 Deletion of Code Points
<http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters> in UTR
#36, Unicode Security Guidelines <http://www.unicode.org/reports/tr36/>.)
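
A minimal Python sketch of the sanitizing step described above, for
illustration only. The names is_noncharacter, sanitize, and SENTINEL are
hypothetical and do not come from the announcement. The sketch assumes
the application uses one noncharacter (here U+FFFF) as an internal
sentinel and replaces only that code point, rather than deleting it.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF (32 code points), plus the last two code points
    # of each of the 17 planes: U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def sanitize(text: str, internal: set[str]) -> str:
    """Replace each internally used noncharacter with U+FFFD.

    `internal` is the (assumed) set of noncharacters this application
    reserves for itself. All other text, including other noncharacters,
    passes through unchanged. Nothing is deleted, so the string length
    in code points is preserved.
    """
    return "".join(REPLACEMENT if ch in internal else ch for ch in text)


# Hypothetical sentinel chosen by the application.
SENTINEL = "\uffff"
cleaned = sanitize("abc" + SENTINEL + "def", {SENTINEL})
```

Replacing rather than deleting keeps the neighbouring characters apart,
which is the concern raised in UTR #36: deletion can join two harmless
fragments into a string that an earlier filter would have rejected.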

http://unicode-inc.blogspot.com/2013/02/corrigendum-9-clarifies-noncharacter.html

Received on Wednesday, 27 February 2013 20:07:04 UTC