Re: Fwd: Re: HRRIs, IRIs, etc from John Cowan on 2007-06-26 (public-i18n-core@w3.org from April to June 2007)

From: John Cowan <cowan@ccil.org>
Date: Tue, 26 Jun 2007 11:42:43 -0400
To: Richard Tobin <richard@inf.ed.ac.uk>
Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, "Grosso, Paul" <pgrosso@ptc.com>, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-xml-core-wg@w3.org, public-i18n-core@w3.org
Message-ID: <20070626154243.GD22864@mercury.ccil.org>

Richard Tobin scripsit:

> I see XML discourages FDD*, but the ucschar excludes both FDD* and
> FDE*.  Does anyone know the reason for this discrepancy?  FDE* seem to
> be also "not a character".

Almost certainly a blunder on my part.  The correct range of
non-characters is FDD0-FDEF.
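
A minimal sketch of that range check, in Python. The helper name
is_noncharacter is hypothetical, and the sketch assumes the full Unicode
noncharacter set is wanted, so it also covers the last two code points
of every plane (FFFE, FFFF, 1FFFE, 1FFFF, and so on):

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is a Unicode noncharacter code point."""
    # The contiguous block discussed above.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```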

> ucschar also excludes E0***, which seem to be "tags" - what does that
> mean?

E0000-E007F are a clone of ASCII, dedicated to encoding language tags in
plain text in a context where language tagging is considered essential but
full markup too complex or expensive.  Thus the language tag "en" would
be encoded as E0065 E006E.  These characters were born deprecated, and
served to stave off the attempt of a certain IETF WG to abuse otherwise
reserved UTF-8 forms to the same purpose.
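
The "clone of ASCII" mapping can be sketched in Python as an offset of
E0000 added to each ASCII code point. The names TAG_BASE and
to_tag_chars are illustrative only, and the sketch covers just the
letters of the tag, not any introducer character a real tagging scheme
might require:

```python
TAG_BASE = 0xE0000  # first code point of the ASCII clone

def to_tag_chars(ascii_text: str) -> str:
    """Map each ASCII character to its counterpart in E0000-E007F."""
    if not ascii_text.isascii():
        raise ValueError("only ASCII can be mapped to tag characters")
    return "".join(chr(TAG_BASE + ord(c)) for c in ascii_text)

encoded = to_tag_chars("en")
print([f"{ord(c):04X}" for c in encoded])  # ['E0065', 'E006E']
```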

E0100-E01EF are variation selectors, attached to ordinary characters to
specify variant forms of characters that are important or unpredictable
in certain contexts, but in other contexts are equivalent to the
forms without variation selectors.  E01F0-E0FFF are reserved for other
"default-ignorable" characters; processes that do not understand these
characters ought to ignore them (and not render them as boxes, etc.).

> ucschar also exclude FFF*, but XML makes no mention of them, except
> of course FFFE and FFFF which aren't allowed in XML at all.

FFF0-FFF8 are currently unassigned.  FFF9-FFFB are used to do ruby in
plain text, FFFC is a placeholder for a non-character object, and FFFD
is used to replace an incoming character whose value is unknown or has
no Unicode equivalent.

We should issue an erratum for XML 1.0/1.1 adding FDE0-FDEF, E0000-E007F,
and FFF0-FFFD to the discouraged characters list, as all of them have
better equivalents in markup.  Likewise, the characters 0340, 0341,
17A3, 17D3, and 206A-206F should be discouraged, as they are in Unicode.
E0100-E01EF are still useful in XML character content, though probably
not in *RIs.
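
For illustration only, the additions proposed above could be checked
with a small Python table.  The names PROPOSED_ADDITIONS and
is_proposed_discouraged are hypothetical, and the table lists only the
code points named in this message, not the characters XML already
discourages:

```python
# (first, last) inclusive ranges taken from the proposal above.
PROPOSED_ADDITIONS = [
    (0x0340, 0x0341),
    (0x17A3, 0x17A3),
    (0x17D3, 0x17D3),
    (0x206A, 0x206F),
    (0xFDE0, 0xFDEF),
    (0xFFF0, 0xFFFD),
    (0xE0000, 0xE007F),
]

def is_proposed_discouraged(cp: int) -> bool:
    """Return True if cp falls in one of the proposed ranges."""
    return any(first <= cp <= last for first, last in PROPOSED_ADDITIONS)

def find_proposed_discouraged(text: str) -> list[str]:
    """List the matching characters in text as hex code points."""
    return [f"{ord(c):04X}" for c in text if is_proposed_discouraged(ord(c))]
```

For example, find_proposed_discouraged("a\ufffdb") reports only FFFD.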

-- 
What has four pairs of pants, lives             John Cowan
in Philadelphia, and it never rains             http://www.ccil.org/~cowan
but it pours?                                   cowan@ccil.org
        --Rufus T. Firefly

Received on Tuesday, 26 June 2007 15:43:08 UTC