- From: John Cowan <cowan@ccil.org>
- Date: Tue, 26 Jun 2007 11:42:43 -0400
- To: Richard Tobin <richard@inf.ed.ac.uk>
- Cc: Martin Duerst <duerst@it.aoyama.ac.jp>, "Grosso, Paul" <pgrosso@ptc.com>, public-iri@w3.org, Richard Ishida <ishida@w3.org>, Felix Sasaki <fsasaki@w3.org>, www-xml-linking-comments@w3.org, public-xml-core-wg@w3.org, public-i18n-core@w3.org
Richard Tobin scripsit:

> I see XML discourages FDD*, but the ucschar excludes both FDD* and
> FDE*. Does anyone know the reason for this discrepancy? FDE* seem to
> be also "not a character".

Almost certainly a blunder on my part. The correct range of
non-characters is FDD0-FDEF.

> ucschar also excludes E0***, which seem to be "tags" - what does that
> mean?

E0000-E007F are a clone of ASCII, dedicated to encoding language tags
in plain text in a context where language tagging is considered
essential but full markup too complex or expensive. Thus the language
tag "en" would be encoded as E0065 E006E. These characters were born
deprecated, and served to stave off the attempt of a certain IETF WG
to abuse otherwise reserved UTF-8 forms to the same purpose.

E0100-E01EF are variant selectors, attached to ordinary characters to
specify variant forms of characters that are important or unpredictable
in certain contexts, but in other contexts are equivalent to the forms
without variant selectors.

E01F0-E0FFF are reserved for other "default-ignorable" characters;
processes that do not understand these characters ought to ignore them
(and not render them as boxes, etc.).

> ucschar also excludes FFF*, but XML makes no mention of them, except
> of course FFFE and FFFF which aren't allowed in XML at all.

FFF0-FFF8 are currently unassigned. FFF9-FFFB are used to do ruby in
plain text, FFFC is a placeholder for a non-character object, and FFFD
is used to replace an incoming character whose value is unknown or has
no Unicode equivalent.

We should issue an erratum for XML 1.0/1.1 adding FDE0-FDEF,
E0000-E007F, and FFF0-FFFD to the discouraged characters list, as all
of them have better equivalents in markup. Likewise, the characters
0340, 0341, 17A3, 17D3, and 206A-206F should be discouraged, as they
are in Unicode.

E0100-E01EF are still useful in XML character content, though probably
not in *RIs.
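For anyone who wants to experiment with the ranges named in the erratum proposal, here is a minimal sketch in Python. The function and constant names are hypothetical, chosen only for illustration; the code points are the ones listed in the proposal paragraph, and the tag encoding simply adds E0000 to each ASCII code point, as in the "en" example. It is not a normative statement of what XML or the IRI spec requires.

```python
# Code points proposed for addition to the XML "discouraged" list,
# taken from the erratum proposal in this message (illustrative names).
PROPOSED_DISCOURAGED_RANGES = [
    (0xFDE0, 0xFDEF),    # second half of the FDD0-FDEF non-characters
    (0xE0000, 0xE007F),  # tag characters (the ASCII clone)
    (0xFFF0, 0xFFFD),    # specials block up to the replacement character
    (0x206A, 0x206F),    # characters discouraged in Unicode
]
PROPOSED_DISCOURAGED_SINGLES = {0x0340, 0x0341, 0x17A3, 0x17D3}


def is_proposed_discouraged(cp: int) -> bool:
    """Return True if cp is one of the code points the proposal would add."""
    if cp in PROPOSED_DISCOURAGED_SINGLES:
        return True
    return any(lo <= cp <= hi for lo, hi in PROPOSED_DISCOURAGED_RANGES)


def to_tag_characters(tag: str) -> str:
    """Map each ASCII character to its clone in E0000-E007F."""
    if not tag.isascii():
        raise ValueError("tag characters only mirror ASCII")
    return "".join(chr(0xE0000 + ord(c)) for c in tag)


if __name__ == "__main__":
    encoded = to_tag_characters("en")
    print(" ".join(f"{ord(c):04X}" for c in encoded))  # E0065 E006E
    print(is_proposed_discouraged(0xFFFC))             # True
```

A lookup table of inclusive ranges keeps the list easy to compare against the prose; a production check would more likely be generated from the Unicode Character Database than typed in by hand.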
-- 
What has four pairs of pants, lives     John Cowan
in Philadelphia, and it never rains     http://www.ccil.org/~cowan
but it pours?                           cowan@ccil.org
        --Rufus T. Firefly
Received on Tuesday, 26 June 2007 15:43:15 UTC