Re: <character> property datatype from Tony Graham on 2002-07-31 (www-xsl-fo@w3.org from July 2002)

From: Tony Graham <Tony.Graham@Sun.COM>
Date: Wed, 31 Jul 2002 17:27:31 +0100
To: "'Www-Xsl-Fo'" <www-xsl-fo@w3.org>
Message-ID: <15688.4083.898800.354093@tenso.ireland.sun.com>

Use a Char.

Do not use 'U+xxxx'.

Arved Sandstrom wrote at 29 Jul 2002 19:15:09 -0300:
 > A number of properties are typed as having <character> values: "character",
 > "grouping-separator", and "hyphenation-character".
 > 
 > <character> is described as being a single Unicode character, in Section
 > 5.11.
 > 
 > However, the property description for fo:character embellishes this rather
 > terse description, and says that a <character> specifies "the code point of
 > the Unicode character to be presented". To me this pretty clearly means a
 > specification of form U+xxxx.

Pick your Unicode version.  Prior to Unicode 3.1, 'U+xxxx' was a
'Unicode value.'  Today, "[i]n running text, an individual Unicode
code point can be expressed as U+n, where n is from four to six
hexadecimal digits..."

A 'character' property value is hardly running text.

On a different tack, is U+FB01, LATIN SMALL LIGATURE FI, one character
or two?  Either way, it is one code point.
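
To make the ligature question concrete, here is a small sketch using
Python's standard unicodedata module.  The language and the variable
names are only for illustration; nothing here comes from the XSL spec.
It shows that U+FB01 is a single code point that compatibility
normalization (NFKC) expands into two.

```python
import unicodedata

ligature = "\ufb01"  # U+FB01 LATIN SMALL LIGATURE FI

# Python's len() counts code points, so the ligature counts as one.
code_point_count = len(ligature)

# Compatibility normalization replaces the ligature with "f" + "i".
expanded = unicodedata.normalize("NFKC", ligature)

print(unicodedata.name(ligature), "is", code_point_count, "code point")
print("NFKC gives", [f"U+{ord(c):04X}" for c in expanded])
```

Whether a formatter should treat the value before or after such
normalization is exactly the "one character or two" question.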

See Section 3.4, Strings, of the Character Model for the World Wide
Web 1.0 [1].  A character is represented by a code point in Unicode,
but it ends up as one or more code units in your document.
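
A rough sketch of the code point versus code unit distinction, in
Python.  The helper name code_units and the sample characters are my
own choices for illustration, not anything from the Character Model
document.

```python
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    """Return how many code units one character occupies in `encoding`.

    `unit_bytes` is the size of one code unit: 1 for UTF-8, 2 for UTF-16.
    """
    return len(ch.encode(encoding)) // unit_bytes


euro = "\u20ac"        # EURO SIGN, inside the Basic Multilingual Plane
clef = "\U0001D11E"    # MUSICAL SYMBOL G CLEF, outside the BMP

for ch in (euro, clef):
    print(
        f"U+{ord(ch):04X}:",
        code_units(ch, "utf-8", 1), "UTF-8 code unit(s),",
        code_units(ch, "utf-16-be", 2), "UTF-16 code unit(s)",
    )
```

Each sample is one code point, yet the number of code units that land
in the document depends entirely on the document's encoding.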

 > With the other 2 properties this distinction is not made; we are left with
 > the idea that a Unicode character, as opposed to a codepoint (or code value;

i.e., one or more code units of n bits each.

 > the integer in other words), will be used. That is, if someone wished to use
 > a 3-octet UTF-8 encoded value that would seemingly be OK.

If the document is encoded in UTF-8.

 > "grouping-separator" is defined wrt XSLT, where it is a single instance of
 > the XML 'Char' production, that is, a Unicode character, either UTF-8 or
 > UTF-16 encoded (at a minimum), or specified as #x9 | #xA | #xD |

Same encoding as the rest of the document.

 > [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
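
For illustration only, the Char production quoted above can be checked
with a few range comparisons.  This is a sketch in Python; the function
name is_xml_char is hypothetical and not taken from any FO
implementation.

```python
def is_xml_char(value: str) -> bool:
    """True if `value` is exactly one code point matching XML 1.0 'Char'."""
    if len(value) != 1:
        # Rejects the empty string and spelled-out forms such as 'U+002C'.
        return False
    cp = ord(value)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


for candidate in (",", "\U0001D11E", "U+002C", "\x00"):
    print(repr(candidate), is_xml_char(candidate))
```

A literal comma and a supplementary-plane character both pass, while
the six-character string 'U+002C' does not, since it is not a single
Char.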

If you can't represent it in the current encoding, use a numeric
character reference.
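
A minimal sketch of that fallback, assuming Python and its standard
xml.etree.ElementTree parser.  The helper names as_char_ref and
fo_character are made up for this example, and the sketch ignores
characters such as '<', '&' and '"' that would need escaping anyway.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"


def as_char_ref(ch: str) -> str:
    """Spell one character as a hexadecimal numeric character reference."""
    return f"&#x{ord(ch):X};"


def fo_character(ch: str, encoding: str) -> bytes:
    """Serialize an fo:character element, using a reference if needed."""
    try:
        ch.encode(encoding)
        value = ch                 # representable: write it literally
    except UnicodeEncodeError:
        value = as_char_ref(ch)    # not representable: fall back
    xml = f'<fo:character xmlns:fo="{FO_NS}" character="{value}"/>'
    return xml.encode(encoding)


euro = "\u20ac"
ascii_doc = fo_character(euro, "ascii")
utf8_doc = fo_character(euro, "utf-8")

# Either way, the parser hands back the same single character.
print(ET.fromstring(ascii_doc).get("character") == euro)
print(ET.fromstring(utf8_doc).get("character") == euro)
```

The property value the formatter sees is the same Char in both cases;
only the bytes in the file differ.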

This does raise an interesting question: what happens if I need a
base character plus combining characters to make my grouping
separator?  It seems you can only use precomposed characters for a
grouping separator.
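
One way to see the limitation, sketched in Python with the standard
unicodedata module.  The function name fits_in_one_char and the sample
sequences are illustrative assumptions, not part of XSLT or XSL.

```python
import unicodedata


def fits_in_one_char(sequence: str) -> bool:
    """True if `sequence` composes (NFC) to a single code point."""
    return len(unicodedata.normalize("NFC", sequence)) == 1


# "e" + COMBINING ACUTE ACCENT has a precomposed form, U+00E9.
with_precomposed = "e\u0301"

# "x" + COMBINING ACUTE ACCENT has no precomposed form.
without_precomposed = "x\u0301"

print(fits_in_one_char(with_precomposed))
print(fits_in_one_char(without_precomposed))
```

A sequence of the second kind stays as two code points however it is
normalized, so it cannot be supplied as a single Char.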

 > So our (myself and Eric Bischoff) question is, what have other implementors
 > elected to use?

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@sun.com
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708


[1] http://www.w3.org/TR/charmod/#sec-Strings

Received on Wednesday, 31 July 2002 12:24:29 UTC