Re: [bp-i18n-specdev] Editorial comments on character definitions

Here are the relevant definitions from the Unicode glossary. I generally find these clear and reliable, and so worth relying on for our own needs.

Character https://www.unicode.org/glossary/#character
> (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

Character encoding form https://www.unicode.org/glossary/#character_encoding_form
> Mapping from a character set definition to the actual code units used to represent the data.

Character set https://www.unicode.org/glossary/#character_set
> A collection of elements used to represent textual information.

Code point https://www.unicode.org/glossary/#code_point
> (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
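
To make the "not all code points are assigned" part concrete, here is a rough Python sketch. The helper name `is_code_point` is my own, not a glossary term, and the general categories reported by `unicodedata` depend on the Unicode version bundled with the Python build. I picked U+FFFF because it is a noncharacter, so it should report as unassigned (`Cn`) in any version.

```python
import sys
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode codespace

def is_code_point(value: int) -> bool:
    """Hypothetical helper: True if value lies in the Unicode codespace."""
    return 0 <= value <= MAX_CODE_POINT

# Python's str type covers the same integer range.
same_range = sys.maxunicode == MAX_CODE_POINT

# General categories for three code points:
#   U+0041 is an assigned letter, U+D800 is a surrogate code point,
#   U+FFFF is a noncharacter (never assigned to an encoded character).
categories = {
    "U+0041": unicodedata.category(chr(0x0041)),
    "U+D800": unicodedata.category(chr(0xD800)),
    "U+FFFF": unicodedata.category(chr(0xFFFF)),
}
print(categories)
```

All three values are code points, but only the first is assigned to an encoded character.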

Code unit https://www.unicode.org/glossary/#code_unit
> The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.) 
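
As an illustration of the difference between code points and code units, the sketch below counts both for one short string. The sample string and the helper name `code_units` are just my choices for the example. It uses the BOM-less `-le` codecs so that a byte order mark is not counted as an extra code unit.

```python
# Sample: 'e' + U+0301 COMBINING ACUTE ACCENT + U+1F600 GRINNING FACE
text = "e\u0301\U0001F600"

# One entry per code point in the string.
code_points = [f"U+{ord(ch):04X}" for ch in text]

def code_units(s: str, encoding: str, unit_bytes: int) -> int:
    """Hypothetical helper: number of code units in s for a BOM-less encoding."""
    return len(s.encode(encoding)) // unit_bytes

utf8_units = code_units(text, "utf-8", 1)       # 8-bit code units
utf16_units = code_units(text, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_units(text, "utf-32-le", 4)  # 32-bit code units

print(code_points)                           # ['U+0065', 'U+0301', 'U+1F600']
print(utf8_units, utf16_units, utf32_units)  # 7 4 3
```

The same three code points take 7, 4, or 3 code units depending on the encoding form, which is why a spec needs to say which unit it is counting.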

Extended grapheme cluster https://www.unicode.org/glossary/#extended_grapheme_cluster
> The text between extended grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation." Abbreviated as EGC. (See definition D61 in Section 3.6, Combination.)

Glyph https://www.unicode.org/glossary/#glyph
> (1) An abstract form that represents one or more glyph images. (2) A synonym for glyph image. In displaying Unicode character data, one or more glyphs may be selected to depict a particular character. These glyphs are selected by a rendering engine during composition and layout processing. (See also character.)

Glyph image https://www.unicode.org/glossary/#glyph_image
> The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface. 

Grapheme https://www.unicode.org/glossary/#grapheme
> (1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter a and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.

Grapheme cluster https://www.unicode.org/glossary/#grapheme_cluster
> The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation." (See definition D60 in Section 3.6, Combination.) A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
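
The "grapheme base plus any number of nonspacing marks" idea can be sketched roughly as follows. This is emphatically not the UAX #29 algorithm: it only attaches characters of general category `Mn` to the preceding character, and it ignores Hangul jamo sequences, CRLF, ZWJ emoji sequences, regional indicators, and everything else the real rules cover. The function name `approximate_clusters` is made up for the example.

```python
import unicodedata

def approximate_clusters(text: str) -> list:
    """Rough approximation only: attach nonspacing marks (Mn) to the
    preceding character. Not a UAX #29 implementation."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            clusters[-1] += ch   # mark joins the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters

sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT + 'a'
clusters = approximate_clusters(sample)
print(len(sample), len(clusters))  # 3 2
```

Here three code points come out as two clusters, which is closer to what a user would count as characters. For real segmentation, a library that implements UAX #29 should be used instead.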

User-perceived character https://www.unicode.org/glossary/#user_perceived_character
> What everyone thinks of as a character in their script.

-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at https://github.com/w3c/bp-i18n-specdev/issues/28#issuecomment-411764660 using your GitHub account

Received on Thursday, 9 August 2018 13:51:33 UTC