- From: Andrew Cunningham <andrewc@vicnet.net.au>
- Date: Wed, 12 Dec 2007 17:15:45 +1100
- To: Najib Tounsi <ntounsi@emi.ac.ma>
- CC: Richard Ishida <ishida@w3.org>, www-international@w3.org
Hi Najib,
Najib Tounsi wrote:
>
> Hi Richard,
>
> I am still unclear about the definition of encoding.
>
> "Basically, [...] A character encoding is [...] is a set of mappings
> between the bytes representing numbers in the computer and characters."
>
> Mapping between Bytes (or codepoint) and characters
>
> "Unfortunately, [...] many different [...] encodings, ie. many different
> ways of mapping between bytes, codepoints and characters."
>
> Mapping between bytes and codepoint on one hand, and between codepoint
> and character on the other hand.
>
> But, here is precisely my question. There are two levels of mappings:
>
> Bytes <---> Code-points <---> character-set
>        (1)               (2)
>
> What is encoding? Mapping (1), (2) or (likely) the composition of the two?
I usually think in terms of
1) character set: a repertoire of characters
2) coded character set: the characters in a repertoire are assigned codepoints
3) character encoding: how the codepoints are represented as a sequence
of one or more bytes.
So for me the encoding would be (1) in your diagram, and the coded character
set would be (2).
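
A minimal Python sketch of the layers described above. It assumes Unicode
as the coded character set (that is what Python's ord() reports), and the
helper name layers() is only illustrative:

```python
def layers(ch: str, encoding: str):
    """Return (code point, encoded bytes) for a single character."""
    codepoint = ord(ch)            # coded character set: character -> code point
    encoded = ch.encode(encoding)  # character encoding: code point -> byte sequence
    return codepoint, encoded


# U+00E9 is LATIN SMALL LETTER E WITH ACUTE
print(layers("\u00e9", "utf-8"))       # one code point, two bytes
print(layers("\u00e9", "iso-8859-1"))  # one code point, one byte
```

The code point stays the same in both calls; only the byte representation
changes with the encoding.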
Logically, I'd suggest the encoding is always the codepoint-to-bytes step,
regardless of whether it's an 8-bit encoding or a multibyte encoding.
To my way of thinking, an 8-bit encoding is a simplification: the
codepoint (in hex) corresponds exactly to the byte representation.
This doesn't make it any different from other encodings; it just means
there are certain cases (8-bit encodings) where the model can be
simplified (i.e. step (1) and step (2) collapsed into one step), but
this is a special case rather than a different model or understanding.
Hope I'm conveying my thoughts in an understandable way.
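
As a rough illustration of that special case, the sketch below checks
whether the byte value and the code point number coincide for every byte.
Note the assumption: Python only exposes Unicode code points, so the check
succeeds for ISO-8859-1 (whose code points match the first 256 of Unicode)
and fails for ISO-8859-5, even though within ISO 8859-5's own coded
character set the code position and the byte still coincide. The function
name is hypothetical.

```python
def byte_equals_codepoint(encoding: str) -> bool:
    """True if each byte 0..255 decodes to the character whose Unicode
    code point has the same number, and encodes back to that same byte."""
    for value in range(256):
        raw = bytes([value])
        ch = raw.decode(encoding)
        if ord(ch) != value or ch.encode(encoding) != raw:
            return False
    return True


print(byte_equals_codepoint("iso-8859-1"))
print(byte_equals_codepoint("iso-8859-5"))
```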
>
> Consider Unicode encoding vs ISO-8859-x encoding.
>
> A) In the case of the ISO-8859-x series, mapping (1) is done by
> OneByte=OneCodePoint, and mapping (2) is done by some table, depending
> on the context (encoding?).
>
> e.g.
> 233 <---> 233 <---> {é, Cyrillic Shcha щ} depending on ISO-8859-{1, 5}
>      (1)       (2)
>
> mapping (2) is the "encoding" (where we have multiple choice).
>
But only if you treat every 8-bit encoding the same way?
If the encoding is known and declared, it's actually 1-1.
It is always 1-1; the issue is whether it is identified correctly, since it
may be confusable with other 8-bit encodings.
It also assumes that you can't visually differentiate between encodings by
looking at the data.
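
A small sketch of the point about declaring the encoding, using Python's
built-in codecs. The byte value 233 (0xE9) is used here because it is é in
ISO-8859-1 and щ in ISO-8859-5; decode_as() is just an illustrative helper:

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Interpret the same bytes under an explicitly declared encoding."""
    return raw.decode(encoding)


raw = bytes([233])  # a single byte, 0xE9

# Once the encoding is declared, each byte maps to exactly one character.
print(decode_as(raw, "iso-8859-1"))  # LATIN SMALL LETTER E WITH ACUTE
print(decode_as(raw, "iso-8859-5"))  # CYRILLIC SMALL LETTER SHCHA
```

The ambiguity only arises when the declaration is missing or wrong, not in
the mapping itself.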
>
> B) In the case of Unicode, mapping (1) is the encoding (there are
> multiple choices).
>
> {"D1 89", 1097, etc.} <---> 1097      <---> Cyrillic Shcha щ.
> {Utf-8, Utf-16, etc.} <---> Codepoint <---> Character-set.
>                        (1)             (2)
>
> Here, mapping (2) between Codepoint and Character is One-to-One.
>
and so is codepoint to encoding
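
To illustrate with the щ example (code point 1097, U+0449): each Unicode
encoding form gives one byte sequence for the code point, and decoding with
the same form gives the code point back. This is only a sketch; the choice
of the three big-endian codecs and the helper name are my own assumptions:

```python
def encoding_forms(ch: str) -> dict:
    """Encode one character in several Unicode encoding forms."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: ch.encode(name) for name in names}


shcha = "\u0449"  # CYRILLIC SMALL LETTER SHCHA, code point 1097
forms = encoding_forms(shcha)

for name, raw in forms.items():
    # Round trip: the bytes decode back to the same single code point.
    assert raw.decode(name) == shcha
    print(name, raw.hex(" "))
```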
The issue is that 8-bit encodings have traditionally made the
distinction between a character encoding and a coded character set
irrelevant. But this is peculiar to 8-bit encodings.
It is possible to describe 8-bit encodings in the same model as
multibyte encodings, in terms of coded character set and character encoding.
Part of the problem is that historically the 8-bit encodings have
confused the situation, so you get people referring to character
encodings as character sets and charsets.
>
> So, is it worth showing these two levels of mappings when talking about
> encoding?
>
> Note in passing that Unicode encodings are a good choice, because it is
> the mapping between codepoint and character which is one-to-one.
>
At least that's my 2 cents' worth. N.B. 2 cents is no longer legal tender here.
--
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia
Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com
Ph: +613-8664-7430 Fax:+613-9639-2175
Mob: 0421-450-816
http://www.slv.vic.gov.au/ http://www.vicnet.net.au/
http://www.openroad.net.au/ http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/
Received on Wednesday, 12 December 2007 06:28:02 UTC