Re: For review: Character encodings for beginners from Andrew Cunningham on 2007-12-12 (www-international@w3.org from October to December 2007)

From: Andrew Cunningham <andrewc@vicnet.net.au>
Date: Wed, 12 Dec 2007 17:15:45 +1100
To: Najib Tounsi <ntounsi@emi.ac.ma>
CC: Richard Ishida <ishida@w3.org>, www-international@w3.org
Message-ID: <475F7C91.2050205@vicnet.net.au>
Hi Najib,


Najib Tounsi wrote:
> 
> Hi Richard,
> 
> I am still unclear about the definition of encoding.
> 
> "Basically, [...] A character encoding is [...] is a set of mappings 
> between the bytes representing numbers in the computer and characters."
> 
> Mapping between Bytes (or codepoint) and characters
> 
> "Unfortunately, [...] many different [...] encodings, ie. many different 
> ways of mapping between bytes, codepoints and characters."
> 
> Mapping between bytes and codepoint on one hand, and between codepoint 
> and character on the other hand.
> 
> But, here is precisely my question. There are two levels of mappings:
> 
> Bytes <---> Code-points  <---> character-set
>       (1)                (2)
> 
> What is encoding? Mapping (1), (2) or (likely) the composition of the two?


I usually think in terms of:

1) character set: a repertoire of characters
2) coded character set: characters in a repertoire are assigned codepoints
3) character encoding: how the codepoints are represented as a sequence 
of one or more bytes.

So for me, encoding would be (1) in your diagram, and the coded 
character set would be (2).

Logically, I'd suggest the encoding is always step (2), regardless of 
whether it's an 8-bit encoding or a multibyte encoding.
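
A minimal sketch of the three levels listed above, assuming Python 3 and 
using Unicode as the coded character set. The variable names are only 
illustrative and the choice of character is arbitrary.

```python
# Level 1: a character from some repertoire (here, Cyrillic shcha).
ch = "щ"

# Level 2: its codepoint in a coded character set (Unicode: U+0449 = 1097).
codepoint = ord(ch)

# Level 3: a character encoding represents that codepoint as bytes.
utf8_bytes = ch.encode("utf-8")        # two bytes: D1 89
utf16_bytes = ch.encode("utf-16-be")   # two bytes: 04 49

print(codepoint, utf8_bytes.hex(), utf16_bytes.hex())
```

The codepoint stays the same in both cases; only the byte representation 
changes with the encoding.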

To my way of thinking, an 8-bit encoding is a simplification: the 
codepoint (in hex) corresponds exactly to the byte representation.

This doesn't make it any different from other encodings. It just means 
there are certain cases (8-bit encodings) where the model can be 
simplified (i.e. step (1) and step (2) collapsed into one step), but 
this is a special case rather than a different model or understanding.
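
The collapse can be checked roughly as follows, assuming Python 3. One 
caveat: Python's ord() reports Unicode codepoints, so the check below 
only works directly for ISO-8859-1, whose 256 codepoints coincide with 
the first 256 Unicode codepoints. The name `collapsed` is illustrative.

```python
# For ISO-8859-1, the codepoint number and the single byte value coincide.
for byte_value in (0x41, 0xE9, 0xFF):
    ch = bytes([byte_value]).decode("iso-8859-1")
    assert ord(ch) == byte_value                       # codepoint == byte
    assert ch.encode("iso-8859-1") == bytes([byte_value])

# The same holds for every one of the 256 possible byte values.
collapsed = all(
    ord(bytes([b]).decode("iso-8859-1")) == b for b in range(256)
)
print(collapsed)
```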

I hope I am conveying my thoughts in an understandable way.

> 
> Consider Unicode encoding vs ISO-8859-x encoding.
> 
> A) In the case of the ISO-8859-x series, mapping (1) is done by 
> OneByte=OneCodePoint, and mapping (2) is done by some table, depending 
> on the context (encoding?).
> 
> e.g.
> 233 <---> 233 <---> {é, Cyrillic Shcha щ} depending on ISO-8859-{1, 5}
>     (1)       (2)
> 
> mapping (2) is the "encoding" (where we have multiple choice).
> 


But only if you treat every 8-bit encoding the same way?

If the encoding is known and declared, it's actually 1-1.

It is always 1-1; the issue is whether the encoding is identified 
correctly, since it may be confusable with other 8-bit encodings.

It also assumes that you can't visually differentiate between 
encodings by looking at the data.
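
As a small illustration, assuming Python 3 and the byte value 233 (0xE9), 
which is é in ISO-8859-1 and щ in ISO-8859-5: the byte alone is 
ambiguous, but each declared encoding gives a 1-1 mapping that 
round-trips. The variable names are illustrative.

```python
raw = bytes([233])  # the single byte 0xE9

# The same byte, interpreted under two different declared encodings.
as_latin1 = raw.decode("iso-8859-1")
as_cyrillic = raw.decode("iso-8859-5")

# Once the encoding is declared, the mapping is 1-1 in both directions.
assert as_latin1.encode("iso-8859-1") == raw
assert as_cyrillic.encode("iso-8859-5") == raw

print(as_latin1, as_cyrillic)
```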

> 
> B) In the case of Unicode, mapping (1) is the encoding (there are 
> multiple choice) .
> 
> {"D1 89", 1097, etc.} <--->   1097    <---> Cyrillic Shcha щ.
> {Utf-8, Utf-16, etc.} <---> Codepoint <---> Character-set.
>                       (1)             (2)
> 
> Here, mapping (2) between Codepoint and Character is One-to-One.
> 

And so is codepoint to encoding.
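
A rough check of that claim, assuming Python 3 and its built-in codecs: 
within any one Unicode encoding, the codepoint 1097 has exactly one byte 
sequence, and that sequence decodes back to the same codepoint. The 
dictionary name `forms` is illustrative.

```python
codepoint = 1097  # U+0449, Cyrillic shcha

forms = {}
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    encoded = chr(codepoint).encode(encoding)
    # Decoding the bytes gives back the same single codepoint.
    assert ord(encoded.decode(encoding)) == codepoint
    forms[encoding] = encoded.hex()

print(forms)
```

The byte sequences differ between encodings, but within each encoding 
the codepoint-to-bytes mapping is one-to-one.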

The issue is that 8-bit encodings have traditionally made the 
distinction between a character encoding and a coded character set 
irrelevant. But this is peculiar to 8-bit encodings.

It is possible to describe 8-bit encodings in the same model as 
multibyte encodings, in terms of coded character set and character encoding.
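
One way to sketch this, assuming Python 3 and taking ISO-8859-5 as the 
example: the coded character set is a table from codepoint to character, 
and the character encoding is a separate (here trivial) step from 
codepoint to a single byte. The names `coded_character_set`, 
`char_to_codepoint`, `codepoint_to_bytes` and `encode_8859_5` are 
hypothetical helpers, not part of any library.

```python
# Coded character set: codepoint -> character for ISO-8859-5.
# The table is read off Python's codec, which is possible only because
# codepoint and byte value coincide in an 8-bit encoding.
coded_character_set = {
    cp: bytes([cp]).decode("iso-8859-5") for cp in range(256)
}
char_to_codepoint = {ch: cp for cp, ch in coded_character_set.items()}

def codepoint_to_bytes(cp: int) -> bytes:
    """Character encoding step: one codepoint becomes exactly one byte."""
    return bytes([cp])

def encode_8859_5(text: str) -> bytes:
    """Apply both steps: character -> codepoint -> byte sequence."""
    return b"".join(codepoint_to_bytes(char_to_codepoint[c]) for c in text)

# The two-step model agrees with the built-in codec.
assert encode_8859_5("щи") == "щи".encode("iso-8859-5")
```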

Part of the problem is that, historically, the 8-bit encodings have 
confused the situation; thus you get people referring to character 
encodings as character sets and charsets.


> 
> So, is it worth showing these two levels of mappings when talking about 
> encoding?
> 
> Note in passing that Unicode encodings are a good choice, because it is 
> the mapping between codepoint and character which is one-to-one.
> 

At least that's my 2 cents' worth. N.B. 2 cents is no longer legal tender here.
-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com

Ph: +613-8664-7430                    Fax:+613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/            http://www.vicnet.net.au/
http://www.openroad.net.au/           http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/
Received on Wednesday, 12 December 2007 06:28:02 UTC