- From: Andrew Cunningham <andrewc@vicnet.net.au>
- Date: Wed, 12 Dec 2007 17:15:45 +1100
- To: Najib Tounsi <ntounsi@emi.ac.ma>
- CC: Richard Ishida <ishida@w3.org>, www-international@w3.org
Hi Najib,

Najib Tounsi wrote:
>
> Hi Richard,
>
> I am still unclear about the definition of encoding.
>
> "Basically, [...] A character encoding is [...] is a set of mappings
> between the bytes representing numbers in the computer and characters."
>
> Mapping between bytes (or codepoints) and characters
>
> "Unfortunately, [...] many different [...] encodings, ie. many different
> ways of mapping between bytes, codepoints and characters."
>
> Mapping between bytes and codepoint on one hand, and between codepoint
> and character on the other hand.
>
> But, here is precisely my question. There are two levels of mappings:
>
>   Bytes <---> Code-points <---> character-set
>         (1)               (2)
>
> What is encoding? Mapping (1), (2) or (likely) the composition of the two?

I usually think in terms of

1) character set: a repertoire of characters
2) coded character set: characters in a repertoire are assigned codepoints
3) character encoding: how the codepoints are represented as a sequence
   of one or more bytes.

So for me encoding would be (1) in your diagram and coded character set
would be (2).

Logically I'd suggest the encoding is always mapping (1), regardless of
whether it's an 8-bit encoding or a multibyte encoding.

To my way of thinking, an 8-bit encoding is a simplification: the
codepoint (in hex) corresponds exactly to the byte representation.

This doesn't make it any different from other encodings. It just means
there are certain cases (8-bit encodings) where the model can be
simplified (i.e. step (1) and step (2) collapsed into one step), but this
is a special case rather than a different model or understanding.

I hope I am conveying my thoughts in an understandable way.

>
> Consider Unicode encoding vs ISO-8859-x encoding.
>
> A) In the case of the ISO-8859-x series, mapping (1) is done by
> OneByte=OneCodePoint, and mapping (2) is done by some table, depending
> on the context (encoding?).
>
> e.g.
> 233 <---> 233 <---> {é, Cyrillic Shcha щ} depending on ISO-8859-{1, 5}
>     (1)       (2)
>
> mapping (2) is the "encoding" (where we have multiple choice).
>

But only if you treat every 8-bit encoding the same way? If the encoding
is known and declared, it's actually 1-1.

It is always 1-1; the issue is whether it is identified correctly, since
it may be confusable with other 8-bit encodings.

It also assumes that by looking at the data you can't visually
differentiate between encodings.

>
> B) In the case of Unicode, mapping (1) is the encoding (there are
> multiple choice).
>
> {"D1 89", 1097, etc.}  <---> 1097      <---> Cyrillic Shcha щ.
> {Utf-8, Utf-16, etc.}  <---> Codepoint <---> Character-set.
>                        (1)             (2)
>
> Here, mapping (2) between Codepoint and Character is One-to-One.
>

And so is codepoint to encoding.

The issue is that 8-bit encodings have traditionally made the distinction
between a character encoding and a coded character set irrelevant. But
this is peculiar to 8-bit encodings. It is possible to describe 8-bit
encodings in the same model as multibyte encodings, in terms of coded
character set and character encoding.

Part of the problem is that historically the 8-bit encodings have
confused the situation, so you get people referring to character
encodings as character sets and charsets.

>
> So, is it worth showing these two levels of mappings when talking about
> encoding?
>
> Note in passing that Unicode encodings are a good choice, because it is
> the mapping between codepoint and character which is one-to-one.
>

At least that's my 2 cents' worth. N.B. 2 cents is no longer legal tender
here.

-- 
Andrew Cunningham
Research and Development Coordinator (Vicnet)
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Email: andrewc@vicnet.net.au
Alt. email: lang.support@gmail.com

Ph: +613-8664-7430
Fax: +613-9639-2175
Mob: 0421-450-816

http://www.slv.vic.gov.au/
http://www.vicnet.net.au/
http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://home.vicnet.net.au/~andrewc/
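
The distinction between coded character set and character encoding discussed in this message can be tried out directly. What follows is only a minimal sketch, assuming Python 3 and its built-in codecs; the variable names are illustrative and not taken from the thread. It takes щ from character to codepoint to bytes, then shows that the single byte 233 (0xE9) is é under ISO-8859-1 and щ under ISO-8859-5, each mapping being 1-1 once the encoding is known.

```python
# Sketch only: variable names are illustrative assumptions.
# Relies solely on Python 3 built-ins (ord, str.encode, bytes.decode).

SHCHA = "\u0449"  # CYRILLIC SMALL LETTER SHCHA

# Coded character set: the character is assigned a codepoint.
codepoint = ord(SHCHA)  # 1097 decimal, 0x449 hex

# Character encoding: the codepoint is represented as one or more bytes.
utf8_bytes = SHCHA.encode("utf-8")        # two bytes: D1 89
utf16_bytes = SHCHA.encode("utf-16-be")   # two bytes: 04 49
cyr8_bytes = SHCHA.encode("iso-8859-5")   # one byte:  E9

# 8-bit case: one byte value, two different characters,
# depending on which encoding is declared.
raw = bytes([233])
as_latin1 = raw.decode("iso-8859-1")      # U+00E9, e with acute
as_cyrillic = raw.decode("iso-8859-5")    # U+0449, shcha

# In ISO-8859-1 the Unicode codepoint equals the byte value,
# which is the "collapsed" special case described in the message.
latin1_collapses = ord(as_latin1) == raw[0]

print(hex(codepoint), utf8_bytes.hex(), utf16_bytes.hex(), cyr8_bytes.hex())
print(as_latin1, as_cyrillic, latin1_collapses)
```

Decoding the same byte with an undeclared or wrongly guessed 8-bit encoding produces no error, only a different character, which is why correct identification matters more for 8-bit encodings than for the codepoint-to-character step.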
Received on Wednesday, 12 December 2007 06:28:02 UTC