Data format/encoding/character set (PI v MIME head) from Rick Jelliffe on 1996-09-19 (w3c-sgml-wg@w3.org from September 1996)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Fri, 20 Sep 1996 02:22:52 +1000 (EST)
To: Gavin Nicol <gtn@ebt.com>
Cc: w3c-sgml-wg@w3.org
Message-Id: <Pine.ULT.3.90.960920021426.2648A-100000@chuckd.allette.com.au>

(repost due to finger trouble)
On Wed, 18 Sep 1996, Gavin Nicol wrote:

> This is my point. You *cannot* read the entity in unless you know the
> coded character set and encoding. 

It seems to me there are three basic data formats which character
encodings use: 8-bit (fixed and variable width), 16-bit big-endian (fixed
and variable width) and 16-bit little-endian (fixed and variable width).
If you include fixed 32-bit character formats you only add another three
(Intel byte order, Motorola byte order, PDP-11 byte order).
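
To make the six layouts concrete, here is a minimal sketch of how a
single ASCII-valued character such as "<" (0x3C) would sit in each of
them. The function name and the labels are mine, purely for
illustration, and the PDP-11 layout assumes the usual convention of two
little-endian 16-bit words with the high word first.

```python
import struct

def layouts(ch):
    """Return the byte layout of one character in each basic data format."""
    code = ord(ch)
    hi, lo = code >> 16, code & 0xFFFF  # 16-bit halves, used for PDP-11 order
    return {
        "8-bit": struct.pack("B", code),
        "16-bit big-endian": struct.pack(">H", code),
        "16-bit little-endian": struct.pack("<H", code),
        "32-bit Motorola order": struct.pack(">I", code),
        "32-bit Intel order": struct.pack("<I", code),
        # Assumed PDP-11 convention: high word first, each word little-endian.
        "32-bit PDP-11 order": struct.pack("<HH", hi, lo),
    }

for name, data in layouts("<").items():
    print(f"{name:24} {data.hex(' ')}")
```

The position of the single non-zero byte, and the number of zero bytes
around it, differs in every case.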

Can you give me any examples of character set encodings in use (not
compression, UUENCODE, etc.) in which you can't reliably establish the data
format used (for coded character sets which have ASCII characters in the
ASCII code positions) if the first string in the file is "<?XML"?

Once one can establish the data format, one can read the PI and get the
charset/encoding in use. (I.e. this is not autodetecting the character
set, nor the encoding, but merely the basic data format {of the initially
appearing ASCII-valued characters}. If that is such a 'hack', why does Unicode
specifically have the byte order mark character to allow it?)

Rick Jelliffe            http://www.allette.com.au/allette/ricko
                         email: ricko@allette.com.au
================================================================
Allette Systems          http://www.allette.com.au
                         email: info@allette.com.au
10/91 York St, 2000,     phone: +61 2 9262 4777
Sydney, Australia        fax:   +61 2 9262 4774
================================================================

Received on Thursday, 19 September 1996 13:40:31 UTC