[Prev][Next][Index][Thread]

Re: B.1 and B.2 results



In answering a long private email to Michael Sperberg-McQueen on the topic
of character encodeing and metatdata, I had a revelation as to the correct
way to do this, without blowing our compatibility with the rest of the
world.

   First a few premises of my reasoning:
   1: Character set declaration data must be parseable by everyone. "Bad
File Format" is an unacceptable error if what it _really means_ is
"Character set 'Bill's Cyrillic' unknown". Any character-set determining
prefix to a file must be a parseable series of bytes that everyone can
read.

   2: compatibility with HTTP's transport functionality is essential. We
should not talke the time, or the risk, of reinventing yet another way to
handle character sets.


   So, I suggest that we have the following problems:

  STEP 1: determine number of bytes in a character code (bit-combination).
The possiblities are 1 and 2 bytes/character (UTFs count as 8-bit codes for
our purposes). I don't remember if '\0' avoidance is still an issue. I'd
prefer to that we define it as a non-issue, and look at the first 2 bytes
of the input. Presence of a null indicates a 2-byte character set, and its
position indicates the endianity of the data, if we decide to allow this
option (I'd prefer to big-endian, and lose a case).

   (STEP 1'): Or we can use the FFFE hack, if that will have wider support.

   STEP 2: now we have a standard MIME-header, encoded in Latin1, using the
byte coding as determined in step 1 (i.e., 1-byte, big-endian, or
little-endian). We will define the fields that we _require_ (charset, only,
I hope). We will ignore, but permit, others. The CRLFCRLF sequence that
terminates the header will start the XML instance. This header will be
optional for channels (like HTTP) where the header information can be
preserved, and required for all others.

   This means any XML processor can produce the needed tagging information,
and it can always be parsed. If we go further, we could eliminate STEP 1,
and parser the header in 8-bit mode always. Then people can use their
standard codebase to parse the headers, but 2-byte systems might show the
header as garbage (proabably a _bad thing_).

   -- David

RE delenda est.

_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
http://www.dynamicdiagrams.com/services_map_main.html



Follow-Ups: