- From: David G. Durand <dgd@cs.bu.edu>
- Date: Sat, 19 Oct 1996 12:54:18 -0400
- To: w3c-sgml-wg@w3.org
In answering a long private email to Michael Sperberg-McQueen on the topic of character encodeing and metatdata, I had a revelation as to the correct way to do this, without blowing our compatibility with the rest of the world. First a few premises of my reasoning: 1: Character set declaration data must be parseable by everyone. "Bad File Format" is an unacceptable error if what it _really means_ is "Character set 'Bill's Cyrillic' unknown". Any character-set determining prefix to a file must be a parseable series of bytes that everyone can read. 2: compatibility with HTTP's transport functionality is essential. We should not talke the time, or the risk, of reinventing yet another way to handle character sets. So, I suggest that we have the following problems: STEP 1: determine number of bytes in a character code (bit-combination). The possiblities are 1 and 2 bytes/character (UTFs count as 8-bit codes for our purposes). I don't remember if '\0' avoidance is still an issue. I'd prefer to that we define it as a non-issue, and look at the first 2 bytes of the input. Presence of a null indicates a 2-byte character set, and its position indicates the endianity of the data, if we decide to allow this option (I'd prefer to big-endian, and lose a case). (STEP 1'): Or we can use the FFFE hack, if that will have wider support. STEP 2: now we have a standard MIME-header, encoded in Latin1, using the byte coding as determined in step 1 (i.e., 1-byte, big-endian, or little-endian). We will define the fields that we _require_ (charset, only, I hope). We will ignore, but permit, others. The CRLFCRLF sequence that terminates the header will start the XML instance. This header will be optional for channels (like HTTP) where the header information can be preserved, and required for all others. This means any XML processor can produce the needed tagging information, and it can always be parsed. If we go further, we could eliminate STEP 1, and parser the header in 8-bit mode always. Then people can use their standard codebase to parse the headers, but 2-byte systems might show the header as garbage (proabably a _bad thing_). -- David RE delenda est. _________________________________________ David Durand dgd@cs.bu.edu \ david@dynamicDiagrams.com Boston University Computer Science \ Sr. Analyst http://www.cs.bu.edu/students/grads/dgd/ \ Dynamic Diagrams --------------------------------------------\ http://dynamicDiagrams.com/ MAPA: mapping for the WWW \__________________________ http://www.dynamicdiagrams.com/services_map_main.html
Received on Saturday, 19 October 1996 12:49:46 UTC