- From: <lee@sq.com>
- Date: Sat, 19 Oct 96 18:29:59 EDT
- To: dgd@cs.bu.edu, w3c-sgml-wg@w3.org
> So, I suggest that we have the following problems:
>
> STEP 1: determine number of bytes in a character code (bit-combination).
[...]
>
> (STEP 1'): Or we can use the FFFE hack, if that will have wider support.

Probably, since it is explicitly suggested by Unicode. Whether you call
it a hack or not depends on how you feel about it, I think :-)

> STEP 2: now we have a standard MIME-header, encoded in Latin1, using the
> byte coding as determined in step 1 (i.e., 1-byte, big-endian, or
> little-endian).

Not yet. We have looked at the first two bytes of the file and
determined whether they are using the Unicode/10646 FEFF/FFFE option.
We have yet to determine
(1) that the files are in XML at all and not HTML or SGML or GENCODE...
(2) if they are in XML, what version of XML do they use? With talk of
    an XML 2.0, this is important.

> We will define the fields that we _require_ (charset, only,
> I hope). We will ignore, but permit, others. The CRLFCRLF sequence that
> terminates the header will start the XML instance. This header will be
> optional for channels (like HTTP) where the header information can be
> preserved, and required for all others.

OK, I see now. You are suggesting that we put a MIME header in the
document in all cases. I think this is an excellent suggestion.

Note that many existing web servers (including Apache) cope with files
containing MIME headers, and may even emit those headers in response to
an HTTP HEAD request. Apache is said (independently) to represent over
30% of all running web servers.

> If we go further, we could eliminate STEP 1,
> and parse the header in 8-bit mode always. Then people can use their
> standard codebase to parse the headers, but 2-byte systems might show the
> header as garbage

It's always a little tricky to talk about mixing character sets within
a single file. However, since MIME headers are in US ASCII (or is
Latin 1 allowed now?), the headers must be in the subset common to both.
As long as you don't include people's names, filenames, document titles
or other displayable information in the MIME header in an XML file,
though, I don't imagine it would be a problem. James? Gavin? Martin?

At a minimum, you would need
    Mime-version: 1.0
    Content-type: text/x-xml;version=1.0;charset=utf-8?

If we get text/xml registered as MIME content type, it can be text/xml
instead of text/x-xml, which would be good (I assume the leading x is OK!).

Instead of requiring the full MIME CR-LF at the end of each line (which
is a pain to maintain on some platforms, e.g. Mac and Unix), I would
suggest documenting a format in which
* The lines are terminated by either NL or CR
* Lines longer than 72 characters may be continued by starting the next
  line with spaces or tabs (HT) (as per RFC 822)
* The header is terminated by
      CR LF CR LF
      CR CR
  or  LF LF
  i.e. you must have two of the same kind.

You then get a header format which can easily and reliably be edited on
multiple platforms -- e.g. you can upload a file from your PC to a Unix,
NT or Mac server, and make a quick change in Notepad, Sam, or whatever,
without trashing the file header. Of course, the editor has to be able
to write out the body of the file correctly!

(Note that ftp in text mode will silently transform line endings into
the format preferred on the destination platform, RS/RE rules
notwithstanding. When you receive a file of format text/plain (e.g. 'cos
the web server isn't set up to recognise some content type or other),
the same transformation is made, which is why images don't transfer
correctly if they have odd filenames or are in new formats the server
doesn't grok. It would be useful if XML were robust against such things
as much as possible.)

Lee
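
P.S. To make the relaxed header format above concrete, here is a rough
sketch in Python of how a reader might check the first two bytes and
then split the header from the XML instance. It is only an illustration
under stated assumptions, not a proposed implementation: the names
sniff_byte_order and split_header are made up for this note, the header
is assumed to be 1-byte US ASCII, and CR LF is accepted as a line ending
alongside lone CR and lone NL.

```python
import re
from typing import Dict, Tuple

# Header ends at the first doubled terminator of the same kind:
# CR LF CR LF, CR CR, or LF LF.
_HEADER_END = re.compile(rb"\r\n\r\n|\r\r|\n\n")
# Inside the header, a line may end in CR LF, a lone CR, or a lone LF.
_LINE_END = re.compile(rb"\r\n|\r|\n")


def sniff_byte_order(data: bytes) -> str:
    """Guess the byte coding from the first two bytes (FEFF/FFFE check)."""
    if data[:2] == b"\xfe\xff":
        return "big-endian"
    if data[:2] == b"\xff\xfe":
        return "little-endian"
    return "1-byte"


def split_header(data: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split a document into (header fields, body).

    Field names are lower-cased. Continuation lines (starting with a
    space or tab, as in RFC 822) are folded into the previous field.
    """
    end = _HEADER_END.search(data)
    if end is None:
        raise ValueError("no header terminator found")
    raw_header, body = data[:end.start()], data[end.end():]

    fields: Dict[str, str] = {}
    last_name = None
    for raw_line in _LINE_END.split(raw_header):
        # Assumption: header is plain US ASCII; anything else raises.
        line = raw_line.decode("ascii")
        if not line:
            continue  # stray empty line from mixed endings; ignore it
        if line[0] in " \t" and last_name is not None:
            fields[last_name] += " " + line.strip()
            continue
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError("malformed header line: %r" % line)
        last_name = name.strip().lower()
        fields[last_name] = value.strip()
    return fields, body
```

Calling split_header on a file saved with Unix, Mac or PC line endings
should give the same fields each time, which is the robustness the
format is after. A real reader would still have to handle a header
stored in a 2-byte coding, which this sketch does not attempt.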
Received on Saturday, 19 October 1996 18:31:23 UTC