- From: Albert Lunde <Albert-Lunde@nwu.edu>
- Date: Thu, 25 Jul 1996 08:33:57 -0500 (CDT)
- To: carrasco@innet.lu (M.T. Carrasco Benitez)
- Cc: www-international@w3.org
> This is with the intention of putting together the last few messages:
>
> - Only one charset is allowed per document.
>
> - What SHOULD be the default "document character set" for HTML ?
>   Latin1, Unicode ... ?

This is addressed in the internationalization spec drafts and more obliquely in the HTML 2.0 spec. (Once again, the SGML "document character set" has relatively little to do with the encoding used to transmit or store a document.)

The short answer is, once you start dealing with "internationalization", always use ISO 10646 as described in section 2.2 of the i18n draft, regardless of the encoding. It's not a question of "default" as it never needs to change. (If you mean the default "charset" for an HTML document, you are talking about something completely different.)

A longer answer is that the HTML 2.0 spec contains some cleverly designed weasel-wording that allows you to comply with it and use a different SGML document character set, so long as it contains all the characters in ISO-8859-1 and its numeric character references agree with ISO-10646 (not as precisely specified as in the i18n draft). It was written this way to align with the intended direction of the i18n spec, while allowing room for experimentation.

But if you want to comply with the i18n spec, you can safely ignore most of these messy details.

The infamous notes from rfc1866.txt sect 1.2.1 that seem to get overlooked:

= =

* Its document character set includes [ISO-8859-1] and agrees with [ISO-10646]; that is, each code position listed in 13, "The HTML Coded Character Set" is included, and each code position in the document character set is mapped to the same character as [ISO-10646] designates for that code position.

NOTE - The document character set is somewhat independent of the character encoding scheme used to represent a document.
For example, the `ISO-2022-JP' character encoding scheme can be used for HTML documents, since its repertoire is a subset of the [ISO-10646] repertoire. The critical distinction is that numeric character references agree with [ISO-10646] regardless of how the document is encoded.

= =

> - How should this be viewed:
>   + Many "document character sets" are allowed; e.g., ISO-8859-1, ISO-8859-7.
>   + Only (full 32 bits) 10646 is allowed. The others are subsets.

If you want to comply with the i18n spec, you use the one document character set specified there, even if your character encoding (MIME charset) is ISO-8859-1 or ISO-8859-7 or EBCDIC or UTF-8.

You CANNOT use ISO-8859-7 as an SGML document character set and comply with the HTML 2.0 spec, because the numeric character references would be inconsistent with ISO-10646 for the upper half of the encoding. (Keeping numeric character references the same regardless of encoding is a key issue that motivates a lot of this.) But you can use ISO-8859-7 as an encoding.

> - The charset for transmission SHOULD be whatever is appropriate for the data.
>
> - What is appropriate for the data ?
>   The client does not express any desire/restriction and the document is in
>   the server in ISO-8859-7. Should the server send it in ISO-8859-7 or
>   in Unicode ?
>
> - The server: "SHOULD or MUST ?" inform the client of the character set.

I'm not sure we've defined "appropriate for the data". You _could_ send the full Unicode range of characters via US-ASCII encoding, if you are willing to use lots of numeric character references.

I think HTTP 1.1 has tried to tighten up charset labeling. This is an HTTP/MIME issue, not strictly HTML. I'd say that sending unlabeled content is a widespread malpractice that we are trying to get rid of.

If none of the HTTP mechanisms is used to specify the MIME charset preferred in reply to a request, then I'm inclined to say it should be an implementation issue what the server sends.
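To make the numeric character reference point concrete, here is a minimal Python sketch. The function names are hypothetical and chosen only for illustration; the codecs used are the standard Python ones. It shows that the byte value a character has in ISO-8859-7 is not the number used in its numeric character reference, and that a document can be reduced to pure US-ASCII by writing every other character as a reference.

```python
def numeric_reference(ch):
    """Build a decimal numeric character reference for one character.

    The number is the ISO 10646 code position (what ord() returns in
    Python), never the byte value in the transfer encoding.
    """
    return "&#%d;" % ord(ch)


def to_ascii_html(text):
    """Re-encode text as US-ASCII, replacing every non-ASCII character
    with a numeric character reference.

    Note: this does not escape markup characters such as '&' or '<'.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Greek small letter alpha as stored in an ISO-8859-7 document: one byte.
raw = b"\xe1"
alpha = raw.decode("iso8859_7")

byte_value = raw[0]            # position in the ISO-8859-7 encoding
code_point = ord(alpha)        # position in ISO 10646

print(byte_value, code_point)
print(numeric_reference(alpha))
print(to_ascii_html("abc " + alpha))
```

The byte 0xE1 (decimal 225) is alpha in ISO-8859-7, but `&#225;` would name the Latin-1 character at that position (a-acute). The reference that agrees with ISO 10646 is `&#945;`, whatever encoding the document is sent in.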
draft-ietf-http-v11-spec-06.txt says:

"If no Accept-Charset header is present, the default is that any character set is acceptable. If an Accept-Charset header is present, and if the server cannot send a response which is acceptable according to the Accept-Charset header, then the server SHOULD send an error response with the 406 (not acceptable) status code, though the sending of an unacceptable response is also allowed."

I'd guess from your prior writing that you'd prefer that the server send one of the Unicode-related encodings, even if it has to translate. But one of the aims of the HTTP spec was to accommodate a wide range of servers, including dumb/fast "20-line" HTTP servers, which serve up the content they've got with no frills and no translation.

Maybe you should be trying to write another spec for a "multilingual (something) server", which specifies more than the HTTP minimum.

> - Transmissions transformations are for compressing, encrypting
>   (content-encoding) or "safe transport" (transfer-coding). This is a
>   lower layer in the transmission. As far as the higher functions are
>   concerned, they are talking Unicode, Latin1, etc.

Ok.

> - LANG is for higher functions, such as short quotations, etc.

Not just quotations, but that kind of stuff.

> - The server SHOULD inform the client with Content-Language.

I don't think this is what the HTTP spec says. To quote in part from the same source:

"If no Content-Language is specified, the default is that the content is intended for all language audiences. This may mean that the sender does not consider it to be specific to any natural language, or that the sender does not know for which language it is intended."

> - LANGs in the document overrides the Content-Language.

Yes.

> - There is no association between LANG and charset.

Yes.

> - I will do another posting regarding the more advanced language
>   negotiations.

When you say this, do you mean something beyond HTTP 1.1 content negotiation?
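As a footnote to the Accept-Charset text quoted from the HTTP draft, the rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `choose_charset` and its arguments are hypothetical names, the server is assumed to hold the document in a fixed list of charsets with no translation, and quality values (`;q=...`) in the header are stripped and ignored for brevity.

```python
def choose_charset(accept_charset, available):
    """Pick a charset for the response, or signal 406.

    accept_charset: raw Accept-Charset header value, or None if absent.
    available: charsets the server can send, in the server's own
               order of preference (an assumption of this sketch).
    Returns a (status, charset) pair; charset is None on 406.
    """
    # No header present: any character set is acceptable.
    if accept_charset is None:
        return 200, available[0]

    # Keep only the charset names; drop parameters such as ";q=0.8".
    wanted = [item.split(";")[0].strip().lower()
              for item in accept_charset.split(",")]

    for name in available:
        if name.lower() in wanted:
            return 200, name

    # Nothing acceptable: the draft says the server SHOULD answer 406,
    # although sending an unacceptable response is also allowed.
    return 406, None


print(choose_charset(None, ["ISO-8859-7"]))
print(choose_charset("iso-8859-5, unicode-1-1;q=0.8", ["ISO-8859-7"]))
print(choose_charset("iso-8859-7", ["UTF-8", "ISO-8859-7"]))
```

A no-frills server of the kind described above would effectively have a one-element `available` list; a server that translates on the fly would simply have a longer one.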
-- Albert Lunde Albert-Lunde@nwu.edu
Received on Thursday, 25 July 1996 09:36:14 UTC