Re: LANG + chars

> This is with the intention of putting together the last few messages:
> 
> - Only one charset is allowed per document.
> 
> - What SHOULD be the default "document character set" for HTML ? 
>   Latin1, Unicode ... ?

This is addressed in the internationalization spec drafts and more
obliquely in the HTML 2.0 spec.

(Once again, the SGML "document character set" has relatively little
to do with the encoding used to transmit or store a document.)

The short answer: once you start dealing with "internationalization",
always use ISO 10646 as the document character set, as described in
section 2.2 of the i18n draft, regardless of the encoding. It's not a
question of a "default", since it never needs to change.

(If you mean the default "charset" for an HTML document, you are
talking about something completely different.)

A longer answer is that the HTML 2.0 spec contains some cleverly
designed weasel-wording that allows you to comply with it and
use a different SGML document character set, so long as it contains
all the characters in ISO-8859-1 and its numeric character
references agree with ISO-10646 (this is not as precisely specified
as in the i18n draft). It was written this way to align with the
intended direction of the i18n spec, while allowing room for
experimentation.

But if you want to comply with the i18n spec, you can safely ignore
most of these messy details.

The infamous notes from rfc1866.txt sect 1.2.1 that seem to get overlooked:
= =
* Its document character set includes [ISO-8859-1] and
        agrees with [ISO-10646]; that is, each code position listed
        in 13, "The HTML Coded Character Set" is included, and each
        code position in the document character set is mapped to the
        same character as [ISO-10646] designates for that code
        position.

            NOTE - The document character set is somewhat
            independent of the character encoding scheme used to
            represent a document. For example, the `ISO-2022-JP'
            character encoding scheme can be used for HTML
            documents, since its repertoire is a subset of the
            [ISO-10646] repertoire. The critical distinction is
            that numeric character references agree with
            [ISO-10646] regardless of how the document is
            encoded.

= =
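
To illustrate the NOTE above, here is a minimal Python sketch (my own
illustration, not taken from either spec). It assumes Python's built-in
"iso2022_jp" and "utf-8" codecs as stand-ins for two character encoding
schemes. The same text yields different bytes on the wire, but decodes
to the same ISO 10646 code positions either way:

```python
# The same three characters (written as escapes to keep this ASCII).
text = "\u65e5\u672c\u8a9e"

# Two different character encoding schemes for the same text.
as_iso2022jp = text.encode("iso2022_jp")
as_utf8 = text.encode("utf-8")

# The bytes differ between the two encodings...
bytes_differ = as_iso2022jp != as_utf8

# ...but once decoded, the code positions are identical.
positions_a = [ord(c) for c in as_iso2022jp.decode("iso2022_jp")]
positions_b = [ord(c) for c in as_utf8.decode("utf-8")]
print(bytes_differ, positions_a == positions_b)
```

The encoding only affects the byte stream; what a numeric character
reference points at is the code position, which does not change.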


> - How should we view:
>   + Many "document character sets" are allowed; e.g., ISO-8859-1, ISO-8859-7.
>   + Only (full 32 bits) 10646 is allowed.  The others are subsets.
 
If you want to comply with the i18n spec you use the one document
character set specified there, even if your character encoding (MIME
charset) is ISO-8859-1 or ISO-8859-7 or EBCDIC or UTF-8.

You CANNOT use ISO-8859-7 as an SGML document character set and comply
with the HTML 2.0 spec, because the numeric character references would
be inconsistent with ISO-10646 for the upper half of the encoding.
(Keeping numeric character references the same regardless of
encoding is a key issue that motivates a lot of this.)

But you can use ISO-8859-7 as an encoding.
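
A small Python sketch of the difference, assuming Python's built-in
"iso-8859-7" and "iso-8859-1" codecs (numeric_reference is a
hypothetical helper of mine, not anything defined in the specs). The
same byte means different characters under the two encodings, and the
numeric character reference follows the ISO 10646 code position, not
the byte value:

```python
# One byte, two meanings, depending on the MIME charset.
raw = b"\xe1"

greek = raw.decode("iso-8859-7")  # GREEK SMALL LETTER ALPHA
latin = raw.decode("iso-8859-1")  # LATIN SMALL LETTER A WITH ACUTE


def numeric_reference(ch):
    """Build a decimal reference from the ISO 10646 code position."""
    return "&#%d;" % ord(ch)


alpha_ref = numeric_reference(greek)
aacute_ref = numeric_reference(latin)
print(alpha_ref, aacute_ref)
```

So a document encoded in ISO-8859-7 that wants a numeric reference to
alpha writes &#945;, not &#225;, which would mean a-acute.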


> - The charset for transmission SHOULD be whatever is appropriate for the data.
> 
> - What is appropriate for the data ?
>   The client does not express any desire/restriction and the document is in 
>   the server in ISO-8859-7.  Should the server send it in ISO-8859-7 or 
>   in Unicode ?
> 
> - The server: "SHOULD or MUST ?" inform the client of the character set.

I'm not sure we've defined "appropriate for the data". You _could_ send
the full Unicode range of characters via US-ASCII encoding, if you
are willing to use lots of numeric character references.
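
As a rough sketch of what that looks like, assuming Python and its
standard "xmlcharrefreplace" error handler (which writes decimal
numeric character references for anything the target encoding cannot
represent):

```python
# Text containing characters outside US-ASCII (written as escapes).
text = "Greek alpha: \u03b1, Cyrillic de: \u0434"

# Encode to US-ASCII; unencodable characters become &#N; references,
# where N is the ISO 10646 code position.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)
```

The result is pure 7-bit data, at the cost of several bytes per
non-ASCII character.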

I think HTTP 1.1 has tried to tighten up charset labeling. This is an
HTTP/MIME issue, not strictly HTML. 

I'd say that sending unlabeled content is a widespread malpractice that we
are trying to get rid of.

If none of the HTTP mechanisms is used to specify the MIME charset
preferred in reply to a request, then I'm inclined to say that what
the server sends should be an implementation issue.

draft-ietf-http-v11-spec-06.txt says:
"If no Accept-Charset header is present, the default is that any
character set is acceptable. If an Accept-Charset header is present, and
if the server cannot send a response which is acceptable according to
the Accept-Charset header, then the server SHOULD send an error response
with the 406 (not acceptable) status code, though the sending of an
unacceptable response is also allowed."
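
A minimal Python sketch of that rule, under stated assumptions:
choose_charset and its arguments are hypothetical names of mine, the
q-value and "*" handling follow my reading of HTTP/1.1 rather than the
paragraph quoted above, and the sketch always answers 406 when nothing
matches even though the draft also allows sending an unacceptable
response.

```python
def choose_charset(accept_charset, available):
    """Return (status, charset) for a response.

    accept_charset: raw Accept-Charset header value, or None if absent.
    available: charsets the server can send, in its own preference order.
    """
    if not accept_charset:
        # No header present: any character set is acceptable.
        return 200, available[0]

    # Parse "name;q=value" items into a name -> q mapping.
    acceptable = {}
    for item in accept_charset.split(","):
        parts = [p.strip() for p in item.split(";")]
        name = parts[0].lower()
        if not name:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        acceptable[name] = q

    # Pick the available charset with the highest q (assumed semantics).
    best = None
    for charset in available:
        q = acceptable.get(charset.lower(), acceptable.get("*", 0.0))
        if q > 0 and (best is None or q > best[0]):
            best = (q, charset)

    if best is None:
        return 406, None
    return 200, best[1]


print(choose_charset(None, ["iso-8859-7"]))
print(choose_charset("utf-8", ["iso-8859-7"]))
```

A no-frills server that never translates can call this with a
one-element list: it either sends what it has or reports 406.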

I'd guess from your prior writing that you'd prefer that the server
send one of the Unicode-related encodings, even if it has to translate.

But one of the aims of the HTTP spec was to accommodate a wide range
of servers, including dumb/fast "20-line" HTTP servers, which serve
up the content they've got with no frills and no translation.

Maybe you should be trying to write another spec for a "multilingual
(something) server", which specifies more than the HTTP minimum.

> - Transmissions transformations are for compressing, encrypting
> (content-encoding) or "safe transport" (transfer-coding).  This is a 
> lower layer in the transmission.  As long as the higher functions are 
> concerned, they are talking Unicode, Latin1, etc.

Ok.
 
> - LANG is for higher functions, such as short quotations, etc.

Not just quotations, but that kind of stuff.
 
> - The server SHOULD inform the client with Content-Language.

I don't think this is what the HTTP spec says. To quote in part
from the same source:

"If no Content-Language is specified, the default is that the content is
intended for all language audiences. This may mean that the sender does
not consider it to be specific to any natural language, or that the
sender does not know for which language it is intended."

> - LANGs in the document overrides the Content-Language.
 
Yes.
 
> - There is no association between LANG and charset.

Yes.

> - I will do another posting regarding the more advanced language 
> negotiations.

When you say this, do you mean something beyond HTTP 1.1 content
negotiation?

-- 
    Albert Lunde                      Albert-Lunde@nwu.edu

Received on Thursday, 25 July 1996 09:36:14 UTC