META character set specifications from Dany Talisman on 1998-02-09 (www-html-editor@w3.org from January to March 1998)

From: Dany Talisman <dani@capella.co.il>
Date: Mon, 09 Feb 1998 23:11:06 +0300
To: www-html-editor@w3.org
Message-ID: <34DF62DA.A1F92490@capella.co.il>
There is some lack of clarity, and what seems to be a real problem, with the
mechanism described in Section 5.2.2 of the current HTML 4.0
specification, with regard to using the META tag to specify a character
set. I couldn't resolve the following from the text as it currently
stands:

1. What character set is to be used to locate the META tag? There really
should be a requirement that this should always be in some standard
encoding, otherwise tools would be hard-pressed to parse the META tag
itself. There is a hint to that effect saying that "The META declaration
must only be used when the character encoding is such that ASCII
characters stand for themselves at least until the META element is
parsed." However, this does not provide a general solution for the case
when the encoding is not compatible with ASCII characters (e.g., simple
16-bit Unicode). How can a tool safely locate the META tag in such a
case? Not all HTML documents are accessed via an HTTP server; some are
read directly from the disk, which rules out any "external databases"
which could resolve this problem; I assume that there should be a
standard way of reading such files.
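
To make point 1 concrete, here is a rough Python sketch of the kind of
scan a tool might do when reading a file from disk. It assumes that ASCII
characters stand for themselves up to the META tag; the helper name
find_meta_charset, the regular expression and the 1024-byte limit are
illustrative assumptions, not anything the specification defines.

```python
import re

# Hypothetical pattern: a charset=... value somewhere inside a META tag,
# matched on raw bytes as if ASCII characters stand for themselves.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)

def find_meta_charset(raw, limit=1024):
    """Return the charset named in a META tag, or None if none is found."""
    m = META_CHARSET.search(raw[:limit])
    return m.group(1).decode("ascii") if m else None

doc = ('<HTML><HEAD><META HTTP-EQUIV="Content-Type" '
       'CONTENT="text/html; charset=ISO-8859-8"></HEAD></HTML>')

# ASCII-compatible encoding: the tag is found.
found_ascii_compatible = find_meta_charset(doc.encode("iso-8859-8"))

# 16-bit encoding: every other byte is zero, so the same scan finds nothing.
found_16_bit = find_meta_charset(doc.encode("utf-16-le"))
```

With the 16-bit encoding the scan comes back empty, which is exactly the
case the quoted sentence from the specification leaves open.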

2. At exactly which point is the encoding switch specified by a META
tag made? Immediately at the end of the META tag, or at the end of the
HEAD section? Given that the head section may need to contain text which
is not easily expressed in ASCII, I presume the intent is that the
change is made immediately after the '>' ending the META tag. This is
hinted by the recommendation that "META declarations should appear as
early as possible", but there's no explicit wording to that effect.

3. Suppose it is valid to use the META tag to change encoding at some
point in the text. Suppose it is also valid for the whole document
to be in any encoding, without a META tag, given that the HTTP server
correctly reports its encoding in the header. This seems to be the
current state of the standard, as hinted by the fact that "Some servers
examine the first few bytes of the document, or check against a database
of known files and encodings" to determine encoding.

For documents of the first type, the HTTP server is expected to report
in the header the encoding specified in the META tag. However, the text
up to the META tag is encoded in ASCII. How is a user agent expected to
parse this text, if the encoding is not compatible with ASCII? As far as
the agent is concerned, such files look the same as files which use this
encoding from the first character, without a META tag.

I tried thinking of ways this could be handled under the current
standard or with minimal changes. All except the last suffer from severe
compatibility problems with current servers:

- Have the HTTP server encode the text up to the META tag using the
specified encoding. This raises another problem: if you read an HTML
file from the disk, you get a different result than reading it from an
HTTP server. Even worse, current servers will not do the encoding while
new ones will, so you'll never know exactly what to expect.

- Have the HTTP server send a new header entry, which indicates that the
specified character set is only used from some point onward. This would
require the user agent to search for the META tag to determine this
switching point. It still faces the compatibility problems with existing
servers, which would not send this data.

- Require that the character set specified in the HTTP header be used
from the first character. The META tag would cause a switch from this
initial character set to any new one. This would again require user
agents to look for it. It would also require servers to make a special
case for the character set META tag, and not place it in the header;
again a compatibility problem with current servers.

- Likewise (use the HTTP header character set from the first character,
switching later on), but instead of using a META tag which affects the
HTTP header, use one which doesn't. The only disadvantage I could find
with this method is that it does not agree with the current standard :-)
IMVHO, this is the right choice, combined with a strong recommendation
that all files start with an ASCII encoding up to this tag, so as to be
usable directly as well as through HTTP servers.
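
The last option can be sketched roughly as follows, in Python. This is
only an illustration under stated assumptions: the message does not name
the switching tag, so a META charset declaration stands in for it; the
initial character set is assumed to be ASCII-compatible (as the
recommendation above suggests); and the switch is assumed to happen
immediately after the '>' that ends the tag. The name decode_with_switch
is hypothetical.

```python
import re

# Hypothetical stand-in for the switching tag: a META charset declaration,
# matched through to its closing '>'.
SWITCH_TAG = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)[^>]*>',
    re.IGNORECASE)

def decode_with_switch(raw, initial_charset="ascii"):
    """Decode raw bytes, switching charset right after the tag's '>'.

    initial_charset would come from the HTTP header, or default to ASCII
    for a file read directly from disk.
    """
    m = SWITCH_TAG.search(raw)
    if m is None:
        # No switching tag: the initial charset applies to the whole file.
        return raw.decode(initial_charset)
    new_charset = m.group(1).decode("ascii")
    head, rest = raw[:m.end()], raw[m.end():]
    return head.decode(initial_charset) + rest.decode(new_charset)

# ASCII up to and including the tag, 16-bit text from there on.
prefix = (b'<HEAD><META HTTP-EQUIV="Content-Type" '
          b'CONTENT="text/html; charset=UTF-16LE">')
title = "<TITLE>\u05e9\u05dc\u05d5\u05dd</TITLE></HEAD>"
raw = prefix + title.encode("utf-16-le")

text = decode_with_switch(raw)
```

Because the same bytes decode the same way whether they come from disk or
from a server, the file stays usable in both settings.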

Quite likely I'm missing something here, in which case a simple
clarification of the intent in section 5.2.2 would solve the problem.
Sorry if this is covered somewhere else; I tried looking in other
sections (the META tag definition, for example), but none seemed to give
a definite answer.

Thanks,
    Oren Ben-Kiki
    oren@capella.co.il
Received on Monday, 9 February 1998 16:11:37 UTC