Character set and character encoding. from Paolo Candelari on 2004-02-13 (www-html@w3.org from February 2004)

From: Paolo Candelari <paolo.candelari@fastwebnet.it>
Date: Fri, 13 Feb 2004 12:21:45 +0100
To: <www-html@w3.org>
Message-ID: <002601c3f223$90d919b0$9494630a@servizi.rai.it>

I need help, please!

I have a doubt about the character encoding of HTML/XHTML documents.
The SGML declaration for HTML 4.01 defines ISO 10646 (UCS-4) as its
document character set.
We can read on the W3C web site:

"The document character set for XML and HTML 4.0 is Unicode (aka ISO
10646).
This means that HTML browsers and XML processors should behave
as if they used Unicode internally."

The document character set is the "table" that is consulted when I use a
numeric character reference inside a document:

"...numeric character references are always resolved with respect to
the fixed document character set, and thus to the same characters,
whatever the external encoding actually used."(RFC2070).
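
A small Python sketch of the point quoted above, under the assumption that a
UA first decodes the octets and only then resolves references: &#233; always
names code point 233 of the document character set, whatever octets carried
the document. The sample markup and the helper name decode_and_resolve are
made up for illustration.

```python
import html


def decode_and_resolve(octets: bytes, encoding: str) -> str:
    """Decode the external encoding first, then resolve character references."""
    text = octets.decode(encoding)
    return html.unescape(text)


# The markup holds only ASCII characters plus numeric references,
# so every encoding below can represent it.
markup = "<p>caf&#233; &#x42F;</p>"

for enc in ("iso-8859-1", "utf-8", "shift_jis", "iso8859_5"):
    resolved = decode_and_resolve(markup.encode(enc), enc)
    # Each pass resolves to the same two characters: U+00E9 and U+042F.
    print(enc, resolved)
```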

What about external character encoding? Again from RFC2070:

"...the sequence of characters that constitutes an SGML document
in the abstract sense are encoded by means of a sequence of octets
(or occasionally bit groups of another length than 8) in a concrete
realization of the document such as a computer file.
This encoding is called the external character encoding of the
concrete SGML document, and it should be carefully distinguished
from the document character set of the abstract HTML document."

For the external character encoding we may use whichever character encoding
is best suited to the document's content (e.g. ISO-8859-1, SHIFT-JIS,
ISO-8859-5, etc.).
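
To make the distinction concrete, here is a minimal Python sketch showing
that the same abstract character becomes different octets under different
external encodings. The sample characters and the helper name octets_hex are
only illustrative choices.

```python
def octets_hex(char: str, encoding: str) -> str:
    """Return the octets that encode `char`, as space-separated hex."""
    return char.encode(encoding).hex(" ")


samples = [
    ("\u00e9", "iso-8859-1"),  # e with acute accent
    ("\u00e9", "utf-8"),
    ("\u042f", "iso8859_5"),   # Cyrillic capital letter Ya
    ("\u042f", "utf-8"),
    ("\u65e5", "shift_jis"),   # CJK ideograph "sun/day"
    ("\u65e5", "utf-8"),
]

for char, enc in samples:
    # Same character, different octet sequence (and different length).
    print(f"U+{ord(char):04X} in {enc}: {octets_hex(char, enc)}")
```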

We know that (RFC2070):

"HTML, as an application of SGML, does not directly address the
question of the external character encoding. This is deferred to
mechanisms external to HTML, such as MIME as used by the HTTP
protocol or by electronic mail."
(...)
"For the HTTP protocol, the external character encoding is
indicated by the "charset" parameter of the "Content-Type" field of
the header of an HTTP response." (Content-Type: text/html; charset=...)
"The term "charset" in MIME is used to designate a character encoding,
rather than merely a coded character set as the term may suggest. A
character encoding is a mapping (possibly many-to-one) of sequences
of octets to sequences of characters taken from one or more character
repertoires."
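
As an illustration of the "charset" parameter described above, this sketch
pulls it out of a Content-Type value with Python's standard email package.
The function name charset_of and the sample header values are assumptions
made for the example; a real UA does more than this.

```python
from email.message import Message
from typing import Optional


def charset_of(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(charset_of("text/html; charset=ISO-8859-5"))
print(charset_of("text/html"))  # no parameter: the UA has to look elsewhere
```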

Here is the problem: A character encoding is a mapping of sequences
of octets to sequences of characters taken from one or more character
repertoires.

My doubt is: how do UAs know how many octets/bytes the external character
encoding uses?
The answer (which is not a real answer) is: the UA reads the "Content-Type"
field of the HTTP header.
Yes, but how does the server know how to set this field? (I only put my
document on the server; I never state its character encoding explicitly.)

And suppose the server doesn't set that field...

Another answer (again from RFC2070):

"In any document, it is possible to include an indication of the
encoding scheme like the following, as early as possible within the
HEAD of the document:

   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-2022-JP">"

but:

"This is not foolproof, but will work if the encoding scheme is such
that ASCII-valued octets stand for ASCII characters only at least
until the META element is parsed."
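
A rough Python sketch of why this works only for ASCII-compatible encodings:
the octets are scanned as if they were ASCII until a META declaration turns
up. The regular expression, the 1024-byte limit and the name
sniff_meta_charset are assumptions for illustration, not what any particular
UA does.

```python
import re
from typing import Optional

# Looks for charset=... inside a META tag, treating the octets as ASCII.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(octets: bytes, limit: int = 1024) -> Optional[str]:
    """Return the charset named in an early META element, if one is visible."""
    match = META_CHARSET.search(octets[:limit])
    return match.group(1).decode("ascii").lower() if match else None


doc = '<HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-2022-JP"></HEAD>'
print(sniff_meta_charset(doc.encode("ascii")))   # found: octets are ASCII-valued
print(sniff_meta_charset(doc.encode("utf-16")))  # not found: two octets per character
```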

Here, I think, lies my answer!

UAs (and servers) read documents as if they were ASCII-encoded: one byte at
a time.

One exception: when a two-octet code is used, or a code that uses a variable
number of octets (UTF-8, SHIFT-JIS). But this kind of code uses a starting
sequence of forbidden bytes (I suppose FF FE) or some other particular
sequences, and so UAs can understand that it is a two-octet (or
variable-length) code.
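
For what it is worth, the guess about a starting sequence can be tried out
in Python. FF FE is the byte order mark of little-endian UTF-16, and UTF-8
has an optional signature EF BB BF, but SHIFT-JIS has no such signature, so
this check can only cover some encodings. The table and the name sniff_bom
are illustrative.

```python
import codecs
from typing import Optional

# UTF-32 must be tested before UTF-16: FF FE 00 00 starts with FF FE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(octets: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if octets.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xff\xfe<\x00p\x00>\x00"))       # UTF-16 with FF FE in front
print(sniff_bom("\u65e5\u672c".encode("shift_jis")))  # SHIFT-JIS: nothing to detect
```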

Is this the correct answer?

Another question: what is the default external character encoding of
HTML 4.01? UTF-8?
I ask because I can use a simple text editor (like Windows Notepad), always
use character entity references, and not declare the encoding used.
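
The approach described here (plain ASCII octets plus character references)
can be sketched with Python's built-in "xmlcharrefreplace" error handler.
The function name to_ascii_with_references and the sample text are
assumptions for the example.

```python
def to_ascii_with_references(text: str) -> bytes:
    """Encode as ASCII, writing every non-ASCII character as &#NNN;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


source = "caf\u00e9 \u042f"
octets = to_ascii_with_references(source)
print(octets)

# The result is pure ASCII, so it decodes identically under any
# ASCII-compatible encoding a UA might assume.
assert octets.decode("iso-8859-1") == octets.decode("utf-8")
```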

What about XHTML UAs?

I apologize for my bad English!

Thank you.

Paolo

Received on Friday, 13 February 2004 06:19:35 UTC