Character set and character encoding.

From: Paolo Candelari <paolo.candelari@fastwebnet.it>
Date: Fri, 13 Feb 2004 12:21:45 +0100
Message-ID: <002601c3f223$90d919b0$9494630a@servizi.rai.it>
To: <www-html@w3.org>

I need help, please!

I have a doubt about the character encoding of HTML/XHTML documents.
The SGML declaration for HTML 4.01 defines ISO 10646 (UCS-4) as its
document character set.
On the W3C web site we can read:

  "The document character set for XML and HTML 4.0 is Unicode (aka ISO
10646).
  This means that HTML browsers and XML processors should behave
  as if they used Unicode internally."

The document character set is the "table" that is consulted when I use a
numeric character reference inside a document:

   "...numeric character references are always resolved with respect to
   the fixed document character set, and thus to the same characters,
   whatever the external encoding actually used."(RFC2070).
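
A minimal sketch of this point, in Python (chosen only for illustration;
the sample document and the helper name decode_and_resolve are
assumptions, not taken from the RFC). The same reference, &#233;, should
resolve to the same character whichever external encoding carried the
octets:

```python
import html

# One abstract document containing a numeric character reference.
# &#233; names code point 233 (U+00E9) in the document character set.
document = "<p>caf&#233;</p>"


def decode_and_resolve(octets: bytes, encoding: str) -> str:
    """Decode octets with the external encoding, then resolve references."""
    return html.unescape(octets.decode(encoding))


# Serialise the document with three different external encodings and
# read it back; the reference is resolved after decoding each time.
results = {
    enc: decode_and_resolve(document.encode(enc), enc)
    for enc in ("iso-8859-1", "iso-8859-5", "shift_jis")
}

for enc, text in results.items():
    print(enc, text)
```

Note that ISO-8859-5 cannot represent U+00E9 directly, yet the reference
still resolves to it, because the reference is looked up in the document
character set and not in the external encoding.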

What about the external character encoding? Again from RFC2070:

   "...the sequence of characters that constitutes an SGML document
   in the abstract sense are encoded by means of a sequence of octets
   (or occasionally bit groups of another length than 8) in a concrete
   realization of the document such as a computer file.
   This encoding is called the external character encoding of the
   concrete SGML document, and it should be carefully distinguished
   from the document character set of the abstract HTML document."

For the external character encoding we may use whichever character encoding
is best suited to the document's content (e.g. ISO-8859-1, Shift_JIS,
ISO-8859-5, etc.).
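
To make the distinction concrete, here is a small Python sketch (the
sample character and octet are my own choice). It shows one character
becoming different octet sequences under different encodings, and one
octet becoming different characters:

```python
# One abstract character (U+00E9), three external encodings.
char = "\u00e9"
encoded = {
    enc: char.encode(enc)
    for enc in ("iso-8859-1", "utf-8", "utf-16-be")
}
for enc, octets in encoded.items():
    print(enc, octets.hex(), len(octets), "octet(s)")

# One octet (0xB6), two external encodings, two different characters.
octet = b"\xb6"
as_latin1 = octet.decode("iso-8859-1")    # U+00B6 PILCROW SIGN
as_cyrillic = octet.decode("iso-8859-5")  # U+0416 CYRILLIC CAPITAL ZHE
print(repr(as_latin1), repr(as_cyrillic))
```

This is why a receiver cannot interpret the octets reliably unless it
learns the encoding from somewhere.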

We know that (RFC2070):

   "HTML, as an application of SGML, does not directly address the
   question of the external character encoding. This is deferred to
   mechanisms external to HTML, such as MIME as used by the HTTP
   protocol or by electronic mail."
   (...)
   "For the HTTP protocol, the external character encoding is
   indicated by the "charset" parameter of the "Content-Type" field of
   the header of an HTTP response." (Content-Type: text/html; charset=...)
   "The term "charset" in MIME is used to designate a character encoding,
   rather than merely a coded character set as the term may suggest. A
   character encoding is a mapping (possibly many-to-one) of sequences
   of octets to sequences of characters taken from one or more character
   repertoires."
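
As a rough sketch of how a receiver might extract that parameter, the
following Python uses the standard library's MIME header parsing. The
function name charset_from_content_type and the sample header values are
assumptions made for illustration:

```python
from email.message import Message


def charset_from_content_type(header_value: str):
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=ISO-8859-5")
missing = charset_from_content_type("text/html")
print(declared, missing)

# With the charset known, the octets of the body can be decoded.
body = b"\xb6"
if declared is not None:
    print(body.decode(declared))
```

When the parameter is missing the function returns None, and the
receiver has to fall back on some other source of information.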

Here is the problem: A character encoding is a mapping of sequences
of octets to sequences of characters taken from one or more character
repertoires.

My doubt is: how do UAs know how many octets/bytes the external character
encoding uses per character?
The answer (which is not a real answer) is: the UA reads the "Content-Type"
field of the HTTP header.
Yes, but how does the server know how to set this field? (I only put my
document on the server; I never state the character encoding explicitly.)

And suppose the server does not set that field...

Another answer (again from RFC2070):

   "In any document, it is possible to include an indication of the
   encoding scheme like the following, as early as possible within the
   HEAD of the document:

    <META HTTP-EQUIV="Content-Type"
     CONTENT="text/html; charset=...">

but:

   "This is not foolproof, but will work if the encoding scheme is such
   that ASCII-valued octets stand for ASCII characters only at least
   until the META element is parsed."
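
A rough Python sketch of that idea follows. It scans the first octets as
if they were ASCII, looking for the META declaration. The regular
expression, the 1024-octet limit and the name sniff_meta_charset are my
own assumptions for illustration, not the behaviour of any particular UA:

```python
import re

# Matches charset=... inside a META tag, reading the raw octets as ASCII.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def sniff_meta_charset(octets: bytes, limit: int = 1024):
    """Return the charset named in an early META element, or None."""
    match = META_CHARSET.search(octets[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = (
    b"<HTML><HEAD>\n"
    b'<META HTTP-EQUIV="Content-Type"\n'
    b' CONTENT="text/html; charset=ISO-8859-5">\n'
    b"</HEAD><BODY>\xb6</BODY></HTML>"
)
charset = sniff_meta_charset(page)
print(charset)
if charset is not None:
    print(page.decode(charset))
```

This only works because every octet before the META element means the
same thing in ASCII and in ISO-8859-5. For a document stored as UTF-16,
where each ASCII character is accompanied by a zero octet, the pattern
would not match at all.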

Here, I think, lies my answer!

UAs (and servers) read documents as if they were ASCII coded: one byte at
a time.

One exception: a two-octet code is used, or a code that uses a variable
number of octets (UTF-8, Shift_JIS); but this kind of code uses a starting
sequence of forbidden bytes (I suppose FF FE) or some other particular
sequence, and so UAs can understand that it is a two-octet (or
variable-length) code.
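
The guess can be tested with a short Python sketch. FF FE is the byte
order mark of little-endian UTF-16 (FE FF for big-endian), and UTF-8 has
an optional signature, EF BB BF. The function name sniff_bom and the
choice of which marks to check are assumptions for illustration; UTF-32
is left out for brevity:

```python
import codecs

# Leading octet sequences that identify an encoding, using the constants
# from the standard library.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(octets: bytes):
    """Return the encoding announced by a leading byte order mark, if any."""
    for bom, name in BOMS:
        if octets.startswith(bom):
            return name
    return None


utf16_doc = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
utf8_plain = "<html>\u00e9</html>".encode("utf-8")
sjis_doc = "<html>\u3042</html>".encode("shift_jis")

print(sniff_bom(utf16_doc), sniff_bom(utf8_plain), sniff_bom(sjis_doc))
```

In this sketch only the UTF-16 document announces itself. UTF-8 written
without the optional signature and Shift_JIS carry no such starting
sequence, so for them the encoding still has to be declared or guessed.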

Is this the correct answer?

Another question: what is HTML 4.01's default external character encoding?
UTF-8?
I ask because I can use a simple text editor (like Windows Notepad), always
use character entity references, and not declare the encoding used.
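
The approach described here can be sketched in Python as follows (the
sample text is my own). Every non-ASCII character is written as a numeric
character reference, so the file itself contains only ASCII octets:

```python
import html

text = "perch\u00e9 citt\u00e0"

# Replace every character outside ASCII with a numeric character reference.
ascii_octets = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_octets)

# Any ASCII-compatible external encoding decodes these octets identically,
# and resolving the references then recovers the original text.
recovered = {
    enc: html.unescape(ascii_octets.decode(enc))
    for enc in ("ascii", "iso-8859-1", "utf-8", "shift_jis")
}
print(recovered)
```

This holds only for encodings in which ASCII-valued octets stand for
ASCII characters; it would not hold if the UA assumed UTF-16, for
example.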

What about XHTML UAs?

I apologize for my bad English!

Thank you.

Paolo

Received on Friday, 13 February 2004 06:19:35 UTC