- From: Paolo Candelari <paolo.candelari@fastwebnet.it>
- Date: Fri, 13 Feb 2004 12:21:45 +0100
- To: <www-html@w3.org>
I need help, please! I have a doubt about the character encoding of HTML/XHTML documents.

The SGML declaration for HTML 4.01 defines ISO 10646 as its character set (UCS-4). We can read on the W3C web site:

"The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally."

The document character set is the "table" that is consulted when I use a character reference (a "character entity") inside a document:

"...numeric character references are always resolved with respect to the fixed document character set, and thus to the same characters, whatever the external encoding actually used." (RFC 2070)

What about the external character encoding? Again from RFC 2070:

"...the sequence of characters that constitutes an SGML document in the abstract sense are encoded by means of a sequence of octets (or occasionally bit groups of another length than 8) in a concrete realization of the document such as a computer file. This encoding is called the external character encoding of the concrete SGML document, and it should be carefully distinguished from the document character set of the abstract HTML document."

For the external character encoding we may use whichever character set standard is best suited to the document's content (e.g. ISO-8859-1, SHIFT-JIS, ISO-8859-5, etc.). We also know that (RFC 2070):

"HTML, as an application of SGML, does not directly address the question of the external character encoding. This is deferred to mechanisms external to HTML, such as MIME as used by the HTTP protocol or by electronic mail." (...) "For the HTTP protocol, the external character encoding is indicated by the "charset" parameter of the "Content-Type" field of the header of an HTTP response." (Content-Type: text/html; charset=...) "The term "charset" in MIME is used to designate a character encoding, rather than merely a coded character set as the term may suggest. A character encoding is a mapping (possibly many-to-one) of sequences of octets to sequences of characters taken from one or more character repertoires."

Here is the problem. A character encoding is a mapping of sequences of octets to sequences of characters taken from one or more character repertoires. My doubt is: how do UAs know how many octets per character the external character encoding uses?

The usual answer (which is not a real answer) is: the UA reads the "Content-Type" field of the HTTP header. Yes, but how does the server know how to set this field? (I only put my document on the server; I never state the character encoding explicitly.) And suppose the server does not set that field...

Another answer, again from RFC 2070:

"In any document, it is possible to include an indication of the encoding scheme like the following, as early as possible within the HEAD of the document: <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=...">"

but:

"This is not foolproof, but will work if the encoding scheme is such that ASCII-valued octets stand for ASCII characters only at least until the META element is parsed."

Here, I think, lies my answer. UAs (and servers) read documents as if they were ASCII-coded, one byte at a time. The exception is when a two-octet code is used, or a code with a variable number of octets per character (UTF-8, SHIFT-JIS); but this kind of code starts with a sequence of otherwise forbidden bytes (I suppose FF FE) or some other particular sequence, so UAs can recognize that it is a two-octet (or variable-length) code. Is this the correct answer?

Another question: what is the default external character encoding of HTML 4.01? UTF-8? I ask because I can use a simple text editor (like Windows Notepad), always use character entity references, and never declare the encoding used. And what about XHTML UAs?

I apologize for my bad English! Thank you.

Paolo
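To make the question concrete, here is a minimal Python sketch of how a UA might choose the external character encoding and then resolve numeric character references against the document character set. It is only an illustration: the names `guess_encoding` and `to_characters`, the 1024-byte scan window, the ISO-8859-1 fallback, and the order of the checks are assumptions of this sketch, not behaviour taken from RFC 2070 or from any real browser.

```python
import codecs
import html
import re
from typing import Optional

# Byte order marks that identify some Unicode encodings at the very start of a file.
# The codec names are chosen so that decoding also strips the mark.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Looks for charset=... inside a META element, treating the octets as ASCII.
# This only works while ASCII-valued octets stand for ASCII characters,
# which is the caveat quoted from RFC 2070.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def guess_encoding(octets: bytes, http_charset: Optional[str] = None) -> Optional[str]:
    """Hypothetical helper: pick an external encoding for a document."""
    if http_charset:  # 1. the charset parameter of the HTTP Content-Type field
        return http_charset
    for bom, name in BOMS:  # 2. a byte order mark, if present
        if octets.startswith(bom):
            return name
    match = META_CHARSET.search(octets[:1024])  # 3. a META declaration near the top
    if match:
        return match.group(1).decode("ascii")
    return None  # nothing declared: the caller has to fall back or guess


def to_characters(octets: bytes, http_charset: Optional[str] = None,
                  fallback: str = "iso-8859-1") -> str:
    """Decode the octets, then resolve character references as Unicode code points."""
    charset = guess_encoding(octets, http_charset) or fallback
    return html.unescape(octets.decode(charset))
```

For example, a document saved as ISO-8859-5 that contains both the literal letter Ж and the reference `&#1046;` yields the same character twice, because the reference is resolved against ISO 10646 and not against ISO-8859-5. The same ASCII-only source saved as UTF-16 is recognized by its leading FF FE (or FE FF) octets.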
Received on Friday, 13 February 2004 06:19:35 UTC