
RE: For review: Character encodings in HTML and CSS

From: CE Whitehead <cewcathar@hotmail.com>
Date: Thu, 11 Feb 2010 14:46:47 -0500
Message-ID: <BLU109-W666A689F8EAE45947877EB34E0@phx.gbl>
To: <xn--mlform-iua@xn--mlform-iua.no>
CC: <ishida@w3.org>, <www-international@w3.org>

> Date: Thu, 11 Feb 2010 10:53:14 +0100
> From: xn--mlform-iua@xn--mlform-iua.no
> To: cewcathar@hotmail.com
> CC: ishida@w3.org; www-international@w3.org
> Subject: RE: For review: Character encodings in HTML and CSS
> CE Whitehead, Wed, 10 Feb 2010 17:20:04 -0500:
> > Also, regarding the Notepad BOM, is there any way to get that thing out 
> > with an escape sequence? Has anyone discovered that?
> > Or maybe I could take it out by re-editing the file in Word at the 
> > very end
> > and then saving it as a UTF-8 text file?
> The NCR for BOM is '&#xfeff;'. One thing is whether it would work. 
> Probably not, because when you use NCRs then you don't indicate any 
> encoding. But anyhow: if you try to validate such a document, then you 
> will see that it is not valid to type '&#xfeff;' (or any other NCR) 
> before the !DOCTYPE declaration. 
O.k., I forgot that this would sometimes make my files display in 'quirks' mode;
but the BOM is not valid before the !DOCTYPE declaration either . . . 
. . .
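
For anyone else fighting the Notepad BOM: the UTF-8 BOM is just the three bytes EF BB BF at the very start of the file, so it can be dropped at the byte level rather than with an escape sequence. A minimal sketch, assuming Python is at hand and the file really is UTF-8 (the helper name strip_utf8_bom is made up for illustration):

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Simulate a file saved by an editor that prepends a BOM.
sample = codecs.BOM_UTF8 + "<!DOCTYPE html>".encode("utf-8")
cleaned = strip_utf8_bom(sample)
```

Decoding with Python's "utf-8-sig" codec should have the same effect when the goal is only to read the text, since that codec skips a leading BOM.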
> > 
> > Can one declare all character sets used in a document in the http header?
> Did you mean "any" and not "all"? Did you mean "charset" (singular) and 
> not "character sets"? 
> An HTML file can only declare one encoding - referred to in HTML code 
> and HTTP headers as "charset". When you use the META element to define 
> the encoding/charset (or "encoding char(aracter )set", as I would call 
> it), then you are in fact using HTTP vocabulary directly in HTML - note 
> the term http-equiv:
> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> 
> So, yes, HTTP can declare any encoding charset that HTML documents 
> could possibly have - which is only one per document. (Note that HTML5 
> proposes "<meta charset="utf-8">" as a less HTTP-ish way to define the 
> encoding charset - see Richard's article ...)
> Richard, perhaps you should point out, if you haven't done so already, 
> that an HTML/XML document only has one encoding.
> -- 
> leif halvard silli
I understand one charset per document myself (but you are right, Leif; it might be good to point this out in the draft, so long as just this one sentence is added, as the draft is getting a bit long).
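
To illustrate the "one charset per document" point: a Content-Type value, whether sent as an HTTP header or written in a meta http-equiv element, carries a single charset parameter. A rough sketch of reading it with Python's standard library (the helper name charset_from_content_type is hypothetical, and this is only one way to do it):

```python
from email.message import Message

def charset_from_content_type(value: str):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

declared = charset_from_content_type("text/html; charset=utf-8")
missing = charset_from_content_type("text/html")
```

Here declared should come back as the single declared charset and missing as None, since the second value has no charset parameter at all.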

However, according to
it's possible to set the header simultaneously for several documents
which might have very different character encodings (I mean charsets here).



When I test my HTTP header declarations, I get the following:


"Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7[CRLF]"


That's two charsets, right?
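
As far as I can tell from the HTTP specification, Accept-Charset is a request header: it lists the charsets the browser is willing to receive, each with an optional q weight, and it does not declare the encoding of any document. A small sketch of how that value breaks down, with the trailing [CRLF] removed (the function name parse_accept_charset is made up for illustration, and it assumes well-formed q values):

```python
def parse_accept_charset(value):
    """Split an Accept-Charset value into (charset, q) pairs; q defaults to 1.0."""
    entries = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            key, _, val = param.strip().partition("=")
            if key.strip().lower() == "q" and val:
                q = float(val)
        entries.append((name.strip(), q))
    return entries

prefs = parse_accept_charset("ISO-8859-1,UTF-8;q=0.7,*;q=0.7")
```

Read this way, the header names ISO-8859-1 with the default weight, then UTF-8 and the wildcard * at 0.7 each: a list of preferences on the client side, not a declaration for the page.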

Elsewhere, I've seen it recommended that I encode my documents as ANSI, use only the Latin-1 character set (ISO 8859-1) with no escapes (assuming I can do this), and then declare my encoding as utf-8 anyway.  Will this work out o.k.? It would certainly eliminate the BOM in my files!  (I can give the source if you think this is a practice to recommend.)
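
A quick check of what that practice would mean at the byte level. This is a sketch under the assumption that "ANSI" here behaves like ISO 8859-1 (on Windows it is usually Windows-1252, which differs in a few positions); the sample strings are mine:

```python
ascii_text = "plain text"
accented = "caf\u00e9"  # "cafe" with an acute accent on the e

# Pure ASCII text is byte-for-byte identical in Latin-1 and UTF-8.
ascii_same = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# Characters above U+007F are one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")

# Latin-1 bytes labelled as UTF-8 are not valid UTF-8 once an accent appears.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

So the mislabelling looks harmless only while the file stays within ASCII; any accented Latin-1 character would be misread by a browser that trusts the utf-8 declaration.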

(But good news: my online text editor has finally been upgraded. The Cyrillic characters at least come out now; Arabic characters do not, however, although the HTTP header looks the same in both cases when I test it online, and both documents have the utf-8 charset declared in the meta declaration!)
(By the way I found this document very very helpful [albeit long] because of its links to the other documents, including http://www.w3.org/International/questions/qa-headers-charset which provides links that let me see how my headers are set when I don't control the server settings; I've got to see if it provides helpful information about escape sequences.)
C. E. Whitehead

Received on Thursday, 11 February 2010 19:47:20 UTC
