
Re: For review: Character encodings in HTML and CSS

From: John Cowan <cowan@ccil.org>
Date: Tue, 9 Feb 2010 16:45:59 -0500
To: Richard Ishida <ishida@w3.org>
Cc: www-international@w3.org
Message-ID: <20100209214558.GI29893@mercury.ccil.org>
Richard Ishida wrote:

> Comments are being sought on this article prior to final release. Please
> send any comments to this list (www-international@w3.org). We expect
> to publish a final version in one to two weeks.

I'd avoid the term "character set" altogether in favor of "character
repertoire".

I'd add that character encodings are sometimes called "charsets".

Unfortunately we are stuck with the SGML term "document character set",
though "document coded character set" would be more correct.

You could add that coded character sets are sometimes called "code pages".

Since this is a tutorial, I would leave out UTF-32 altogether.
Nobody uses UTF-32 on the web.

Third graf of "The Document Character Set": for "and a subset" read
"and represents a subset".

In the first sentence of "Character escapes", for "an way" read "a way",
for "the the" read "the", and omit the comma.  In the second graf,
for "representing" read "directly representing".  In the third graf,
add comma after "then", or else remove comma after "CSS" (either is fine).

For "ie." read "i.e.", and for "eg." read "e.g." throughout.

In "Consider using a Unicode encoding", note that plain ASCII files are
already UTF-8.
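
A quick way to convince a reader of this, as a minimal sketch (the sample
string and variable names are made up for illustration): encode an
ASCII-only string both ways and compare the bytes.

```python
# Any text restricted to ASCII produces identical bytes under both encodings,
# so an existing ASCII file can be declared UTF-8 without converting it.
text = "Plain ASCII text, no accents."

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes   # byte-for-byte comparison
decoded = ascii_bytes.decode("utf-8")    # ASCII bytes read back as UTF-8
```

The reverse does not hold: as soon as the text contains a non-ASCII
character, the UTF-8 form uses bytes that are not valid ASCII.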

"You may not have set the declarations that come with the HTTP header"
doesn't make sense to me.

In "Character encoding names", per above, for "not the character sets"
read "not the character repertoires or coded character sets".

For "MIME type" read "media type", or on the first use "MIME media type".

For "as if it was HTML" read "as HTML".

For "W3C standards interpretation" read "interpretation according to
W3C standards", to avoid the misreading "W3C standard interpretation"
(meaning the standard interpretation of the W3C, whatever that is).

For "you get quirks" read "you get quirks mode".

For "a small number of encodings" read "a few encodings".

In "The XML declaration", note that if anything (even whitespace)
precedes the XML declaration, it will not be recognized as such.
I don't know what "(or XML protocol)" means; is that an error for "(or
XML processing instruction)"?  In any case, it should be left out.
"XML declaration" is the only standardized term, and XML declarations are
not processing instructions in XML.
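
The point about leading whitespace can be illustrated with a small sketch.
It assumes Python's standard xml.etree.ElementTree (backed by expat), which
goes further than ignoring the misplaced declaration and rejects the
document; the helper name and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def parses(data: bytes) -> bool:
    """Return True if the bytes are accepted as a well-formed XML document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# The declaration is the very first thing in the entity: accepted.
clean = b'<?xml version="1.0" encoding="UTF-8"?><doc/>'

# A single space before the declaration: no longer a valid XML declaration.
padded = b" " + clean

clean_ok = parses(clean)
padded_ok = parses(padded)
```

Other processors may report the problem differently, but none should treat
the padded form as a valid XML declaration.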

In the first graf of "The HTML5 meta charset element", omit the comma.

Given the constraints on the charset attribute of a/link/script, I'd
leave it out of a tutorial altogether.

I'd warn against character entity references in XHTML at all.  They are
not interoperable.
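
A sketch of the failure mode, assuming a non-validating XML parser that
does not read the DTD (here Python's standard xml.etree.ElementTree; the
helper name and sample markup are made up): the named reference is
rejected, while the numeric character reference works everywhere.

```python
import xml.etree.ElementTree as ET

def try_parse(markup: str):
    """Return the element's text, or None if the markup is rejected."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None

named = "<p>a&nbsp;b</p>"     # entity defined only in the (X)HTML DTD
numeric = "<p>a&#160;b</p>"   # numeric character reference, needs no DTD

named_result = try_parse(named)
numeric_result = try_parse(numeric)
```

Only the five entities predefined by XML itself (amp, lt, gt, quot, apos)
are safe to use without a DTD.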

-- 
John Cowan   cowan@ccil.org   http://ccil.org/~cowan
I must confess that I have very little notion of what [s. 4 of the British
Trade Marks Act, 1938] is intended to convey, and particularly the sentence
of 253 words, as I make them, which constitutes sub-section 1.  I doubt if
the entire statute book could be successfully searched for a sentence of
equal length which is of more fuliginous obscurity. --MacKinnon LJ, 1940
Received on Tuesday, 9 February 2010 21:46:26 GMT
