RE: For review: Character encodings in HTML and CSS from CE Whitehead on 2010-02-10 (www-international@w3.org from January to March 2010)

From: CE Whitehead <cewcathar@hotmail.com>
Date: Wed, 10 Feb 2010 17:20:04 -0500
To: <ishida@w3.org>, <www-international@w3.org>
Message-ID: <BLU109-W9814DEC9C997BD5FFFABDB34F0@phx.gbl>

Hi!

Richard Ishida scripsit:

> Comments are being sought on this article prior to final release. Please
> send any comments to this list (www-international@w3.org). We expect
> to publish a final version in one to two weeks.

http://www.w3.org/International/tutorials/tutorial-char-enc/temp#Slide0100

I am halfway through this article (I have skimmed the rest and will go through it in a day or so).

I have not been through other people's comments completely either,
but I think only one of them overlaps with my own: one of John's comments, which I have noted below.

Most of my comments are strictly on style.

(KEY

!!! means that a change is essential;

!?! means that a change might be stylistically good but not essential;

??? means that a change is not essential--the issue is British English versus American or something like that.)

* * *

!?!

SECTION: "essential definitions: Unicode: Character sets, coded character sets, and encodings" par 4 1st sentence

"The character encoding reflects the way the coded character set is mapped to bytes for manipulation in a computer."

{COMMENT: word choice; "computer" is too limiting}

=> ?

". . . manipulation by an application"

* * *

!?!

SECTION "essential definitions: Unicode: Character sets, coded character sets, and encodings" par 4 last sentence

"Note how the Tifinagh code points map to three bytes, but the exclamation mark maps to a single byte."

{COMMENT: I think a brief explanation of why some characters take fewer bytes is worth having}

=> { insert? } "which is also mapped in ISO 8859-1"

"Note how the Tifinagh code points map to three bytes, but the exclamation mark, which is also mapped in ISO 8859-1, maps to a single byte."

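To illustrate the byte counts in that sentence, here is a minimal Python sketch. It assumes the article's example is in UTF-8 (the three-byte Tifinagh sequences suggest so), and the function name utf8_length is just a hypothetical helper. Note that in UTF-8 the single-byte characters are those in the ASCII range (U+0000 to U+007F); a character such as e with acute accent (U+00E9), which is in ISO 8859-1 but outside ASCII, takes two bytes, so the inserted explanation may need to refer to ASCII instead.

```python
def utf8_length(ch):
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# U+2D30 TIFINAGH LETTER YA: outside ASCII and Latin-1, three bytes in UTF-8.
tifinagh_ya = "\u2d30"
# U+0021 EXCLAMATION MARK: in the ASCII range, one byte in UTF-8.
exclamation = "!"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: in ISO 8859-1 but not ASCII.
e_acute = "\u00e9"

for ch in (tifinagh_ya, exclamation, e_acute):
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s) in UTF-8")
```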
* * *

!!!

SECTION "One character set, multiple encodings" par 7 1st sentence

"In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent that character in a particular character encoding."

{COMMENT: NUMBER AGREEMENT ERROR
changing

"that character" => "each character"
makes clear that we have been talking about multiple characters; this is needed because your first reference is to "characters," not "character"}

thus

"In the following chart, the first line of numbers represents the position of the characters in the Unicode coded character set. The other lines show the byte values used to represent each character in a particular character encoding."

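For readers who want to see what such a chart contains, the following Python sketch builds one: the first row gives the Unicode code point of each character, and each further row gives the bytes for each character in one encoding. The function name encoding_chart and the choice of encodings are my own assumptions for illustration, not taken from the article.

```python
def encoding_chart(text, encodings=("utf-8", "utf-16-be", "iso-8859-1")):
    """Map each row label to a list with one cell per character in text."""
    rows = {"code points": [f"U+{ord(ch):04X}" for ch in text]}
    for name in encodings:
        cells = []
        for ch in text:
            try:
                # Hex-format the bytes this encoding uses for the character.
                cells.append(" ".join(f"{b:02X}" for b in ch.encode(name)))
            except UnicodeEncodeError:
                # The character is not in this encoding's repertoire.
                cells.append("n/a")
        rows[name] = cells
    return rows


# One Tifinagh letter, one accented Latin letter, one ASCII character.
for label, cells in encoding_chart("\u2d30\u00e9!").items():
    print(f"{label:12}", " | ".join(cells))
```

The "n/a" cell appears wherever an encoding cannot represent the character at all, which is the case for the Tifinagh letter in ISO 8859-1.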
* * *
???

SECTION "Choosing and applying an encoding: Consider using a Unicode encoding" par 2

"A Unicode encoding also allows many more languages to be mixed on a single page than almost any other choice."

{ COMMENT AWK?? -- NEEDS MORE SPECIFIC WORDING?? also I'm not sure about the ellipsis of the verb "allows" at the end --

that ellipsis could cause trouble in some cases, though here I think the meaning is pretty clear }

"A Unicode encoding allows many more languages to be mixed on a single page than do most other encodings."

* * *

CONTENT ONLY

SECTION "Choosing and applying an encoding: Consider using a Unicode encoding" par 3

"Any barriers to using Unicode are very low these days. In fact the HTML5 specification draft currently says "Authors are encouraged to use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents."

{COMMENT ON CONTENT

I think the text editors at the various online hosting sites I use still default to ASCII--do you want to note this here??}

* * *
!!!
SECTION "Choosing and applying an encoding: Consider using a Unicode encoding" par 4 (which is a note)

"(Note that support for a given encoding, especially one like Unicode, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display.)"

{COMMENT: I think you omit parens around other notes in this document; your style should be consistent throughout;
thus I'd align this comment/note at the end of a section with the style of other notes/comments by removing the parens here;
see for example, "Character encoding names" in "how to declare a character encoding (summary)"}

"Note that support for a given encoding, especially one like Unicode, does not necessarily imply that a user agent will correctly display the text. Numerous scripts, such as Arabic and Indic, require additional rules to transform the character sequence in memory to an appropriate sequence of font glyphs for display."

* * *
???

SECTION: "XHTML treated as HTML" par 1

"If your content is written using an XHTML 1.0 or XHTML 1.1 doctype but sent to a browser using the text/html MIME type, use a Content-Type meta element to declare the encoding, as a minimum."

{COMMENT: slightly awkward because of word order ???}

=> ? "as a minimum, use a Content-Type . . ."

* * *

!?!
SECTION "how to declare a character encoding (summary)"

"Here we present a quick summary of how to declare character encodings in the following formats:

". . .

"If you don't understand the summary advice, follow the links provided to sections lower down the page which provide examples and explanations.

"No matter what format your content is in, you should also read the sections on Character encoding names and HTTP just below."

{ COMMENTS: (1), the last par could be more specific
by saying that you need to "read the sections that apply to how your content is formatted and/or served;"

(2), there is no clear antecedent for the reference to the section, "character encoding names," in the last paragraph--
this needs to be introduced more clearly a bit earlier ???
I solved (2) by inserting "More details are provided in the sections that follow this summary"}

"Here we present a quick summary of how to declare character encodings in the following formats:

". . .

"More details are provided in the sections that follow this summary: if you don't understand the summary advice, follow the links provided to sections lower down the page which provide examples and explanations.

"You should read the sections that apply to how your content is formatted and/or served, and in all cases, should read the sections on Character encoding names and HTTP just below."

* * *

???

SECTION "serving xhtml: XHTML & MIME types" par 3

"Things are not so straightforward when dealing with XHTML, which is an XML-based markup language. XML has a slightly different syntax to HTML"

=> ??

"Things are not so straightforward when dealing with XHTML, which is an XML-based markup language. The syntax of XML is slightly different from that of HTML"

{ COMMENT "different to" is distinctly British English; not sure if you want to be that distinctly British; "different to" would be ungrammatical albeit understandable in American English }

* * *

??? (John had a comment on the style here too)

SECTION "serving xhtml: XHTML & MIME types" par 6

"Many developers prefer to use XHTML because of the advantages XML brings for editing or processing of documents. However, because of the lack of support for displaying XML files in mainstream browsers, many XHTML files are actually served using the text/html MIME type. In this case, the user agent will read the file as if it was HTML."

{ COMMENT: I feel that the subjunctive is better here, but again I speak American English and you all don't like the subjunctive as much in British English;
thus I'd say "as if it were" instead of "as if it was" }

=> ?

{ or can we get around this and say:

"will treat the file like an html file"???}

=>?

{ John suggested however,
"as if it was HTML" =>"as HTML".
}
* * *

SECTION on quirks

John's suggestion is good:

"you get quirks" => "you get quirks mode".

* * *
!!!

SECTION "What is the HTTP header?" par 3, sentence 2

"If you are serving static files, this information can be associated with the files by the server. "

{ COMMENT: "by the server" acts rather like a 'dangling modifier'--it really modifies "can be associated with" but sits far from it, after "files";
for clarity I'd try to move it; also a sentence like this is more readable in the active voice, since you do have an agent of sorts--the server!}

"If you are serving static files, the server can associate this information with the files."

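As a rough illustration of what "the server can associate this information with the files" means, here is a minimal Python sketch that picks a Content-Type header, including the charset parameter, from a file's extension. The function content_type_header and the MEDIA_TYPES table are hypothetical; real servers usually do this through configuration, not code like this.

```python
import os

# Hypothetical extension-to-media-type table for a few static file types.
MEDIA_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".xhtml": "application/xhtml+xml",
}


def content_type_header(filename, charset="utf-8"):
    """Build the Content-Type header line a server might send for a file."""
    ext = os.path.splitext(filename)[1].lower()
    media_type = MEDIA_TYPES.get(ext)
    if media_type is None:
        # Unknown type: send it as opaque bytes, with no charset parameter.
        return "Content-Type: application/octet-stream"
    return f"Content-Type: {media_type}; charset={charset}"


print(content_type_header("index.html"))
print(content_type_header("style.css", charset="iso-8859-1"))
```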
* * *

CONTENT COMMENTS ONLY

You should mention that not only the Notepad editor but also the WordPad editor inserts a BOM.

Also, as noted above, the online text editors I use support only ASCII, or perhaps only the Latin-1 repertoire if they have upgraded--I will check & let you know in my next email;

so no, I can't get the e with grave accent to display properly.

Also, regarding the Notepad BOM, is there any way to get that thing out with an escape sequence? Has anyone discovered one?
Or maybe I could take it out by re-editing the file in Word at the very end
and then saving it as a UTF-8 text file??
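
One possible way to remove the BOM outside the editor is a small script; the Python sketch below is only an illustration under the assumption that the file is UTF-8, and the function name strip_utf8_bom is hypothetical. It removes the three BOM bytes (EF BB BF) from the start of the data if they are present and leaves everything else untouched.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file as Notepad might save it: BOM followed by the UTF-8 text.
saved = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
print(strip_utf8_bom(saved))
```

To apply this to a real file, one would read it in binary mode, pass the bytes through the function, and write the result back in binary mode.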
* * *
OUT of CURIOSITY

Can one declare all character sets used in a document in the HTTP header?

* * *

Best,

C. E. Whitehead
cewcathar@hotmail.com

Received on Wednesday, 10 February 2010 22:20:37 UTC