RE: For review: Character encodings in HTML and CSS

Notes below…

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/



From: CE Whitehead [mailto:cewcathar@hotmail.com] 
Sent: 10 February 2010 22:20
To: ishida@w3.org; www-international@w3.org
Subject: RE: For review: Character encodings in HTML and CSS

Hi!
 
Richard Ishida scripsit:
> Comments are being sought on this article prior to final release. Please
> send any comments to this list (www-international@w3.org). We expect
> to publish a final version in one to two weeks.
http://www.w3.org/International/tutorials/tutorial-char-enc/temp#Slide0100

I am halfway through this article (I have skimmed the rest and will go through
it in a day or so).
I have not been through other people's comments completely either,
but I think only one of them, one of John's, intersects with my almost strictly
stylistic comments, and I've made a note of it.
Most of my comments are on style.
(KEY
!!! means that a change is essential;
!?! means that a change might be stylistically good but not essential;
??? means that a change is not essential--the issue is British English
versus American or something like that.)
* * *
!?! 
SECTION:  "essential definitions:  Unicode:  Character sets, coded character
sets, and encodings" par 4, first sentence
 
"The character encoding reflects the way the coded character set is mapped
to bytes for manipulation in a computer."

{COMMENT:  word choice; "computer" is too limiting}

=> ?

". . . manipulation by an application" 

I think this is even more restrictive.
 
* * *
!?!

SECTION "essential definitions:  Unicode:  Character sets, coded character
sets, and encodings" par 4 last sentence

"Note how the Tifinagh code points map to three bytes, but the exclamation
mark maps to a single byte."
 
{COMMENT I think a little explanation for why some characters have fewer
bytes is worth having}

=> { insert? } "which is also mapped in ISO 8859-1"
 
"Note how the Tifinagh code points map to three bytes, but the exclamation
mark, which is also mapped in ISO 8859-1, maps to a single byte."

I don't want to bring other encodings in at this point. It is enough that
there are different numbers of bytes per character in utf-8 at this point.
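
For anyone who wants to check the byte counts, here is a minimal Python sketch. The sample Tifinagh letters are picked only for illustration and are not necessarily the ones shown in the article; the helper name utf8_length is likewise made up for this example.

```python
# Compare how many bytes UTF-8 needs for Tifinagh letters and for "!".
# The sample characters are an illustrative choice, not taken from the article.
samples = ["\u2D63", "\u2D53", "!"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes used to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({utf8_length(ch)} bytes)")
```

The Tifinagh letters sit in the range that UTF-8 encodes with three bytes, while the exclamation mark is in the ASCII range and takes one.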

* * *

!!! 
SECTION "One character set, multiple encodings" par 7, first sentence

"In the following chart, the first line of numbers represents the position
of the characters in the Unicode coded character set. The other lines show
the byte values used to represent that character in a particular character
encoding."
 
{COMMENT:  NUMBER AGREEMENT ERROR
changing it so that
"that character" => "each character" 
lets us indicate that we've been talking about multiple characters--this
must be done because your first reference is to "characters" not
"character"}
 
thus
=>

"In the following chart, the first line of numbers represents the position
of the characters in the Unicode coded character set. The other lines show
the byte values used to represent each character in a particular character
encoding."

Changed, but differently.
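
The idea behind the chart (one code point per character, different bytes per encoding) can be reproduced with a short Python sketch. The characters and the function name byte_values are assumptions made for illustration; they are not the ones used in the article's chart.

```python
def byte_values(text: str, encoding: str) -> list:
    """Return the hex byte values of each character of text in the given encoding."""
    return [ch.encode(encoding).hex(" ") for ch in text]


# Latin a, e with acute accent, Tifinagh yaz: an illustrative selection.
text = "a\u00E9\u2D63"

# First line: positions in the Unicode coded character set.
print("code points:", [f"U+{ord(ch):04X}" for ch in text])

# Other lines: the bytes used for each character in a particular encoding.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:>11}:", byte_values(text, encoding))
```

The big-endian variants are used so that no byte order mark is added to the output of each call.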

* * *
??? 
SECTION "Choosing and applying an encoding:  Consider using a Unicode
encoding" par 2

"A Unicode encoding also allows many more languages to be mixed on a single
page than almost any other choice."
 
{ COMMENT AWK?? -- NEEDS MORE SPECIFIC WORDING??  also I'm not sure about
the ellipsis of the verb "allows" at the end --
that ellipsis could cause trouble in some cases, though here I think the
meaning is pretty clear }
 
=>
 
"A Unicode encoding allows many more languages to be mixed on a single page
than do most other encodings."

Added 'of encoding.'
* * *

CONTENT ONLY

SECTION "Choosing and applying an encoding:  Consider using a Unicode
encoding" par 3
 
"Any barriers to using Unicode are very low these days. In fact the HTML5
specification draft currently says "Authors are encouraged to use UTF-8.
Conformance checkers may advise authors against using legacy encodings.
Authoring tools should default to using UTF-8 for newly-created documents."
 
{COMMENT ON CONTENT

I think the text editors at the various online hosting sites I use still
default to ASCII. Do you want to note this here?}

No.

* * *
!!! 
SECTION "Choosing and applying an encoding:  Consider using a Unicode
encoding" par 4 (which is a note)
 
"(Note that support for a given encoding, especially one like Unicode, does
not necessarily imply that a user agent will correctly display the text.
Numerous scripts, such as Arabic and Indic, require additional rules to
transform the character sequence in memory to an appropriate sequence of
font glyphs for display.)"
 
{COMMENT: I think you omit parens around other notes in this document; your
style should be consistent throughout;
thus I'd align this comment/note at the end of a section with the style of
other notes/comments by removing the parens here;
see for example, "Character encoding names" in "how to declare a character
encoding (summary)"}
 
=>

"Note that support for a given encoding, especially one like Unicode, does
not necessarily imply that a user agent will correctly display the text.
Numerous scripts, such as Arabic and Indic, require additional rules to
transform the character sequence in memory to an appropriate sequence of
font glyphs for display."
 
Changed to a sidenote.

* * *
??? 
SECTION:  "XHTML treated as HTML" par 1

"If your content is written using an XHTML 1.0 or XHTML 1.1 doctype but sent
to a browser using the text/html MIME type, use a Content-Type meta element
to declare the encoding, as a minimum."
 
{COMMENT:  slightly awkward because of word order ???}

=> ?  "as a minimum, use a Content-Type . . ."

'as a minimum' changed to "You may choose to additionally use other
declarations."

* * *
!?! 
SECTION "how to declare a character encoding (summary)"
 
"Here we present a quick summary of how to declare character encodings in
the following formats: 
". . . 
"If you don't understand the summary advice, follow the links provided to
sections lower down the page which provide examples and explanations.
"No matter what format your content is in, you should also read the sections
on Character encoding names and HTTP just below."
 
{ COMMENTS:  (1), the last par could be more specific
by saying that you need to "read the sections that apply to how your content
is formatted and/or served;"
 
(2), there is no clear antecedent for the reference to the section,
"character encoding names," in the last paragraph--
this needs to be introduced more clearly a bit earlier ???
I solved (2) by inserting "More details are provided in the sections that
follow this summary"}
 
=>
 
"Here we present a quick summary of how to declare character encodings in
the following formats: 
". . . 
"More details are provided in the sections that follow this summary:   if
you don't understand the summary advice, follow the links provided to
sections lower down the page which provide examples and explanations.
"You should read the sections that apply to how your content is formatted
and/or served, and in all cases, should read the sections on Character
encoding names and HTTP just below."
* * *

???
SECTION "serving xhtml:  XHTML & MIME types" par 3
 
"Things are not so straightforward when dealing with XHTML, which is an
XML-based markup language. XML has a slightly different syntax to HTML"
 
=> ??

"Things are not so straightforward when dealing with XHTML, which is an
XML-based markup language. The syntax of XML is slightly different from that
of HTML"
 
{ COMMENT "different to" is distinctly British English; not sure if you want
to be that distinctly British; "different to" would be ungrammatical albeit
understandable in American English }

Done.
* * *
??? (John had a comment on the style here too)
 
SECTION "serving xhtml:  XHTML & MIME types" par 6
 
"Many developers prefer to use XHTML because of the advantages XML brings
for editing or processing of documents. However, because of the lack of
support for displaying XML files in mainstream browsers, many XHTML files
are actually served using the text/html MIME type. In this case, the user
agent will read the file as if it was HTML."

{ COMMENT: I feel that the subjunctive is better here, but again I speak
American English and you all don't like the subjunctive as much in British
English;
thus I'd say "as if it were" instead of "as if it was" }
 
Done.
=> ?
 
"Many developers prefer to use XHTML because of the advantages XML brings
for editing or processing of documents. However, because of the lack of
support for displaying XML files in mainstream browsers, many XHTML files
are actually served using the text/html MIME type. In this case, the user
agent will read the file as if it were HTML."

{ or can we get around this and say:
"will treat the file like an html file"???}
 
=>?
 
"Many developers prefer to use XHTML because of the advantages XML brings
for editing or processing of documents. However, because of the lack of
support for displaying XML files in mainstream browsers, many XHTML files
are actually served using the text/html MIME type. In this case, the user
agent will treat the file like an HTML file."

{ John suggested however,
"as if it was HTML" =>"as HTML".
}
* * *
SECTION  on quirks
 
John's suggestion is good:
 "you get quirks" => "you get quirks mode".

* * *
!!! 
SECTION "What is the HTTP header?"  par 3, sentence 2

"If you are serving static files, this information can be associated with
the files by the server. "
 
{ COMMENT:  "by the server" sort of acts like a 'dangling modifier'--it
really modifies "can be associated with" but is way over by files;
for clarity I'd try to move it; also a sentence like this is always better
and more readable active since you do have an agent of sorts--the server!}
 
=>

"If you are serving static files, the server can associate this information
with the files."

Done.
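
As a rough illustration of what "the server can associate this information with the files" amounts to, here is a Python sketch. The STATIC_TYPES table and both function names are hypothetical stand-ins for a real server's configuration, not any particular server's API; only email.message.Message from the standard library is real.

```python
from email.message import Message

# Hypothetical per-extension settings, standing in for a server's configuration.
STATIC_TYPES = {
    ".html": ("text/html", "utf-8"),
    ".css": ("text/css", "utf-8"),
}


def content_type_for(filename: str) -> str:
    """Build the Content-Type header value a server might send for a static file."""
    for extension, (mime_type, charset) in STATIC_TYPES.items():
        if filename.endswith(extension):
            return f"{mime_type}; charset={charset}"
    return "application/octet-stream"  # no encoding information for unknown types


def charset_of(header_value: str):
    """Extract the charset parameter from a Content-Type value, or None if absent."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


header = content_type_for("index.html")
print("Content-Type:", header)
print("declared encoding:", charset_of(header))
```

The second function shows the receiving side: the encoding travels as the charset parameter of the Content-Type header.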
 	
* * *
CONTENT COMMENTS ONLY
 
You should mention that not only the Notepad editor inserts a BOM but also the
WordPad editor.
Also, as noted above, online text editors support only the Latin-1 repertoire
or only ASCII (maybe they've upgraded to Latin-1; I will check and
let you know in my next email),
so no, I can't get the e with grave accent to display properly.
Also, regarding the Notepad BOM, is there any way to get it out with
an escape sequence? Has anyone discovered that?
Or maybe I could take it out by re-editing the file in Word at the very
end and then saving it as a UTF-8 text file?
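
On the BOM question, one possible route is to strip it with a script rather than with an editor. The following is a minimal sketch, assuming Python is available; the function name strip_utf8_bom is made up for this example, and nothing here comes from the article itself.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Example: bytes as an editor that adds a BOM might have saved them.
saved = codecs.BOM_UTF8 + "<p>\u00E8</p>".encode("utf-8")
cleaned = strip_utf8_bom(saved)
print(cleaned.decode("utf-8"))

# Alternatively, the "utf-8-sig" codec drops a leading BOM while decoding.
print(saved.decode("utf-8-sig"))
```

To apply this to a file, read it in binary mode, pass the bytes through the function, and write the result back in binary mode.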
* * *
OUT of CURIOSITY
Can one declare all character sets used in a document in the http header?
* * *
 
Best,
C. E. Whitehead
cewcathar@hotmail.com

Received on Friday, 19 February 2010 08:48:59 UTC