
Comments on Deborah's FAQ (for discussion in telecon)

From: Richard Ishida <ishida@w3.org>
Date: Wed, 5 Oct 2005 17:42:15 +0100
To: "GEO" <public-i18n-geo@w3.org>
Message-Id: <20051005164214.2ED304EF90@homer.w3.org>

From: Deborah Cawkwell [mailto:deborah.cawkwell@bbc.co.uk] 
Sent: 05 October 2005 17:31
To: ishida@w3.org
Subject: Comments on FAQ 

Wed 7 Sept 2005

RE: New article for REVIEW: Upgrading from language-specific legacy encoding
to Unicode encoding


Frank Yung-Fong Tang
Tue 8/23/2005 20:02

I think you should mention not only the charset of the HTML, but also the
issues with CSS and separate JavaScript files. The issue with \ unicode in CSS is quite

DC ACTION: don't know about this. Research &/or ask Frank.
No trickiness mentioned in:

FAQ: CSS character encoding declarations How do I declare the character
encoding inside a CSS (Cascading Style Sheets) style sheet?
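For the research item, one place where backslashes and Unicode meet in CSS is
the escape syntax: a backslash followed by one to six hex digits, optionally
terminated by a single space. Whether this is the issue Frank means is not
certain. A minimal Python sketch, where css_escape is a hypothetical helper
name and the sample rule is invented for illustration:

```python
def css_escape(text: str) -> str:
    """Replace non-ASCII characters with CSS hex escapes.

    Hypothetical helper for illustration only. It does not escape quotes
    or existing backslashes in the input.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # The trailing space terminates the escape so that a following
            # hex digit or space is not swallowed into it.
            out.append("\\{:X} ".format(ord(ch)))
    return "".join(out)


# Invented sample rule: the style sheet stays pure ASCII.
rule = '.note:before { content: "' + css_escape("caf\u00e9") + '"; }'
print(rule)
```

A style sheet written this way reads the same whatever encoding is declared
for it, at the cost of being harder to read in source form.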


Jony Rosenne
Tue 8/23/2005 22:18

I suggest that this article should at least mention the problems of legacy
conversion to Unicode specific to bidi, i.e. visual order vs. logical order.

HTML supports this in two ways:

1) Specifying ISO-8859-8 as the character set indicates visual order, and my
recommendation is to leave these pages alone.

2) If you must upgrade, use the BDO tag.

See http://www.w3.org/TR/html401/struct/dirlang.html#h-8.2.4
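To show what visual vs. logical order means in bytes, here is a rough Python
sketch. It only handles the easy case, a line made up entirely of
right-to-left letters and spaces, where visual order is simply the reverse of
logical order. The function name visual_to_logical is hypothetical, and real
pages with digits, Latin text or punctuation need much more care, which is
part of the reason for leaving such pages alone.

```python
import unicodedata


def visual_to_logical(line: str) -> str:
    """Naive conversion of one visually ordered line to logical order.

    Assumption: the line contains only right-to-left letters and spaces.
    Anything else is rejected rather than guessed at.
    """
    for ch in line:
        if unicodedata.bidirectional(ch) not in ("R", "AL", "WS"):
            raise ValueError("mixed-direction text needs a real bidi algorithm")
    return line[::-1]


logical = "\u05e9\u05dc\u05d5\u05dd"   # Hebrew "shalom" in reading order
visual = "\u05dd\u05d5\u05dc\u05e9"    # same word stored left to right as displayed
print(visual_to_logical(visual) == logical)
```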


Some quick feedback.

 >Modern operating systems support Unicode:

This is in a funny order, in the middle of a section about fonts, without
even a header to set it off.

 >Unicode: the operating system or browser has fonts

Should mention that many programs will use 'fallback' mechanisms; where a
font doesn't have all the glyphs, it will switch fonts.

 >Page weight / download cost is not really an issue: given that a large
proportion of a web page is HTML mark-up, where characters remain 1 byte,

Give example page & sizes in legacy & Unicode.
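A quick way to produce such numbers is to encode the same page both ways and
compare lengths. The Python sketch below uses an invented, very small page
with a short Russian sentence; a real page would carry far more markup, so
the relative difference would be smaller still.

```python
# Invented sample page: ASCII markup around a short Russian sentence.
page = (
    "<html><head><title>Test</title></head>"
    '<body><p class="intro">\u041f\u0440\u0438\u0432\u0435\u0442, '
    "\u043c\u0438\u0440</p></body></html>"
)

legacy = page.encode("koi8_r")  # legacy encoding: 1 byte per character
utf8 = page.encode("utf-8")     # markup stays 1 byte, Cyrillic takes 2

print("legacy bytes:", len(legacy))
print("UTF-8 bytes: ", len(utf8))
print("difference:  ", len(utf8) - len(legacy))
```

The difference equals the number of Cyrillic letters in the text, since each
one costs one extra byte in UTF-8.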

 >Characters that do not fall into the ASCII range, such as Chinese, Arabic,
Russian, may use 2 or even 3 bytes. Chinese encodings already use more than
1 byte per character with legacy encodings, where they use double bytes.

Treat CJK in separate bullet
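The per-character byte counts behind this point can be checked directly. A
small Python sketch, with the sample characters and legacy encodings chosen
for illustration only:

```python
def byte_counts():
    """Return {label: (legacy_bytes, utf8_bytes)} for a few sample characters."""
    samples = {
        "Latin a": ("a", "iso-8859-1"),
        "Russian zhe": ("\u0436", "koi8_r"),
        "Arabic beh": ("\u0628", "iso-8859-6"),
        "Chinese zhong": ("\u4e2d", "gb2312"),
    }
    return {
        label: (len(ch.encode(legacy)), len(ch.encode("utf-8")))
        for label, (ch, legacy) in samples.items()
    }


for label, (legacy_len, utf8_len) in byte_counts().items():
    print(label, "legacy:", legacy_len, "UTF-8:", utf8_len)
```

For these samples Russian and Arabic go from 1 byte to 2, while Chinese goes
from 2 bytes to 3, which is why CJK deserves its own bullet.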

 >rather than with a legacy encoding where the source text is not readable
and uses different characters to point to code points.

 >Server side applications
Server-side applications
[Otherwise it is a side application that has to do with servers]

Suggest passing this by the UTC (unicode@unicode.org) for feedback.



Frank Ellermann
Wed 8/24/2005 13:48

Richard Ishida wrote:

> Comments are being sought on this article

| UTF-16 is often used for the system back-end.

You have "no byte order problem" for UTF-8, so you might add a note about
UTF-16LE vs. UTF-16BE below UTF-16.

And another note that U+10000 etc. needs two UTF-16 "half words" (please
replace with the correct term).
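Both notes can be illustrated in a few lines of Python. The usual term for
the two "half words" is a surrogate pair, i.e. two 16-bit code units. The
sample characters are chosen for illustration only.

```python
import codecs

# Byte order: the same character, two different byte sequences.
le = "A".encode("utf-16-le")   # low byte first
be = "A".encode("utf-16-be")   # high byte first
print(le, be)

# The byte order mark is what lets a reader tell the two apart.
print(codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A character beyond U+FFFF needs two 16-bit code units (a surrogate pair).
supp = "\U00010000".encode("utf-16-be")
units = [supp[i:i + 2].hex() for i in range(0, len(supp), 2)]
print(units)

# UTF-8 has a single byte sequence for the same character, no byte order issue.
print("\U00010000".encode("utf-8"))
```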

| Font display problems:

|    Legacy code pages (eg ISO-8859-1/windows-1252)

That example isn't convincing, use something else, e.g. Latin-2 and

| Page weight / download cost is not really an issue
| the difference between legacy encoding and Unicode encoding is quite 
| negligible.

Maybe s/Unicode/UTF-8/, you're talking about bytes later.

| HTML head, eg, <meta http-equiv="Content-Type" [...]

Maybe add a third example for XML:
 <?xml version="1.1" encoding="utf-8" ?>
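To see why the XML declaration matters, the Python sketch below parses the
same content with matching and mismatching declarations. The helper name
parse_text and the sample content are invented for illustration, and the
sketch uses version="1.0" as an assumption about what the bundled parser
accepts.

```python
import xml.etree.ElementTree as ET


def parse_text(declared: str, actual: str, content: str = "caf\u00e9") -> str:
    """Build a tiny document declaring one encoding, serialise it in another
    (possibly the same), and return the text the parser recovers."""
    doc = f'<?xml version="1.0" encoding="{declared}"?><p>{content}</p>'
    return ET.fromstring(doc.encode(actual)).text


# Declaration matches the bytes: the text survives.
print(parse_text("utf-8", "utf-8"))
print(parse_text("iso-8859-1", "iso-8859-1"))

# Declaration says ISO-8859-1 but the bytes are UTF-8: the text is garbled.
print(parse_text("iso-8859-1", "utf-8"))
```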


Frank Yung-Fong Tang
Thu 8/11/2005 21:23

This is a comment on a related document, but not exactly the one you pointed out.

1. Can you change the example in


The line in the HTTP header typically looks like this:

     Content-Type: text/html; charset=iso-8859-1

to

The line in the HTTP header typically looks like this:

     Content-Type: text/html; charset=UTF-8

I know it is just an example on a different page, but some dumb person
sometimes just likes to copy code from examples. And I think it is nicer to
let those people copy UTF-8 instead of ISO-8859-1, even if either of them is
a bad choice to hard-code.
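On the hard-coding point, one alternative is to build the header from the
encoding actually used to serialise the page. A rough Python sketch, where
content_type_header and charset_of are hypothetical helper names and the
parsing relies on the standard library's email.message module:

```python
from email.message import Message


def content_type_header(media_type: str, encoding: str) -> str:
    """Build the header line from the encoding used for the response body."""
    return f"Content-Type: {media_type}; charset={encoding}"


def charset_of(header_value: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased by the library


encoding = "UTF-8"
body = "<p>caf\u00e9</p>".encode(encoding)  # same name drives both steps
print(content_type_header("text/html", encoding))
print(charset_of("text/html; charset=UTF-8"))
```

Using one variable for both the serialisation and the header keeps the two
from drifting apart.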

2. Also, in http://www.w3.org/International/O-HTTP-charset
"For Java Servlets, use the setContentType  method on the ServletResponse
before obtaining any object (Stream or Writer) used for output, e.g.:
resource.setContentType ("text/html;charset=utf-8"); If you use a Writer,
the Servlet automatically takes care of the conversion from Java Strings to
the encoding selected."

I think this info is only recommended for J2EE 1.3. J2EE 1.4 changes it by
adding the setCharacterEncoding(java.lang.String) method.

In the J2EE 1.4 version of the ServletResponse documentation:
"The charset for the MIME body response can be specified explicitly using
the setCharacterEncoding(java.lang.String) and
setContentType(java.lang.String) methods, or implicitly using the
setLocale(java.util.Locale) method. Explicit specifications take precedence
over implicit specifications. If no charset is specified,
ISO-8859-1 will be used. The setCharacterEncoding, setContentType, or
setLocale method must be called before getWriter and before committing the
response for the character encoding to be used."

You should mention the setCharacterEncoding(java.lang.String) method there
for J2EE 1.4.
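The precedence rules in the quoted text can be modelled in a few lines. The
Python sketch below is a toy model only, not the servlet API: the class
ResponseModel and its method names are hypothetical, and set_locale_charset
stands in for whatever charset a container would derive from setLocale.

```python
class ResponseModel:
    """Toy model of the quoted rules: explicit beats implicit, the default is
    ISO-8859-1, and nothing changes once the writer has been obtained."""

    DEFAULT = "ISO-8859-1"

    def __init__(self):
        self._explicit = None
        self._implicit = None
        self._writer_obtained = False

    def set_character_encoding(self, charset: str) -> None:
        # Explicit specification; ignored after get_writer().
        if not self._writer_obtained:
            self._explicit = charset

    def set_locale_charset(self, charset: str) -> None:
        # Implicit specification (stand-in for setLocale); ignored after get_writer().
        if not self._writer_obtained:
            self._implicit = charset

    def get_writer(self) -> str:
        # Obtaining the writer freezes the encoding.
        self._writer_obtained = True
        return self.character_encoding

    @property
    def character_encoding(self) -> str:
        return self._explicit or self._implicit or self.DEFAULT


resp = ResponseModel()
resp.set_locale_charset("Shift_JIS")
resp.set_character_encoding("UTF-8")
print(resp.get_writer())             # explicit wins over implicit
resp.set_character_encoding("ISO-8859-1")
print(resp.character_encoding)       # too late, unchanged
```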

Richard Ishida wrote on 8/11/2005, 1:09 PM:

 > Title: Changing page encoding to UTF-8
 >
 > Comments are being sought on this article prior to final release.
 > Please send any comments to www-international@w3.org. We expect to
 > publish a final version in one to two weeks.
 > The article aims to answer the question: "How do I change the encoding
 > of my (X)HTML pages to UTF-8?"


Frank Yung-Fong Tang
Thu 8/11/2005 21:31

Since your document title is "FAQ: Changing page encoding to UTF-8 (Draft
for review)" instead of "FAQ: Changing html page encoding to
UTF-8 (Draft for review)", I recommend you also consider the slightly
different case in the WS environment, e.g. the case for WSDL, XML Schema,
UDDI and SOAP in WS-I Basic Profile 1. Please take a look at my study note
in http://people.netscape.com/ftang/paper/WS-I-i18n.htm
for details. If you think the issues with SOAP/XML Schema/UDDI and WSDL may
be too complicated to be mentioned in your document, then I suggest you change
your document title to "FAQ: Changing (x)html page encoding to UTF-8" by
adding "(x)html" to it.


Received on Wednesday, 5 October 2005 16:42:22 UTC
