
Re: New article published: Changing (X)HTML page encoding to UTF-8

From: Tex Texin <tex@xencraft.com>
Date: Sat, 27 Aug 2005 05:24:10 -0700
Message-ID: <43105B6A.54C35E19@xencraft.com>
To: Richard Ishida <ishida@w3.org>
CC: www-international@w3.org

Richard,

Hi. It's nice to see the steady stream of faqs and updates. Good going!

A couple comments:

1) This is rather subjective, but the link "it's useful" would be better
replaced by a crisp sentence or two on the benefit of moving to utf-8. I
also think linking into the middle of the tutorial is disorienting and
unexpected.
It's not the end of the world, but I think a good UI makes you trust links,
and comforts you by giving you what you expect.
Perhaps some type of indication of the nature of the link's target is called
for. (faq, tutorial, article...)

2) The faq should point out a few of the risks and either how to reduce the
risk or where to go to learn more about it.
In particular, a faq on changing encodings should say what to look out for
and how to check that it actually succeeded.
Most of the following are not high probability, but we should warn naive
users to consider the possibility.

Risk a:
When changing the encoding to utf-8, it is critical that the encoding of the
original data be known accurately and precisely. Much of the world's data is
mislabeled: ISO 8859-1 instead of windows-1252, Big5 instead of Big5-HKSCS,
cp936 for GB2312, cp949 for KS C 5601, and so forth (and not just Microsoft
encodings). And many editors will merrily convert the data to utf-8 as if it
were ISO 8859-1 and not the encoding it actually is.
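
To make the mislabeling risk concrete, here is a minimal Python sketch. The
sample bytes are made up for illustration; the point is only that the same
bytes decode to different characters depending on which label you believe.

```python
# Hypothetical sample: the word "quoted" wrapped in curly quotes, as a
# Windows editor might have saved it (0x93 and 0x94 in windows-1252).
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("iso-8859-1")    # what the label claims
as_cp1252 = raw.decode("windows-1252")  # what the data actually is

def describe(text):
    """List the code points of all non-ASCII characters in text."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F]

print(describe(as_latin1))  # ['U+0093', 'U+0094'], C1 control characters
print(describe(as_cp1252))  # ['U+201C', 'U+201D'], curly double quotes
```

Neither decode raises an error, so a conversion that trusts the wrong label
completes without complaint and still produces the wrong characters.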

Risk b: The conversion tables or programs should be up to date. Some
convertors are now seriously out of date.
Unicode has more characters to choose from now, so an old table may map a
legacy character to a poorer choice than is available today.

Risk c: Some old software might generate incorrect utf-8, especially with
respect to surrogates.
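
As a rough illustration of the surrogate problem, the Python sketch below
builds the kind of byte sequence I have in mind (each UTF-16 surrogate
written out as its own 3-byte sequence) and shows that a strict utf-8
decoder rejects it. The helper name is mine, and the character chosen is
arbitrary.

```python
# U+1F600 encoded correctly: one 4-byte UTF-8 sequence.
good = "\U0001F600".encode("utf-8")

# The same character as two separately encoded surrogates (D83D, DE00).
# "surrogatepass" is used here only to reproduce the faulty bytes.
bad = "\ud83d\ude00".encode("utf-8", "surrogatepass")

def is_strict_utf8(data):
    """Return True if data decodes as utf-8 under strict rules."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(len(good), is_strict_utf8(good))  # 4 True
print(len(bad), is_strict_utf8(bad))    # 6 False
```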

Risk d: For some legacy encodings, it might be worth pointing out that a
convertor should generate NFC.
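
A small Python sketch of what generating NFC means in practice, assuming a
convertor that emitted a base letter followed by a combining accent. The
sample string is invented for illustration.

```python
import unicodedata

# Hypothetical convertor output: "e" followed by a combining acute accent.
decomposed = "e\u0301"

# NFC composes the pair into the single precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u00e9")            # True
```

The two strings look the same on screen but compare unequal, which matters
for searching, sorting and string comparison downstream.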

3) A different kind of risk is understanding the type of data being
represented, and whether changing the encoding changes the semantics.

Risk e: URLs
If the document changes the encoding, any URL in the document that contains
a query portion might now be a broken link if the query isn't first put
into an ASCII-compatible form.
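
To show why the query can break, here is a Python sketch that escapes the
same (made-up) query value under the old and the new page encoding. A server
expecting one form will not recognize the other.

```python
from urllib.parse import quote

# Hypothetical query value containing one non-ASCII character ("cafe"
# with an acute accent on the final letter).
value = "caf\u00e9"

escaped_latin1 = quote(value, encoding="iso-8859-1")
escaped_utf8 = quote(value, encoding="utf-8")

print(escaped_latin1)  # caf%E9
print(escaped_utf8)    # caf%C3%A9
```

Writing the escaped form into the link explicitly, before converting the
page, keeps the bytes sent to the server independent of the page encoding.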

Risk f: FORMS, Applications
If the document contains a form, by changing the encoding, the form will
send data to the server in utf-8 rather than the original encoding. The
server application may need to have a corresponding change to take this into
account.
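
A sketch of the server side of this, in Python and with an invented form
field: the body below is what a utf-8 page would submit, parsed once by an
application still assuming ISO 8859-1 and once by an updated one.

```python
from urllib.parse import parse_qs

# Hypothetical form body submitted from a page now served as utf-8;
# the field value is "Zurich" with a u-umlaut (UTF-8 bytes C3 BC).
body = "city=Z%C3%BCrich"

old_app = parse_qs(body, encoding="iso-8859-1")  # server not yet updated
new_app = parse_qs(body, encoding="utf-8")       # server updated

print(old_app["city"][0] == "Z\u00c3\u00bcrich")  # True, two junk characters
print(new_app["city"][0] == "Z\u00fcrich")        # True, the intended name
```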

Risk g: CSS
If the CSS document does not contain an encoding declaration, it can inherit
the encoding of the referring document. Changing the encoding of the (X)HTML
document may require CSS documents it references to also change encoding.
For CSS sheets shared by several documents, this can be a problem unless all
are changed at the same time.

Risk h: Embedded scripts
Is there any PHP, JavaScript, etc. within the document that now needs to
have its code altered?

Given time, I could probably come up with a few more. If others contribute
to the list, it might make a nice separate faq or document on unicode
conversion considerations.

4) QA
The other piece, then, is how you know that the conversion was successful,
beyond the fact that the process seemed to complete.
Is the result valid utf-8? 
Did the characters convert appropriately? (e.g. Did yen sign convert to
backslash or currency sign based on context?) Does the document still have
the same meaning? (Do users think a character has changed?)
Does it still integrate with other applications (e.g. CGI) appropriately?
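
The first two questions can be checked mechanically. Below is a minimal
Python sketch, assuming you still have the original bytes and have settled
on their true encoding; the function name and sample data are mine, and it
says nothing about the last two questions, which need a human.

```python
def check_conversion(original_bytes, source_encoding, converted_bytes):
    """Return a list of problems found; an empty list means checks passed."""
    try:
        converted_text = converted_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid utf-8: {exc}"]

    problems = []
    if "\ufffd" in converted_text:
        problems.append("contains U+FFFD replacement characters")
    # Compare against a fresh decode using the encoding believed correct.
    if converted_text != original_bytes.decode(source_encoding):
        problems.append("text differs from a fresh decode of the original")
    return problems

# Hypothetical data: windows-1252 bytes converted as if ISO 8859-1.
original = b"\x93ok\x94"
wrong = original.decode("iso-8859-1").encode("utf-8")
right = original.decode("windows-1252").encode("utf-8")

print(check_conversion(original, "windows-1252", right))  # []
print(len(check_conversion(original, "windows-1252", wrong)))  # 1
```

The comparison is only as good as the source encoding you pass in, so it
does not remove the need to get Risk a right first.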

Most readers on www-international would intuitively recognize situations
where any of the above risks might be probable and would either not do the
conversion or first address the risk, so the point might seem trite.
But many of the people searching out the faq might not anticipate that
problems can occur, so they should be made clear without scaring them off.

hth
tex


Richard Ishida wrote:
> 
> After incorporating comments from the review phase, the GEO Working Group has published the FAQ-based article:
> 
>         Changing (X)HTML page encoding to UTF-8
>         http://www.w3.org/International/questions/qa-changing-encoding
>         By Richard Ishida, W3C
> 
> Aimed at newcomers to internationalization who want to change the encoding of their (X)HTML pages, this article provides an answer to the question: How do I change the encoding of my (X)HTML pages to UTF-8?

> 

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Saturday, 27 August 2005 12:24:21 GMT
