RE: japanese encoding nightmare

Hi Richard,
That page seems incomplete and potentially dangerous.

1) Simply saying to save as utf-8 ignores the problem of knowing which
encoding you are starting from.
Text is often assumed to be iso-8859-1, big-5 or some other encoding when it
is actually windows-1252, big5-hkscs, or some other variant or encoding.
If the assumed source encoding is wrong, the conversion to utf-8 may produce
the wrong characters and lose data.

The document should make sure users proactively identify the correct
encoding of the page before transcoding.
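
As a rough illustration of why the label cannot be trusted, the Python
sketch below decodes the same bytes under several candidate encodings. The
sample bytes and the helper name decode_candidates are made up for this
example; they are not from the page under discussion.

```python
# Hypothetical sample: bytes from a page labelled iso-8859-1 that was
# actually written in windows-1252 (curly quotes around "cafe" with e-acute).
raw = b"\x93caf\xe9\x94"

def decode_candidates(data, encodings):
    """Decode the same bytes under each candidate encoding for comparison."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None  # the bytes are not valid in this encoding
    return results

candidates = decode_candidates(raw, ["iso-8859-1", "windows-1252", "utf-8"])
for name, text in candidates.items():
    print(name, repr(text))
```

Decoding as iso-8859-1 succeeds but silently turns the quotes into C1
control characters, so a "successful" conversion is no proof that the
assumed source encoding was right.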


2) When converting text or html to utf-8 special consideration needs to be
given to URLs. A URL has 4 parts: scheme, domain, path and query.
Schemes are ASCII and not a problem to convert to utf-8 as they remain
ASCII. Domains and Paths should be convertible to UTF-8.
(They will go through additional conversions to an ASCII form before going
over the wire.)
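
A small Python sketch of those ASCII forms, assuming the usual mechanisms
(IDNA for the domain, percent-encoded UTF-8 for the path). The host and
path below are hypothetical, and Python's built-in "idna" codec implements
the older IDNA 2003 rules.

```python
from urllib.parse import quote

host = "b\u00fccher.example"  # hypothetical host containing u-umlaut
path = "/caf\u00e9"           # hypothetical path containing e-acute

# Domain: converted to its ASCII-compatible (Punycode) form.
ascii_host = host.encode("idna").decode("ascii")

# Path: non-ASCII characters become percent-encoded UTF-8 bytes.
ascii_path = quote(path, encoding="utf-8")

print(ascii_host)  # xn--bcher-kva.example
print(ascii_path)  # /caf%C3%A9
```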

However the query portion of a URL is not necessarily convertible to
Unicode. The query portion represents data that is used as a reference
within some other application pointed to by the remainder of the URL. That
application may require an encoding other than UTF-8 or it may not be
textual.
Conversion to utf-8 may therefore damage the URL.

For example, I might have a cgi and database application based on
iso-8859-1.
The original URL might be the following contrived example (I left off the
scheme http: since it isn't a working url): www.i18nguy.com/?find=café

In a page encoded as iso-8859-1 the e-acute will be represented by a single
byte as 0xE9.
The i18nguy.com cgi and database application will expect to match the byte
0xE9.

If the URL is transcoded to UTF-8, the character e-acute will become two
bytes and represented in the URL by hex encoding as %C3%A9.
The URL will no longer work unless the application is also modified to
expect UTF-8 values.
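
The byte-level difference can be checked with a few lines of Python. This
is only a sketch using the standard library; the variable names are
illustrative.

```python
from urllib.parse import quote

word = "caf\u00e9"  # "cafe" with e-acute

latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")
print(latin1_bytes.hex())  # 636166e9   (e-acute is the single byte E9)
print(utf8_bytes.hex())    # 636166c3a9 (e-acute is the two bytes C3 A9)

# The same character, percent-encoded under each encoding.
legacy_query = quote(word, encoding="iso-8859-1")
utf8_query = quote(word, encoding="utf-8")
print(legacy_query)  # caf%E9
print(utf8_query)    # caf%C3%A9
```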

However, when the x(h)tml page is transcoded to utf-8, the embedded URLs may
be links to applications that we have no control over, and those links may
break.

Therefore a more appropriate recommendation might be to first represent the
query portions of a URL by a hex-encoded form in the original encoding, and
then the page can be converted to utf-8.

E.g. convert www.i18nguy.com/?find=café to www.i18nguy.com/?find=caf%E9
Subsequent transcoding to utf-8 won't change the value %E9.

On the other hand, simply transcoding to utf-8 will give
www.i18nguy.com/?find=caf%C3%A9 which will break the link or reference the
incorrect value in the target application.
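
One way the pre-escaping step might be sketched in Python is shown here.
The function name escape_query is hypothetical, the set of characters left
unescaped is an assumption, and a real tool would also need to find the
URLs inside the markup, which this sketch does not attempt.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def escape_query(url, source_encoding):
    """Percent-encode non-ASCII query characters using the page's ORIGINAL
    encoding, so later transcoding of the page cannot change them."""
    parts = urlsplit(url)
    # Keep query delimiters and existing %XX escapes untouched (assumed set).
    escaped = quote(parts.query, safe="=&%+", encoding=source_encoding)
    return urlunsplit(parts._replace(query=escaped))

original = "www.i18nguy.com/?find=caf\u00e9"
fixed = escape_query(original, "iso-8859-1")
print(fixed)  # www.i18nguy.com/?find=caf%E9

# The escaped URL is pure ASCII, so its bytes are identical in both encodings.
print(fixed.encode("iso-8859-1") == fixed.encode("utf-8"))  # True
```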

====
Haven't we been over this ground before? Perhaps in one of the other
documents. The page should be updated.

tex

-----Original Message-----
From: public-evangelist-request@w3.org
[mailto:public-evangelist-request@w3.org] On Behalf Of Richard Ishida
Sent: Thursday, November 23, 2006 2:16 AM
To: 'Paul Arenson'; public-evangelist@w3.org
Subject: RE: japanese encoding nightmare


Paul, read this and let me know if you still have questions:
 

Changing (X)HTML page encoding to UTF-8
http://www.w3.org/International/questions/qa-changing-encoding
 
RI



============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/


 


________________________________

	From: public-evangelist-request@w3.org
[mailto:public-evangelist-request@w3.org] On Behalf Of Paul Arenson
	Sent: 13 November 2006 01:51
	To: public-evangelist@w3.org
	Cc: Paul Arenson
	Subject: japanese encoding nightmare
	
	
	Hello 

	I came here via
	http://www.webstandards.org/learn/articles/askw3c/dec2002/

	For a long time I have used Mozilla to create (or adapt other) web
pages.


	It has worked. I went back and was surprised that it worked DESPITE
different encodings I inadvertently used.

	But recently I tried to make pages that did NOT work!!!! I am not sure
why. And so I am writing.


	UNSUCCESSFUL EXAMPLE (Looks ok on desktop but not on server)
	http://tokyoprogressive.org/why.html

	CODE
	<meta content="text/html; charset=UTF-8" http-equiv="content-type">


	Here are successful examples from the past:
	- - - - - - - - - - - - - 

	SUCCESSFUL EXAMPLE ONE (JAPANESE COMES OUT RIGHT)
	http://www.tokyoprogressive.org/index/weblog/print/april-entries/

	This was made via EXPRESSION ENGINE

	I note I have both xml:lang and utf-8.
	I also note I am confused about differences between character
encoding and language, but anyway, it works.

	CODE
	<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">
	<head>
	<title>April entries</title> 

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8"
/>

	- - - - - - - - - - - - - 




	SUCCESSFUL EXAMPLE TWO 
	http://tokyoprogressive.org/indexoct2006.html

	THIS WAS MADE BY HAND USING a CSS TEMPLATE.

	I THOUGHT I did this in UTF-8, but no.
	Mozilla even says it is UTF-8, but as you can see the code is
western.
	In other words, why does it work?


	CODE
	<meta http-equiv="Content-Type"
	content="text/html; charset=iso-8859-1">

	- - - - - - - - - - - - - 



	SUCCESSFUL EXAMPLE THREE
	http://tokyoprogressive.org/indexnov2006.html
	Now here is one where I specified utf-8 and it too is ok!

	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


	SUCCESSFUL EXAMPLE FOUR (most bizarre?)
	I even forgot to add the meta tag!!!

	http://tokyoprogressive.org/
	- - - - - - - - - - - - - 



	PROBLEMS STARTED APPEARING WITH NEW PAGES

	EXPERIMENT:

	Method

	Make a page in several encodings
	http://tokyoprogressive.org/a.html
	<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
	<html>
	<head>
	<meta content="text/html; charset=ISO-2022-JP"

	LOOKS OK ONLINE 
	- - - - - - - - - - - - - 

	http://tokyoprogressive.org/b.html
	<meta content="text/html; charset=UTF-8" http-equiv="content-type">
	DOES NOT LOOK OK ONLINE
	- - - - - - - - - - - - - 
	http://tokyoprogressive.org/c.html
	<meta content="text/html; charset=Shift_JIS"
http-equiv="content-type">
	DOES NOT LOOK OK ONLINE
	- - - - - - - - - - - - - 
	http://tokyoprogressive.org/d.html
	<meta content="text/html; charset=EUC-JP" http-equiv="content-type">
	DOES NOT LOOK OK ONLINE
	- - - - - - - - - - - - - 



	CONCLUSION:

	Can anyone tell me what is going on?


	Thanks!


		__/__/__/__/__/__/__/__/__/__/
	Paul Arenson

	EMAIL
	paul@tokyoprogressive.org


	__/__/__/__/__/__/__/__/__/__/

	

Received on Thursday, 23 November 2006 17:04:25 UTC