- From: Tex Texin <tex@yahoo-inc.com>
- Date: Thu, 23 Nov 2006 09:03:55 -0800
- To: "'Richard Ishida'" <ishida@w3.org>, "'Paul Arenson'" <paul@tokyoprogressive.org>, <public-evangelist@w3.org>
Hi Richard,

That page seems incomplete and potentially dangerous.

1) Simply saying to save as utf-8 ignores the problem of knowing which encoding you are starting from. Often text is thought to be iso-8859-1, big-5 or some other encoding and it is actually 1252, big5-hkscs or a variant or different encoding. If the source encoding is incorrect, then the conversion to utf-8 may result in the wrong characters and data loss. The document should make sure users proactively identify the correct encoding of the page before transcoding.

2) When converting text or html to utf-8 special consideration needs to be given to URLs. A URL has 4 parts: scheme, domain, path and query. Schemes are ASCII and not a problem to convert to utf-8 as they remain ASCII. Domains and Paths should be convertible to UTF-8. (They will go thru additional conversions to an ASCII form before going over the wire.) However the query portion of a URL is not necessarily convertible to Unicode. The query portion represents data that is used as a reference within some other application pointed to by the remainder of the URL. That application may require an encoding other than UTF-8 or it may not be textual. Conversion to utf-8 may therefore damage the URL.

For example, I might have a cgi and database application based on iso-8859-1. The original URL might be the following contrived example (I left off the scheme http: since it isn't a working url)

www.i18nguy.com/?find=café

In a page encoded as iso-8859-1 the e-acute will be represented by a single byte as 0xE9. The i18nguy.com cgi and database application will expect to match the byte 0xE9. If the URL is transcoded to UTF-8, the character e-acute will become two bytes and represented in the URL by hex encoding as %C3%A9. The URL will no longer work unless the application is also modified to expect UTF-8 values.
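
The byte-level difference described above can be sketched in Python. This is only an illustration, assuming the standard library's urllib.parse is available; the variable names are made up for the example, and the query value is the contrived one from the example URL.

```python
from urllib.parse import quote, unquote_to_bytes

# The query value from the contrived URL; \u00e9 is e-acute.
query_value = "caf\u00e9"

# Percent-encode the value using the bytes of each character encoding.
latin1_form = quote(query_value, encoding="iso-8859-1")
utf8_form = quote(query_value, encoding="utf-8")

print(latin1_form)  # caf%E9
print(utf8_form)    # caf%C3%A9

# The raw bytes an application receives after undoing the hex encoding.
# An iso-8859-1 based application matching on 0xE9 only finds it in the
# first form; the second carries the two bytes 0xC3 0xA9 instead.
print(unquote_to_bytes(latin1_form))  # b'caf\xe9'
print(unquote_to_bytes(utf8_form))    # b'caf\xc3\xa9'
```

The two hex-encoded strings are plain ASCII, so neither changes if the page that contains them is later transcoded.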
However, when the x(h)tml page is transcoded to utf-8, the embedded URLs may be links to applications that we have no control over and they may be affected. Therefore a more appropriate recommendation might be to first represent the query portions of a URL by a hex-encoded form in the original encoding, and then the page can be converted to utf-8. E.g. convert

www.i18nguy.com/?find=café

to

www.i18nguy.com/?find=caf%E9

Subsequent transcoding to utf-8 won't change the value %E9. On the other hand, simply transcoding to utf-8 will give

www.i18nguy.com/?find=caf%C3%A9

which will break the link or reference the incorrect value in the target application.

====

Haven't we been over this ground before? Perhaps in one of the other documents. The page should be updated.

tex

-----Original Message-----
From: public-evangelist-request@w3.org [mailto:public-evangelist-request@w3.org] On Behalf Of Richard Ishida
Sent: Thursday, November 23, 2006 2:16 AM
To: 'Paul Arenson'; public-evangelist@w3.org
Subject: RE: japanese encoding nightmare

Paul, read this and let me know if you still have questions:

Changing (X)HTML page encoding to UTF-8
http://www.w3.org/International/questions/qa-changing-encoding

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/

________________________________
From: public-evangelist-request@w3.org [mailto:public-evangelist-request@w3.org] On Behalf Of Paul Arenson
Sent: 13 November 2006 01:51
To: public-evangelist@w3.org
Cc: Paul Arenson
Subject: japanese encoding nightmare

Hello

I came here via http://www.webstandards.org/learn/articles/askw3c/dec2002/

For a long time I have used Mozilla to create (or adapt other) web pages. It has worked. I went back and was surprised that it worked DESPITE different encodings I inadvertently used.
But recently I tried to make pages that did NOT work!!!! I am not sure why. And so I am writing.

UNSUCCESSFUL EXAMPLE (Looks ok on desktop but not on server)
http://tokyoprogressive.org/why.html

CODE
<meta content="text/html; charset=UTF-8" http-equiv="content-type">

Here are successful examples from the past:

- - - - - - - - - - - - -
SUCCESSFUL EXAMPLE ONE (JAPANESE COMES OUT RIGHT)
http://www.tokyoprogressive.org/index/weblog/print/april-entries/

This was made via EXPRESSION ENGINE. I note I have both xml:lang and utf-8. I also note I am confused about differences between character encoding and language, but anyway, it works.

CODE
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja" lang="ja">
<head>
<title>April entries</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

- - - - - - - - - - - - -
SUCCESSFUL EXAMPLE TWO
http://tokyoprogressive.org/indexoct2006.html

THIS WAS MADE BY HAND USING a CSS TEMPLATE. I THOUGHT I did this in UTF-8, but no. Mozilla even says it is UTF-8, but as you can see the code is western. In other words, why does it work?

CODE
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

- - - - - - - - - - - - -
SUCCESSFUL EXAMPLE THREE
http://tokyoprogressive.org/indexnov2006.html

Now here is one where I specified utf-8 and it too is ok!

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

SUCCESSFUL EXAMPLE FOUR (most bizarre?)
I even forgot to add the meta tag!!!
http://tokyoprogressive.org/

- - - - - - - - - - - - -
PROBLEMS STARTED APPEARING WITH NEW PAGES

EXPERIMENT:
Method: Make a page in several encodings

http://tokyoprogressive.org/a.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-2022-JP"
LOOKS OK ONLINE

- - - - - - - - - - - - -
http://tokyoprogressive.org/b.html
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
DOES NOT LOOK OK ONLINE

- - - - - - - - - - - - -
http://tokyoprogressive.org/c.html
<meta content="text/html; charset=Shift_JIS" http-equiv="content-type">
DOES NOT LOOK OK ONLINE

- - - - - - - - - - - - -
http://tokyoprogressive.org/d.html
<meta content="text/html; charset=EUC-JP" http-equiv="content-type">
DOES NOT LOOK OK ONLINE

- - - - - - - - - - - - -
CONCLUSION:
Can anyone tell me what is going on?

Thanks!

__/__/__/__/__/__/__/__/__/__/
Paul Arenson
EMAIL paul@tokyoprogressive.org
__/__/__/__/__/__/__/__/__/__/
Received on Thursday, 23 November 2006 17:04:25 UTC