Re: unicode from Jason Pouflis on 1998-06-02 (www-international@w3.org from April to June 1998)

From: Jason Pouflis <pouflis@eisa.net.au>
Date: Wed, 3 Jun 1998 09:39:41 +1000
To: "Erik van der Poel" <erik@netscape.com>, "Aman Choudhary" <aman@asu.edu>
Cc: <www-international@w3.org>
Message-ID: <000d01bd8e7f$b83984d0$0101010a@ehome>

In developing Multilingual DNS, I came across the same problem.
It is solvable, and I am available for hire.
Multilingual Domain Names are also for sale.


The techniques demonstrated here have been tested with
MSIE4 (international English version plus extra language support),
but have not yet been tested on other platforms.

>> I still haven't found out a way to store information, which I retrieve from
>> the internet in unicode and not in ascii, which means that I can get
>> information in practically any language from the internet.


>> What I really want to do-
>>
>> |  < meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=" *">
>> |   * - appropriate ISO code for that language
>> |        <input type = textbox>
>>                         | result (CGI/ASP)
>>                         V
>>             The text box value stored as UNICODE



=== BROWSERS
>>  < meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=" *">
As far as I can tell, IE4 forces the page to be displayed in the specified
charset, if it is available. Any character data submitted by a form is then
sent in that character set, either as raw bytes or as NCRs (numeric character
references of the form &#nnnn;).
However, I have not had an answer from either Microsoft or Netscape on this.
(If anyone can give me a direct email address for the people responsible,
that would be nice.)
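
To illustrate the NCR case, here is a minimal sketch of turning decimal
references of the form &#nnnn; back into characters on the server. It is
written in Python rather than Perl purely for illustration, and the function
name decode_ncrs is hypothetical. It assumes only decimal references appear.

```python
import re

# Matches decimal numeric character references such as "&#26085;".
_NCR = re.compile(r"&#(\d+);")


def decode_ncrs(text: str) -> str:
    """Replace decimal NCRs with the characters they stand for."""

    def _replace(match: re.Match) -> str:
        code_point = int(match.group(1))
        # Leave the reference untouched if it is outside the Unicode range.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return _NCR.sub(_replace, text)
```

Text without any references passes through unchanged, so the function can be
applied to every submitted field.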


=== HTML FORMS
Neither browser, as far as I can tell, sends the character set of the encoded
data, so you should include a hidden field in the form specifying the
character set, e.g.
<input type="hidden" name="LC" value="EN.UTF-8"> or simply
<input type="hidden" name="C" value="Shift-JIS">.
[L = Language, C = Code]
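
On the receiving side, the hidden field can be read first and then used to
decode everything else. The following is a rough sketch in Python (not the
Perl used elsewhere in this message). The function name decode_form and the
Latin-1 fallback are assumptions made for illustration, and it expects the
raw URL-encoded body as a string.

```python
from urllib.parse import parse_qs


def decode_form(body: str, charset_field: str = "C") -> dict:
    """Decode a URL-encoded form body using the charset named in a hidden field."""
    # First pass: Latin-1 maps every byte to a character, so this never fails
    # and is enough to read the ASCII-only charset field.
    probe = parse_qs(body, keep_blank_values=True, encoding="latin-1")
    label = probe.get(charset_field, ["ISO-8859-1"])[0]  # assumed fallback
    # Accept the "LANG.CHARSET" form (e.g. "EN.UTF-8") as well as a bare charset.
    charset = label.split(".", 1)[-1]
    # Second pass: decode the percent-escaped bytes with the declared charset.
    fields = parse_qs(body, keep_blank_values=True,
                      encoding=charset, errors="strict")
    return {name: values[0] for name, values in fields.items()}
```

For the "LC" style of field, call it as decode_form(body, charset_field="LC").
An unknown charset name raises LookupError, and bytes that are invalid in the
declared charset raise UnicodeDecodeError, so bad input is not stored silently.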


=== CGIs - PERL
Use Unicode::Map or Unicode::Map8 to map form data from the native character
set to Unicode, e.g.

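#### code segment {
  use CGI;
  use Unicode::Map;

  my $cgi = new CGI;
  my $X = $cgi->param('X'); # form data, still encoded in the page's charset
  my $C = $cgi->param('C'); # charset from the hidden field, e.g. "Shift-JIS"
  my $Map = new Unicode::Map({ ID => $C })
    or die "No mapping available for charset '$C'";
  my $_16bit = $Map->to_unicode($X); # the same text as 16-bit Unicode
#### } code segment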
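
For comparison, the same mapping step can be sketched with Python's built-in
codecs. This is only an illustration: the function names are hypothetical, and
big-endian 16-bit output is an assumption about the byte order you want to
store.

```python
def to_16bit(raw: bytes, charset: str) -> bytes:
    """Map bytes in a native charset to 16-bit Unicode (big-endian, no BOM)."""
    text = raw.decode(charset)        # native charset -> abstract characters
    return text.encode("utf-16-be")   # characters -> 16-bit code units


def from_16bit(data: bytes, charset: str) -> bytes:
    """Reverse mapping, e.g. for sending stored text back in the page's charset."""
    return data.decode("utf-16-be").encode(charset)
```

Whichever byte order is chosen, it has to be used consistently when reading
the data back.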


=== DB - MySQL
Then escape any data before inserting it into your database, e.g.

#### code segment {
  use Mysql;
  $dbh = Mysql->Connect($host,$database,$password,$user)
    or perror('Cannot contact database server');

  # quote() escapes each value; join() avoids a trailing comma before ')'
  my $query = 'insert into domain values ('
    . join(', ', map { $dbh->quote($Domain{$_}) } @columns)
    . ')';
  my $cursor = $dbh->Query( $query )
    or perror("$Mysql::db_errstr Domain Creation Failed");
#### } code segment
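
An alternative to quoting each value by hand is to let the database driver
bind the values through placeholders. The sketch below shows that technique
in Python with the standard sqlite3 module standing in for MySQL, so it is
not the Mysql module's API. The table layout and the name insert_domain are
assumptions for illustration.

```python
import sqlite3


def insert_domain(conn: sqlite3.Connection, row: dict) -> None:
    """Insert one row, passing the values as bound parameters."""
    columns = list(row)  # column names must come from trusted code, not users
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO domain ({column_list}) VALUES ({placeholders})"
    # The driver handles quotes and non-ASCII text in the values itself.
    conn.execute(sql, [row[name] for name in columns])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE domain (name TEXT, owner TEXT)")
insert_domain(conn, {"name": "\u65e5\u672c.example", "owner": "O'Brien"})
```

Because the values never become part of the SQL text, an apostrophe or a
multi-byte character in the data cannot break the statement.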

>It would be more reliable if
>you indicated the charset in the HTTP Content-Type header. (I'm assuming
>you're using HTTP.)

>...
>echo 'Content-Type: text/html; charset=gb2312'
The charset parameter is optional in HTTP 1.0; HTTP 1.1 requires it for text
in any character set other than ISO-8859-1.
Unfortunately, a lot of communication is stuck at HTTP 1.0,
meaning you will still need to put in the meta tag for content type.
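
Sending both labels from a CGI script could look roughly like this. It is a
Python sketch for illustration only: the function name cgi_response is
hypothetical, and it assumes the whole page can be encoded in the chosen
charset.

```python
def cgi_response(body_html: str, charset: str) -> bytes:
    """Build a CGI response that labels the charset in the header and in a meta tag."""
    content_type = f"text/html; charset={charset}"
    page = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="{content_type}">'
        "</head><body>" + body_html + "</body></html>"
    )
    # The header is plain ASCII; the page itself is encoded in the charset.
    header = f"Content-Type: {content_type}\r\n\r\n"
    return header.encode("ascii") + page.encode(charset)
```

Building both labels from the same variable keeps the header and the meta tag
from drifting apart.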



Cheers,
Jason Pouflis
pouflis@eisa.net.au
e.internet pty ltd
  e.commerce
  e.business
  e.mail

Received on Tuesday, 2 June 1998 19:42:42 UTC