- From: Martin J. Duerst <duerst@w3.org>
- Date: Tue, 20 Jun 2000 14:25:55 +0900
- To: Vinod Balakrishnan <vinod@filemaker.com>, <www-international@w3.org>
Hello Vinod,

At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:

>Hi,
>
>How can we distinguish the UTF-8 characters sequence from a
>Latin-1/Latin-? characters.

First, I think you are speaking about a byte sequence, not a character
sequence. It is quite easy to have a look at a byte sequence and
heuristically decide whether it is UTF-8 or not. Please for example have
a look at http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

>In case of most of the internet application
>UTF16 characters are prefixed by "0xu" and for the UTF8 characters there
>is no prefix to identify those. Do we HAVE/NEED a standard to represent
>UTF8 ?
>
>For example, if the browser send out a http GET request for a non-Roman
>characters with out the header information, the server application will
>not be able to identify the characters whether they are UTF8 or Latin-1.

Do you mean non-ASCII characters in the URIs (or parts of URIs) in the
GET line itself? This is indeed a gray area, but the general tendency is
to move towards UTF-8 only. In cases where both UTF-8 and a single
'legacy encoding' are used, the above heuristics may help.

Regards,   Martin.
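[Editor's note: the heuristic Martin describes — checking whether a byte
sequence conforms to UTF-8's strict lead-byte/continuation-byte structure —
can be sketched roughly as follows. This is a minimal illustration, not
code from the referenced paper; the function name is invented here.]

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristically decide whether a byte sequence is UTF-8.

    Valid UTF-8 has a strict structure (each multi-byte lead byte must
    be followed by the right number of 0x80-0xBF continuation bytes),
    so non-ASCII text in a legacy encoding such as Latin-1 almost never
    validates as UTF-8 by accident.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café" encoded two ways: UTF-8 yields b'caf\xc3\xa9' (valid UTF-8),
# Latin-1 yields b'caf\xe9' (0xE9 is a lead byte with no continuation).
print(looks_like_utf8("café".encode("utf-8")))   # True
print(looks_like_utf8("café".encode("latin-1"))) # False
```

Note the caveat implicit in the exchange above: pure ASCII bytes are valid
in both encodings, so the heuristic only discriminates once non-ASCII
bytes actually appear in the sequence.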
Received on Tuesday, 20 June 2000 02:26:20 UTC