Re: How to distinguish UTF-8 from Latin-* ?

Hello Vinod,

At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:
>Hi,
>
>How can we distinguish a UTF-8 character sequence from
>Latin-1/Latin-? characters?

First, I think you are speaking about a byte sequence, not a
character sequence. It is quite easy to examine a byte sequence
and decide heuristically whether it is UTF-8 or not.
For example, please have a look at
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
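
To illustrate the idea, here is a minimal sketch in Python. It is not
necessarily the method described in the paper; the function name
classify_bytes and its three return labels are my own assumptions. It
relies on the fact that text in a Latin-* encoding containing
non-ASCII bytes only rarely happens to form valid UTF-8 sequences.

```python
def classify_bytes(data: bytes) -> str:
    """Heuristically classify a byte sequence (hypothetical helper).

    Returns "ascii" if there are no bytes >= 0x80 (the encodings cannot
    be told apart, and need not be), "utf-8" if the bytes form valid
    UTF-8, and "other" if they do not (e.g. a Latin-* encoding).
    """
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Python's decoder is strict: it rejects stray continuation
        # bytes, truncated sequences, and overlong forms.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


print(classify_bytes("Zürich".encode("utf-8")))    # valid UTF-8
print(classify_bytes("Zürich".encode("latin-1")))  # lone 0xFC byte
```

This remains a heuristic: a sequence reported as "utf-8" is also a
valid Latin-1 byte sequence, just a very unlikely one in real text.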

>In most internet applications,
>UTF-16 characters are prefixed by "0xu", but for UTF-8 characters there
>is no prefix to identify them. Do we HAVE/NEED a standard to represent
>UTF-8?
>
>For example, if the browser sends out an HTTP GET request for non-Roman
>characters without the header information, the server application will
>not be able to tell whether the characters are UTF-8 or Latin-1.

Do you mean non-ASCII characters in URIs (or parts of URIs) in
the GET line itself? This is indeed a gray area, but the general
tendency is to move towards UTF-8 only. In cases where both
UTF-8 and a single 'legacy encoding' are used, the above heuristics
may help.
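
As a rough sketch of how a server might apply this to a
percent-encoded URI component, assuming Python and assuming that
ISO-8859-1 is the single legacy encoding in use (the function name
decode_uri_component and the legacy parameter are hypothetical):

```python
from urllib.parse import unquote_to_bytes


def decode_uri_component(component: str, legacy: str = "iso-8859-1") -> str:
    """Decode a percent-encoded URI component (hypothetical helper).

    Tries UTF-8 first; if the bytes are not valid UTF-8, falls back to
    the assumed legacy encoding.
    """
    raw = unquote_to_bytes(component)  # "%C3%A9" -> b"\xc3\xa9"
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


# Both spellings of "café" decode to the same string.
print(decode_uri_component("caf%C3%A9"))  # sent as UTF-8
print(decode_uri_component("caf%E9"))     # sent as Latin-1
```

Trying UTF-8 first is the safer order, since every byte sequence is
valid ISO-8859-1 and the reverse test would never fail.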

Regards,   Martin.

Received on Tuesday, 20 June 2000 02:26:20 UTC