Re: How to distinguish UTF-8 from Latin-* ?

From: Vinod Balakrishnan <vinod@filemaker.com>
Date: Tue, 20 Jun 2000 09:55:19 -0700
Message-Id: <200006201651.JAA14524@imap.filemaker.com>
To: "Martin J. Duerst" <duerst@w3.org>, "www-international@w3.org" <www-international@w3.org>


Hi Martin J. Duerst

>Hello Vinod,
>
>At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:
>>Hi,
>>
>>How can we distinguish a UTF-8 character sequence from a
>>Latin-1/Latin-? character sequence?
>
>First, I think you are speaking about a byte sequence, not a
>character sequence. It is quite easy to have a look at a byte
>sequence and heuristically decide whether it is UTF-8 or not.
>Please for example have a look at
>http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
>
We can parse a byte sequence and check whether it is valid UTF-8 or not. But 
a valid UTF-8 sequence can also be a valid Latin-*/Shift-JIS sequence.
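
To illustrate the ambiguity, here is a minimal Python sketch. The helper
name is_valid_utf8 and the sample bytes are only assumptions for the
example; it shows that bytes which pass a strict UTF-8 check also decode
without error as Latin-1, just to different characters.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


# U+00E9 (e with acute accent) encoded as UTF-8 is the two bytes C3 A9.
sample = "\u00e9".encode("utf-8")

# The same two bytes are also a legal Latin-1 string, read as two characters.
as_utf8 = sample.decode("utf-8")      # one character: U+00E9
as_latin1 = sample.decode("latin-1")  # two characters: U+00C3 U+00A9

# The Latin-1 encoding of U+00E9 is the single byte E9, which is not valid UTF-8.
latin1_only = "\u00e9".encode("latin-1")

print(is_valid_utf8(sample), is_valid_utf8(latin1_only))
```

So the check can rule UTF-8 out, but passing it does not prove the sender
meant UTF-8.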
 
>
>
>>In most internet applications, UTF-16 characters are prefixed by "0xu",
>>but for UTF-8 characters there is no prefix to identify them. Do we
>>HAVE/NEED a standard to represent UTF-8?
>>
>>For example, if the browser sends out an HTTP GET request for non-Roman
>>characters without the header information, the server application will
>>not be able to identify whether the characters are UTF-8 or Latin-1.
>
>Do you mean non-ASCII characters in the URIs (or parts of URIs) in
>the GET line itself?
 
Yes.

>This is indeed a gray area, but the general
>tendency is to move towards UTF-8 only. In cases where both
>UTF-8 and a single 'legacy encoding' are used, the above heuristics
>may help.

This is going to be a problem for the European (high ASCII) and CJK cases 
once browsers start sending URLs in both UTF-8 and other traditional 
Latin/CJK encodings. Again, this problem will affect only the script 
systems that use high-ASCII values in their non-Unicode encodings.
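
A server that expects UTF-8 plus one known legacy encoding could apply the
heuristic along these lines. This is only a sketch in Python: the function
name decode_uri_component and the choice of Latin-1 as the fallback are
assumptions for illustration, not anything a standard prescribes.

```python
from urllib.parse import unquote_to_bytes


def decode_uri_component(component: str, legacy_encoding: str = "latin-1") -> str:
    """Percent-decode a URI component, preferring UTF-8 over a legacy encoding.

    legacy_encoding is an assumed, site-specific fallback.
    """
    raw = unquote_to_bytes(component)  # resolve %XX escapes to raw bytes
    try:
        # Well-formed UTF-8 is taken at face value.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the single legacy encoding we expect.
        return raw.decode(legacy_encoding, errors="replace")


# The same word sent by a UTF-8 client and by a Latin-1 client.
from_utf8_client = decode_uri_component("caf%C3%A9")
from_latin1_client = decode_uri_component("caf%E9")
print(from_utf8_client == from_latin1_client)
```

The guess is still wrong whenever a legacy-encoded request happens to be
well-formed UTF-8, which is exactly the case I am worried about.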

This can happen only in the case of HTML (not XML, which recommends Unicode 
encodings), because HTML supports both Latin and Unicode encodings.
>
>Regards,   Martin.
>
>
>
Received on Tuesday, 20 June 2000 12:52:34 GMT
