- From: Vinod Balakrishnan <vinod@filemaker.com>
- Date: Tue, 20 Jun 2000 09:55:19 -0700
- To: "Martin J. Duerst" <duerst@w3.org>, "www-international@w3.org" <www-international@w3.org>
Hi Martin J. Duerst,

>Hello Vinod,
>
>At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:
>>Hi,
>>
>>How can we distinguish the UTF-8 characters sequence from a
>>Latin-1/Latin-? characters.
>
>First, I think you are speaking about a byte sequence, not a
>character sequence. It is quite easy to have a look at a byte
>sequence and heuristically decide whether it is UTF-8 or not.
>Please for example have a look at
>http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
>

We can parse a byte sequence and check whether it is valid UTF-8 or not.
But a valid UTF-8 sequence can also be a valid Latin-*/Shift-JIS sequence,
so passing the check does not prove that the sender meant UTF-8.

>
>
>>In case of most of the internet application
>>UTF16 characters are prefixed by "0xu" and for the UTF8 characters there
>>is no prefix to identify those. Do we HAVE/NEED a standard to represent
>>UTF8 ?
>>
>>For example, if the browser send out a http GET request for a non-Roman
>>characters with out the header information, the server application will
>>not be able to identify the characters whether they are UTF8 or Latin-1.
>
>Do you mean non-ASCII characters in the URIs (or parts of URIs) in
>the GET line itself?

Yes.

>This is indeed a gray area, but the general
>tendency is to move towards UTF-8 only. In cases where both
>UTF-8 and a single 'legacy encoding' are used, the above heuristics
>may help.

This is going to be a problem for the European (high ASCII) and CJK cases
once browsers start sending URLs in both UTF-8 and the traditional
Latin/CJK encodings. The problem affects only those script systems whose
non-Unicode encodings use high ASCII values (bytes above 0x7F).

It can happen only with HTML, which supports both the Latin and the
Unicode encodings, and not with XML, which recommends Unicode encodings.

>
>Regards, Martin.
>
>
>
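The heuristic discussed above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not the method from the paper Martin cites: the function names `guess_uri_encoding` and `legacy` are hypothetical, the server is assumed to expect exactly one legacy encoding (Latin-1 here), and the non-ASCII bytes are assumed to arrive %-escaped in the request URI.

```python
from urllib.parse import unquote_to_bytes


def guess_uri_encoding(escaped: str, legacy: str = "latin-1") -> str:
    """Guess the encoding of the %-escaped bytes in a request URI.

    Hypothetical helper: `legacy` is the single legacy encoding the
    server is assumed to accept besides UTF-8.
    """
    raw = unquote_to_bytes(escaped)  # undo %XX escapes, keep raw bytes
    if all(b < 0x80 for b in raw):
        return "ascii"  # no high bytes, so nothing to disambiguate
    try:
        raw.decode("utf-8")  # strict decode: fails on malformed UTF-8
        return "utf-8"
    except UnicodeDecodeError:
        return legacy  # not well-formed UTF-8, fall back


# "é" sent as UTF-8 (C3 A9) and as Latin-1 (E9)
print(guess_uri_encoding("/caf%C3%A9"))
print(guess_uri_encoding("/caf%E9"))

# The ambiguity raised in the reply: the same two bytes are also
# valid Latin-1, where they stand for two different characters.
print(b"\xc3\xa9".decode("utf-8"), b"\xc3\xa9".decode("latin-1"))
```

The last line shows why the check is a heuristic and not a proof: the bytes C3 A9 decode without error in both encodings. In practice, legacy text that happens to form well-formed UTF-8 is uncommon, so the guess is usually right, but it cannot be guaranteed.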
Received on Tuesday, 20 June 2000 12:52:34 UTC