- From: Martin J. Duerst <duerst@w3.org>
- Date: Tue, 20 Jun 2000 14:25:55 +0900
- To: Vinod Balakrishnan <vinod@filemaker.com>, <www-international@w3.org>
Hello Vinod,

At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:

>Hi,
>
>How can we distinguish the UTF-8 characters sequence from a
>Latin-1/Latin-? characters.

First, I think you are speaking about a byte sequence, not a character
sequence. It is quite easy to have a look at a byte sequence and
heuristically decide whether it is UTF-8 or not. Please for example have
a look at http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

>In case of most of the internet application
>UTF16 characters are prefixed by "0xu" and for the UTF8 characters there
>is no prefix to identify those. Do we HAVE/NEED a standard to represent
>UTF8 ?
>
>For example, if the browser send out a http GET request for a non-Roman
>characters with out the header information, the server application will
>not be able to identify the characters whether they are UTF8 or Latin-1.

Do you mean non-ASCII characters in the URIs (or parts of URIs) in the
GET line itself? This is indeed a gray area, but the general tendency is
to move towards UTF-8 only. In cases where both UTF-8 and a single
'legacy encoding' are used, the above heuristics may help.

Regards,   Martin.
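[Editor's note: the heuristic Martin describes — checking whether a byte
sequence conforms to UTF-8's strict lead-byte/continuation-byte structure —
can be sketched roughly as follows. This is a minimal illustration, not
code from the referenced paper; the function name is invented here.]

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristically decide whether a byte sequence is UTF-8.

    Valid UTF-8 has a strict structure (each multi-byte lead byte must
    be followed by the right number of 0x80-0xBF continuation bytes),
    so non-ASCII text in a legacy encoding such as Latin-1 almost never
    validates as UTF-8 by accident.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "café" encoded two ways: UTF-8 yields b'caf\xc3\xa9' (valid UTF-8),
# Latin-1 yields b'caf\xe9' (0xE9 is a lead byte with no continuation).
print(looks_like_utf8("café".encode("utf-8")))   # True
print(looks_like_utf8("café".encode("latin-1"))) # False
```

Note the caveat implicit in the exchange above: pure ASCII bytes are valid
in both encodings, so the heuristic only discriminates once non-ASCII
bytes actually appear in the sequence.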
Received on Tuesday, 20 June 2000 02:26:20 UTC