Re: How to distinguish UTF-8 from Latin-* ?

At 00/06/20 09:55 -0700, Vinod Balakrishnan wrote:


>Hi Martin J. Duerst
>
> >Hello Vinod,
> >
> >At 00/06/16 14:21 -0700, Vinod Balakrishnan wrote:
> >>Hi,
> >>
> >>How can we distinguish a UTF-8 character sequence from a
> >>Latin-1/Latin-? character sequence?
> >
> >First, I think you are speaking about a byte sequence, not a
> >character sequence. It is quite easy to have a look at a byte
> >sequence and heuristically decide whether it is UTF-8 or not.
> >For example, please have a look at
> >http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf
> >
>We can parse a byte sequence and check whether it is UTF-8 or not. But
>these valid UTF-8 sequences can also be valid Latin-*/Shift-JIS sequences.

Yes, in theory many valid UTF-8 sequences are also valid Latin-*
(iso-8859-*) or other legacy-encoding sequences. But what my paper
shows is that *in practice* such sequences, when read in the legacy
encoding, are with extremely high probability arbitrary combinations
of characters, and not something that makes sense.
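
As a rough illustration of that point (a minimal sketch, not code taken
from the paper; the function name looks_like_utf8 and the sample string
are assumptions made up for this example), a strict UTF-8 decode can
serve as the check, and reading the UTF-8 bytes as Latin-1 shows the
kind of unlikely character combinations that result:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 with at least one non-ASCII byte.

    Hypothetical helper for illustration only. Pure ASCII is reported as
    False because it carries no evidence for either encoding.
    """
    if data.isascii():
        return False
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# The same text in the two encodings under discussion.
latin1_bytes = "Dürst café".encode("latin-1")
utf8_bytes = "Dürst café".encode("utf-8")

print(looks_like_utf8(latin1_bytes))  # False: 0xFC cannot start a UTF-8 sequence
print(looks_like_utf8(utf8_bytes))    # True
print(utf8_bytes.decode("latin-1"))   # DÃ¼rst cafÃ©
```

The last line is the case the question worries about: the bytes are
valid Latin-1 too, but the result is not something a person would
normally type.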


> >>For example, if the browser sends out an HTTP GET request with non-Roman
> >>characters without the header information, the server application will
> >>not be able to identify whether the characters are UTF-8 or Latin-1.
> >
> >Do you mean non-ASCII characters in the URIs (or parts of URIs) in
> >the GET line itself?
>
>Yes..
>
> >This is indeed a gray area, but the general
> >tendency is to move towards UTF-8 only. In cases where both
> >UTF-8 and a single 'legacy encoding' are used, the above heuristics
> >may help.
>
>This is going to be a problem in the European (high ASCII)/CJK cases
>once browsers start sending URLs in both UTF-8 and other
>traditional Latin/CJK encodings. Again, this problem will affect only the
>script systems which use high ASCII values in their non-Unicode encodings.
>
>This can happen only in the case of HTML (not in XML, which recommends
>Unicode encodings), because it supports both Latin and Unicode encodings.

Yes, and HTML has some help for it. First, it defines the
'accept-charset' attribute for forms:
http://www.w3.org/TR/REC-html40/interact/forms.html#adef-accept-charset
This is unfortunately not supported by that many browsers, but it
allows a form to indicate that the server understands UTF-8.

Second, HTML says that form responses should be sent back in the encoding
of the page itself. This is supported by most newer browsers.
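
On the server side, the "UTF-8 plus a single legacy encoding" situation
could be handled roughly as follows. This is only a sketch under stated
assumptions: the function name decode_uri_component is hypothetical,
and iso-8859-1 stands in for whichever legacy encoding the pages
actually use.

```python
from urllib.parse import unquote_to_bytes


def decode_uri_component(component: str, legacy_encoding: str = "iso-8859-1") -> str:
    """Percent-decode a URI component, preferring UTF-8 over a legacy encoding.

    Hypothetical helper: legacy_encoding is an assumption and should be
    the one legacy encoding the server's pages are known to use.
    """
    raw = unquote_to_bytes(component)  # undo %XX escapes, keep raw bytes
    try:
        return raw.decode("utf-8")  # strict: fails on malformed UTF-8
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding)


print(decode_uri_component("D%C3%BCrst"))  # Dürst (sent as UTF-8)
print(decode_uri_component("D%FCrst"))     # Dürst (sent as iso-8859-1)
```

Trying UTF-8 first is the safer order, because legacy text almost never
happens to be well-formed UTF-8, while any byte sequence at all decodes
as iso-8859-1.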

This does not cover all edge cases, but seems to work most of the time.
W3C is working on a new generation of forms, where we will try to make
sure that this problem is properly addressed. See
http://www.w3.org/MarkUp/Forms/.


Regards,   Martin.

Received on Wednesday, 21 June 2000 01:01:29 UTC