Re: languages to encodings associations . . . from Martin J. Dürst on 2019-09-27 (www-international@w3.org from July to September 2019)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 27 Sep 2019 11:09:53 +0000
To: Albretch Mueller <lbrtchx@gmail.com>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <2ab24b54-0d40-1b8d-93a1-a13d87e2df63@it.aoyama.ac.jp>

On 2019/09/27 19:30, Albretch Mueller wrote:
> On 9/27/19, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>> I have seen attempts at creating such lists in the 1990s.
> 
>   Could you please point me in the direction to that prior art?

I said *attempts*.

>> Maybe you can clarify.
> 
>   When you do corpora research, you can't impose your will or
> understanding on the people that authored certain data in whichever
> way they chose to, but you can automatically decode and reencode (to
> UTF-8 if you choose to) text.

Yes. What's important in such a case is not which encoding may be suited 
for which language, but which encoding was actually used.

And the modern approach for such a problem would be to try decoding the 
data with each candidate encoding, and then check whether the result fits 
a probabilistic model of the language in question. But in some cases, you 
may not even know the language.
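
To make that concrete, here is a minimal sketch in Python, not a production detector. It assumes the language is known and that a sample of it is available for training. The function names (train_bigram_model, score, guess_encoding), the character-bigram model, and the floor penalty of -15.0 are all illustrative choices, not an established method. Real detectors use much larger models and more careful smoothing.

```python
import math
from collections import Counter

def train_bigram_model(sample_text):
    """Build a toy character-bigram log-probability table from sample text."""
    counts = Counter(zip(sample_text, sample_text[1:]))
    total = sum(counts.values())
    return {pair: math.log(n / total) for pair, n in counts.items()}

def score(text, model, floor=-15.0):
    """Average log-probability per bigram; unseen bigrams get the floor penalty."""
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return floor
    return sum(model.get(p, floor) for p in pairs) / len(pairs)

def guess_encoding(data, candidates, model):
    """Return (encoding, decoded_text) for the best-scoring candidate."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        s = score(text, model)
        if best is None or s > best[0]:
            best = (s, enc, text)
    if best is None:
        raise ValueError("no candidate encoding could decode the data")
    return best[1], best[2]

# Usage: a tiny German sample as the "language model", Latin-1 bytes as input.
model = train_bigram_model(
    "Die Grüße aus München sind schön. "
    "Über die Brücke führt der Weg für müde Füße."
)
data = "Grüße aus München".encode("iso-8859-1")
encoding, text = guess_encoding(data, ["utf-8", "iso-8859-1", "koi8-r"], model)
print(encoding, text)
```

Note that single-byte encodings such as KOI8-R decode almost any byte string without error, so the decoding step alone cannot decide; it is the language model that separates plausible text from mojibake.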

>> But these
>> days, the best advice, as you mention, is "just use UTF-8".
> 
>   UTF-8 is a reasonably well thought out way of kind of
> american-standard-code-information-interchanging textual data, but
> again the functional assumption of such an encoding is that you will
> sequentially read the textual data from start to end if you need to
> access it: "the tyranny of sequencing".

There is no such assumption. UTF-8 was explicitly and carefully designed 
to make it possible to synchronize (i.e. find starting bytes for 
characters) in the middle of a data stream. That was after bad 
experiences with other encodings.
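
As an illustration, here is a small Python sketch of that synchronization. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, while bytes that start a character never do. The function names are made up for this example, and the code assumes the data is valid UTF-8 and that 0 <= offset < len(data).

```python
def is_continuation(byte):
    """UTF-8 continuation bytes have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

def sync_forward(data, offset):
    """Return the first index >= offset that starts a character (or len(data))."""
    while offset < len(data) and is_continuation(data[offset]):
        offset += 1
    return offset

def sync_backward(data, offset):
    """Return the largest index <= offset that starts a character."""
    while offset > 0 and is_continuation(data[offset]):
        offset -= 1
    return offset

# Usage: jump into the middle of a 4-byte character and recover its boundaries.
data = "aé€😀z".encode("utf-8")   # characters of 1, 2, 3, 4 and 1 bytes
start = sync_backward(data, 8)    # index 8 is inside the emoji
end = sync_forward(data, start + 1)
print(data[start:end].decode("utf-8"))
```

Because a character occupies at most four bytes, either loop skips at most three continuation bytes, so random access into a large UTF-8 file costs only a constant amount of extra work.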

>   Again when you are dealing with massive amounts of data at once those
> kinds of assumptions are not really helpful. There is a reason why
> people invented alphabets and code pages are fine as long as their
> mapping to characters is well specified.

I don't know why UTF-8 shouldn't be suitable for massive amounts of 
data. It's used that way in a lot of places these days.

Regards,    Martin.

>   lbrtchx
> .
>

Received on Friday, 27 September 2019 11:10:19 UTC