Re: languages to encodings associations . . . from Albretch Mueller on 2019-09-27 (www-international@w3.org from July to September 2019)

From: Albretch Mueller <lbrtchx@gmail.com>
Date: Fri, 27 Sep 2019 12:30:56 +0200
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: "www-international@w3.org" <www-international@w3.org>
Message-ID: <CAFakBwh0d6Qw2e=AvrpzUH1a=VnByrRuWoabhO6z=xbiMJsLwg@mail.gmail.com>

On 9/27/19, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
> I have seen attempts at creating such lists in the 1990ies.

 Could you please point me in the direction of that prior art?

> Maybe you can clarify.

 When you do corpus research, you can't impose your will or
understanding on the people who authored the data in whichever way
they chose, but you can automatically decode and re-encode the text
(to UTF-8 if you choose to).
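
 As a minimal sketch of that decode and re-encode step, assuming the
source encoding of each document is already known (the function name
and the sample string are only illustrative, not part of any existing
tool):

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    # decode() raises UnicodeDecodeError if the bytes are not valid
    # in the declared source encoding.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Illustrative input: "Dürst" stored as ISO-8859-1, one byte per character.
latin1_bytes = "Dürst".encode("iso-8859-1")
utf8_bytes = reencode_to_utf8(latin1_bytes, "iso-8859-1")
```

 The hard part in practice is knowing `source_encoding` in the first
place, which is where a languages-to-encodings association would help.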

> But these
> days, the best advice, as you mention, is "just use UTF-8".

 UTF-8 is a reasonably well-thought-out way of kind of
american-standard-code-information-interchanging textual data, but
again, the functional assumption of such an encoding is that you will
read the textual data sequentially from start to end whenever you need
to access it: "the tyranny of sequencing".

 Again, when you are dealing with massive amounts of data at once,
those kinds of assumptions are not really helpful. There is a reason
why people invented alphabets, and code pages are fine as long as
their mapping to characters is well specified.
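
 To make the contrast concrete, here is a rough sketch, assuming a
single-byte code page (cp1252 is just an example) and indexing by code
point. Both helper names are hypothetical:

```python
def char_at_single_byte(raw: bytes, index: int, code_page: str) -> str:
    """Single-byte code page: character i is simply byte i (direct access)."""
    if not 0 <= index < len(raw):
        raise IndexError(index)
    return raw[index:index + 1].decode(code_page)


def char_at_utf8(raw: bytes, index: int) -> str:
    """UTF-8: scan from the start, counting the bytes that begin a character."""
    count = -1
    start = None
    for pos, byte in enumerate(raw):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a new character starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return raw[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return raw[start:].decode("utf-8")  # requested character was the last one


sample = "Grüße, Dürst"
from_code_page = char_at_single_byte(sample.encode("cp1252"), 2, "cp1252")
from_utf8 = char_at_utf8(sample.encode("utf-8"), 2)
```

 Both calls return the same character, but the first is a direct byte
lookup while the second has to walk the data up to that position. This
only concerns access by character position; byte offsets into UTF-8
data remain directly addressable.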

 lbrtchx

Received on Friday, 27 September 2019 10:31:19 UTC