- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 27 Sep 2019 11:09:53 +0000
- To: Albretch Mueller <lbrtchx@gmail.com>
- CC: "www-international@w3.org" <www-international@w3.org>
On 2019/09/27 19:30, Albretch Mueller wrote:
> On 9/27/19, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
>> I have seen attempts at creating such lists in the 1990s.
>
> Could you please point me in the direction to that prior art?

I said *attempts*.

>> Maybe you can clarify.
>
> When you do corpora research, you can't impose your will or
> understanding on the people that authored certain data in whichever
> way they chose to, but you can automatically decode and reencode (to
> UTF-8 if you choose to) text.

Yes. What's important in such a case is not which encoding may be suited
for which language, but which encoding was actually used. And the modern
approach for such a problem would be to try encodings for reencoding, and
then check whether the result fits a probabilistic model of the language
in question. But in some cases, you may not even know the language.

>> But these
>> days, the best advice, as you mention, is "just use UTF-8".
>
> UTF-8 is a reasonably well thought out way of kind of
> american-standard-code-information-interchanging textual data, but
> again the functional assumption of such an encoding is that you will
> sequentially read the textual data from start to end if you need to
> access it: "the tyranny of sequencing".

There is no such assumption. UTF-8 was explicitly and carefully designed
to make it possible to synchronize (i.e. find starting bytes for
characters) in the middle of a data stream. That was after bad
experiences with other encodings.

> Again when you are dealing with massive amounts of data at once those
> kinds of assumptions are not really helpful. There is a reason why
> people invented alphabets and code pages are fine as long as their
> mapping to characters is well specified.

I don't know why UTF-8 shouldn't be suitable for massive amounts of
data. It's used that way in a lot of places these days.

Regards,   Martin.

> lbrtchx
> .
>
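
For illustration only, here is a minimal Python sketch of the synchronization property mentioned above. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so a reader dropped at an arbitrary byte offset can skip forward to the next character start. The function name `sync_to_char_start` and the sample string are assumptions made for this example, not something from the thread, and the sketch assumes the input is well-formed UTF-8.

```python
def sync_to_char_start(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a UTF-8 character starts.

    Hypothetical helper for illustration; assumes `data` is well-formed UTF-8.
    """
    # Continuation bytes look like 10xxxxxx (0x80..0xBF); every other byte
    # value begins a character, so skip continuation bytes only.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "Dürst 東京".encode("utf-8")

# Offset 2 is the second byte of "ü", i.e. the middle of a character.
start = sync_to_char_start(data, 2)
print(data[start:].decode("utf-8"))  # prints "rst 東京"
```

In well-formed UTF-8 a character has at most three continuation bytes, so the loop skips at most three bytes before decoding can resume. The same test applied backwards finds the start of the character that contains a given offset.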
Received on Friday, 27 September 2019 11:10:19 UTC