- From: Albretch Mueller <lbrtchx@gmail.com>
- Date: Fri, 27 Sep 2019 12:30:56 +0200
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: "www-international@w3.org" <www-international@w3.org>
On 9/27/19, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
> I have seen attempts at creating such lists in the 1990ies.

Could you please point me in the direction of that prior art?

> Maybe you can clarify.

When you do corpora research, you can't impose your will or understanding on the people who authored certain data in whichever way they chose to, but you can automatically decode and re-encode text (to UTF-8 if you choose to).

> But these
> days, the best advice, as you mention, is "just use UTF-8".

UTF-8 is a reasonably well thought out way of kind of american-standard-code-information-interchanging textual data, but again, the functional assumption of such an encoding is that you will read the textual data sequentially from start to end if you need to access it: "the tyranny of sequencing". When you are dealing with massive amounts of data at once, those kinds of assumptions are not really helpful. There is a reason why people invented alphabets, and code pages are fine as long as their mapping to characters is well specified.

lbrtchx
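
P.S. A minimal sketch of the decode-and-re-encode step mentioned above, assuming the source code page is already known (here cp1252, chosen only for illustration). The helper name reencode_to_utf8 is hypothetical. It also shows the "sequencing" point: in a single-byte code page the n-th character sits at byte n, while in UTF-8 it does not have to.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known code page and re-encode them as UTF-8.

    Hypothetical helper: it assumes the caller already knows the source
    encoding; it does not try to detect it.
    """
    # bytes.decode raises UnicodeDecodeError if a byte has no mapping
    # in the given code page, so a badly specified mapping fails loudly.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Sample data in a single-byte code page (cp1252 is an assumption).
legacy = "Dürst".encode("cp1252")
utf8 = reencode_to_utf8(legacy, "cp1252")

# One byte per character in the code page, so byte 1 is character 1.
second_char_legacy = legacy[1:2].decode("cp1252")

# In UTF-8 the same character takes two bytes, so byte offsets and
# character offsets no longer coincide after it.
second_char_utf8 = utf8[1:3].decode("utf-8")

print(len(legacy), len(utf8))
print(second_char_legacy, second_char_utf8)
```

With unknown or mixed encodings in a corpus, the source_encoding argument would have to come from metadata or a detection step, which this sketch leaves out.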
Received on Friday, 27 September 2019 10:31:19 UTC