Read it and weep from Chris Lilley on 2004-02-22 (www-international@w3.org from January to March 2004)

From: Chris Lilley <chris@w3.org>
Date: Mon, 23 Feb 2004 00:21:51 +0100
To: "www-international@w3.org"@homer.w3.org
Message-ID: <18310372336.20040223002151@w3.org>

Hello www-international,

>> Files here are either HTML or extended (8-bit) ASCII. Where
>> possible, text files are tab delimited. Some files have been
>> converted into standard HTML encoding (ISO-8859-1) from Unicode.

Gasp.

>> The closest equivalent character in ISO-8859-1 was selected, and
>> any diacritics simulated using <SUB> and <SUP> and the closest
>> equivalent punctuation mark. In the case of Cyrillic, Greek and
>> Hebrew, a consistent transliteration scheme was used. The source
>> for each file contains hidden tags which specify the Unicode value
>> for each character which has no ISO-8859-1 equivalent.

There is a standard way to do that: numeric character references of the form &#xXXXX;.
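
One standard HTML mechanism for characters that the document encoding cannot represent is the numeric character reference; assuming that is the mechanism meant here, a minimal Python sketch of producing ISO-8859-1 output without losing anything might look like this. The function name `to_latin1_html` is made up for illustration, and the sketch does not escape markup characters such as `&` or `<`.

```python
def to_latin1_html(text: str) -> str:
    """Return text in which every character outside ISO-8859-1 is
    written as a decimal numeric character reference (&#NNNN;)."""
    # The "xmlcharrefreplace" error handler substitutes &#NNNN; for each
    # character that the target encoding cannot represent.
    data = text.encode("iso-8859-1", errors="xmlcharrefreplace")
    # Decode again so the caller gets a str containing only Latin-1 characters.
    return data.decode("iso-8859-1")


if __name__ == "__main__":
    print(to_latin1_html("caf\u00e9 \u0416\u0443\u043a"))
```

Characters that do exist in ISO-8859-1 pass through unchanged, so no transliteration or simulated diacritics are needed.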

>> To obtain these values, you can download the file or view its
>> source in your browser. The tags have the form <!u
>> XXXX>character</!u>, where XXXX is the four digit hexadecimal value
>> of the Unicode character.

Shudder. Although a Perl script could probably go and reverse this
damage.
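
As a rough illustration of what such a reversal script could do (written in Python here rather than Perl), the sketch below replaces each hidden tag with the Unicode character it names. The tag shape `<!u XXXX>...</!u>` is taken only from the quoted description above; the whitespace handling, the function names `restore_unicode` and `to_ncr`, and the assumption that tags do not nest are all guesses, and real files on the site may differ.

```python
import re

# Assumed tag shape, from the quoted description: <!u XXXX>approximation</!u>
# where XXXX is exactly four hexadecimal digits. Nesting is not handled.
TAG = re.compile(r"<!u\s+([0-9A-Fa-f]{4})>(.*?)</!u>", re.DOTALL)


def restore_unicode(html: str) -> str:
    """Replace each hidden tag, including its ISO-8859-1 approximation
    (and any <SUB>/<SUP> markup inside it), with the real character."""
    return TAG.sub(lambda m: chr(int(m.group(1), 16)), html)


def to_ncr(html: str) -> str:
    """Alternative: keep the file in ISO-8859-1 but emit standard
    hexadecimal numeric character references instead of the hidden tags."""
    return TAG.sub(lambda m: "&#x" + m.group(1).upper() + ";", html)


if __name__ == "__main__":
    sample = "<!u 0416>Zh</!u>uk and <!u 1E0D>d<SUB>.</SUB></!u>"
    print(restore_unicode(sample))
    print(to_ncr(sample))
```

The output of `restore_unicode` would need to be saved in an encoding that can hold the characters, such as UTF-8, while the output of `to_ncr` can stay in ISO-8859-1.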

http://www.wordgumbo.com/index.htm

Interesting site, but (shakes head) why oh why!!

Although
http://www.wordgumbo.com/ie/cmp/iedata.txt

     COMPARATIVE INDOEUROPEAN DATABASE COLLECTED BY ISIDORE DYEN

                            FILE IE-DATA1

  Copyright (C) 1997 by Isidore Dyen, Joseph Kruskal, and Paul Black
              This file was last modified on Feb 5, 1997

maybe that is why.


-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 Member, W3C Technical Architecture Group

Received on Sunday, 22 February 2004 18:21:51 UTC