- From: Rick Jelliffe <ricko@allette.com.au>
- Date: Wed, 11 Dec 1996 21:15:55 +1100
- To: "David G. Durand" <dgd@cs.bu.edu>
- CC: w3c-sgml-wg@w3.org
David G. Durand wrote: > In concrete, I propose that we allow SDATA entities, and define that > they are _only_ to be used to represent undefined characters by a > descriptive string. > I would also invite some of the new list members that I know have character > set expertise to comment on the alternative of informally assigned > numerical codes versus informally assigned character strings. Some members of the list probably are not aware of the SPREAD public entity sets, which provide a standard mecanism for encoding ISO 10646 characters. I am just (now) finishing the Unicode 2.0 version. The Unicode 1.1 version has been available for ten months. This entity set is unusual in that it has been examined by East Asian academics, standards and industry people, and endorsed by the CJK DOCP, which has a liason to ISO WG8. I hope that SPREAD, or a version of it, can be included in XML. In order to make sure there is no confusion about what SPREAD actually proposes as far as names and numbers, here are some excepts from the entity set. The first is the hub entity: the SPREAD entities are split into blocks, so that only those needed can be loaded. The identifiers of the entities use the (hex) code number of the Unicode 2.0 set: anyone who has a Unicode 1.1, Unicode 2.0, ISO 10646 book, or the Unicode 2.0 CDROM (I think), can look up the characters. The SDATA value of the entities usually is the ISO 10646 name, except for Han ideograms and Hangul syllables, which just use the recommended Unicode short form. === <!-- SPREAD - Standardisation Project Regarding East Asian Documents Universal Public Entity Set, 1 November, 1996 --> <!-- This public entity set has the ISO 9070 formal public identifier: -//SPREAD//ENTITIES ISO/IEC 10646-1:1993//EN This set brings in all characters defined in ISO/IEC 10646-1 BMP, and the Unicode 2.0 character set of the Unicode Consortium. More information can be found at: http://www.allette.com.au/ercs/allent.html It is strongly recommended that these entity reference to these entities should always have an explicit Entity Reference Close delimiter (ERC). --> <!ENTITY % arabic PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (ARABIC)//EN" "arabic.txt"> %arabic; <!ENTITY % armenian PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (ARMENIAN)//EN" "armenian.txt"> %armenian; <!ENTITY % bengali PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (BENGALI)//EN" "bengali.txt"> %bengali; <!ENTITY % bopomofo PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (BOPOMOFO)//EN" "bopomofo.txt"> %bopomofo; <!ENTITY % cjk-misc PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (CJK-MISC)//EN" "cjk-misc.txt"> %cjk-misc; <!ENTITY % cyrillic PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (CYRILLIC)//EN" "cyrillic.txt"> %cyrillic; <!ENTITY % devanagari PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (DEVANAGARI)//EN" "devanagari.txt"> %devanagari; <!ENTITY % diacrits PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (DIACRITICALS)//EN" "diacrits.txt"> %diacrits; <!ENTITY % dingbat PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (DINGBATS)//EN" "dingbat.txt"> %dingbat; <!ENTITY % georgian PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (GEORGIAN)//EN" "georgian.txt"> %georgian; <!ENTITY % greek PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (GREEK)//EN" "greek.txt"> %greek; <!ENTITY % gujarati PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (GUJARATI)//EN" "gujarati.txt"> %gujarati; <!ENTITY % gurmukhi PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (GURMUKHI)//EN" "gurmukhi.txt"> %gurmukhi; <!ENTITY % han PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (HAN)//EN" "han.txt"> %han; <!ENTITY % hangul PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (HANGUL)//EN" "hangul.txt"> %hangul; <!ENTITY % hebrew PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (HEBREW)//EN" "hebrew.txt"> %hebrew; <!ENTITY % hiragana PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (HIRAGANA)//EN" "hiragana.txt"> %hiragana; <!ENTITY % ipa PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (IPA)//EN" "ipa.txt"> %ipa; <!ENTITY % jamo PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (JAMO)//EN" "jamo.txt"> %jamo; <!ENTITY % kannada PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (KANNADA)//EN" "kannada.txt"> %kannada; <!ENTITY % katakana PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (KATAKANA)//EN" "katakana.txt"> %katakana; <!ENTITY % lao PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (LAO)//EN" "lao.txt"> %lao; <!ENTITY % latin PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (LATIN)//EN" "latin.txt"> %latin; <!ENTITY % malayam PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (MALAYAM)//EN" "malayam.txt"> %malayam; <!ENTITY % oriya PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (ORIYA)//EN" "oriya.txt"> %oriya; <!ENTITY % punctuation PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (PUNCTUATION)//EN" "punctuation.txt"> %punctuation; <!ENTITY % symbol PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (SYMBOL)//EN" "symbol.txt"> %symbol; <!ENTITY % tamil PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (TAMIL)//EN" "tamil.txt"> %tamil; <!ENTITY % telugu PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (TELUGU)//EN" "telugu.txt"> %telugu; <!ENTITY % thai PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (THAI)//EN" "thai.txt"> %thai; <!ENTITY % tibetan PUBLIC "-//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (TIBETAN)//EN" "tibet.txt"> %tibetan; === Next, here is the SPREAD public entity set for Han ideograms. These have no useful romanised names, so the SDATA value just copies their number. === !-- SPREAD - Standardisation Project Regarding East Asian Documents Public Entity Set for Han, 1 November, 1996 --> <!-- This public entity set has the ISO 9070 formal public identifier: -//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (HAN)//EN --> <!-- There are of course a large number of Han ideographs. So they are not all declared as entities here. Instead you can define what you need: use the following template and replace "xxxx" with the HEX number you need for the character: <!ENTITY Uxxxx SDATA "[U+xxxx]"> --> <!-- POSSIBLE FORMAT FOR ENTITIES FOR VARIANT HAN IDEOGRAPHS The Han ideographs in ISO/IEC 10646 have been unified. If you require particular variants, you may use the following convention: append the standard two or three digit country or language code to the entity name. <!ENTITY Uxxxxcc SDATA "[U+xxxx, cc]"> For example: <!ENTITY U5B81JP SDATA "[U+5B81, JP]" -- Japanese variant --> Standard affixes include: JP Japan KR Korea CN Simplified Chinese (PRC) TW Traditional Chinese (Taiwan) HK Hong Kong SG Singaporean VT Vietnamese If you define entities using this convention, please be careful to put them in their own entity set, so that they are not confused with the SPREAD entities proper. --> === Finally, here is are the SDATA entities for a block for which there are sensible names. === <!-- SPREAD - Standardisation Project Regarding East Asian Documents Public Entity Set for Thai, 2 January, 1996 --> <!-- This public entity set has the ISO 9070 formal public identifier: -//SPREAD//ENTITIES ISO/IEC 10646-1:1993 (THAI)//EN --> <!-- THAI --> <!ENTITY U0E01 SDATA "[THAI CHARACTER KO KAI]"> <!ENTITY U0E02 SDATA "[THAI CHARACTER KHO KHAI]"> <!ENTITY U0E03 SDATA "[THAI CHARACTER KHO KHUAT]"> <!ENTITY U0E04 SDATA "[THAI CHARACTER KHO KHWAI]"> <!ENTITY U0E05 SDATA "[THAI CHARACTER KHO KHON]"> <!ENTITY U0E06 SDATA "[THAI CHARACTER KHO RAKHANG]"> <!ENTITY U0E07 SDATA "[THAI CHARACTER NGO NGU]"> <!ENTITY U0E08 SDATA "[THAI CHARACTER CHO CHAN]"> <!ENTITY U0E09 SDATA "[THAI CHARACTER CHO CHING]"> <!ENTITY U0E0A SDATA "[THAI CHARACTER CHO CHANG]"> <!ENTITY U0E0B SDATA "[THAI CHARACTER SO SO]"> <!ENTITY U0E0C SDATA "[THAI CHARACTER CHO CHOE]"> <!ENTITY U0E0D SDATA "[THAI CHARACTER YO YING]"> <!ENTITY U0E0E SDATA "[THAI CHARACTER DO CHADA]"> <!ENTITY U0E0F SDATA "[THAI CHARACTER TO PATAK]"> <!ENTITY U0E10 SDATA "[THAI CHARACTER THO THAN]"> <!ENTITY U0E11 SDATA "[THAI CHARACTER THO NANGMONTHO]"> <!ENTITY U0E12 SDATA "[THAI CHARACTER THO PHUTHAO]"> <!ENTITY U0E13 SDATA "[THAI CHARACTER NO NEN]"> <!ENTITY U0E14 SDATA "[THAI CHARACTER DO DEK]"> <!ENTITY U0E15 SDATA "[THAI CHARACTER TO TAO]"> <!ENTITY U0E16 SDATA "[THAI CHARACTER THO THUNG]"> <!ENTITY U0E17 SDATA "[THAI CHARACTER THO THAHAN]"> <!ENTITY U0E18 SDATA "[THAI CHARACTER THO THONG]"> <!ENTITY U0E19 SDATA "[THAI CHARACTER NO NU]"> <!ENTITY U0E1A SDATA "[THAI CHARACTER BO BAIMAI]"> <!ENTITY U0E1B SDATA "[THAI CHARACTER PO PLA]"> <!ENTITY U0E1C SDATA "[THAI CHARACTER PHO PHUNG]"> <!ENTITY U0E1D SDATA "[THAI CHARACTER FO FA]"> <!ENTITY U0E1E SDATA "[THAI CHARACTER PHO PHAN]"> <!ENTITY U0E1F SDATA "[THAI CHARACTER FO FAN]"> <!ENTITY U0E20 SDATA "[THAI CHARACTER PHO SAMPHAO]"> <!ENTITY U0E21 SDATA "[THAI CHARACTER MO MA]"> <!ENTITY U0E22 SDATA "[THAI CHARACTER YO YAK]"> <!ENTITY U0E23 SDATA "[THAI CHARACTER RO RUA]"> <!ENTITY U0E24 SDATA "[THAI CHARACTER RU]"> <!ENTITY U0E25 SDATA "[THAI CHARACTER LO LING]"> <!ENTITY U0E26 SDATA "[THAI CHARACTER LU]"> <!ENTITY U0E27 SDATA "[THAI CHARACTER WO WAEN]"> <!ENTITY U0E28 SDATA "[THAI CHARACTER SO SALA]"> <!ENTITY U0E29 SDATA "[THAI CHARACTER SO RUSI]"> <!ENTITY U0E2A SDATA "[THAI CHARACTER SO SUA]"> <!ENTITY U0E2B SDATA "[THAI CHARACTER HO HIP]"> <!ENTITY U0E2C SDATA "[THAI CHARACTER LO CHULA]"> <!ENTITY U0E2D SDATA "[THAI CHARACTER O ANG]"> <!ENTITY U0E2E SDATA "[THAI CHARACTER HO NOKHUK]"> <!ENTITY U0E2F SDATA "[THAI CHARACTER PAIYANNOI]"> <!ENTITY U0E30 SDATA "[THAI CHARACTER SARA A]"> <!ENTITY U0E31 SDATA "[THAI CHARACTER MAI HAN-AKAT]"> <!ENTITY U0E32 SDATA "[THAI CHARACTER SARA AA]"> <!ENTITY U0E33 SDATA "[THAI CHARACTER SARA AM]"> <!ENTITY U0E34 SDATA "[THAI CHARACTER SARA I]"> <!ENTITY U0E35 SDATA "[THAI CHARACTER SARA II]"> <!ENTITY U0E36 SDATA "[THAI CHARACTER SARA UE]"> <!ENTITY U0E37 SDATA "[THAI CHARACTER SARA UEE]"> <!ENTITY U0E38 SDATA "[THAI CHARACTER SARA U]"> <!ENTITY U0E39 SDATA "[THAI CHARACTER SARA UU]"> <!ENTITY U0E3A SDATA "[THAI CHARACTER PHINTHU]"> <!ENTITY U0E3F SDATA "[THAI CURRENCY SYMBOL BAHT]"> <!ENTITY U0E40 SDATA "[THAI CHARACTER SARA E]"> <!ENTITY U0E41 SDATA "[THAI CHARACTER SARA AE]"> <!ENTITY U0E42 SDATA "[THAI CHARACTER SARA O]"> <!ENTITY U0E43 SDATA "[THAI CHARACTER SARA AI MAIMUAN]"> <!ENTITY U0E44 SDATA "[THAI CHARACTER SARA AI MAIMALAI]"> <!ENTITY U0E45 SDATA "[THAI CHARACTER LAKKHANGYAO]"> <!ENTITY U0E46 SDATA "[THAI CHARACTER MAIYAMOK]"> <!ENTITY U0E47 SDATA "[THAI CHARACTER MAITAIKHU]"> <!ENTITY U0E48 SDATA "[THAI CHARACTER MAI EK]"> <!ENTITY U0E49 SDATA "[THAI CHARACTER MAI THO]"> <!ENTITY U0E4A SDATA "[THAI CHARACTER MAI TRI]"> <!ENTITY U0E4B SDATA "[THAI CHARACTER MAI CHATTAWA]"> <!ENTITY U0E4C SDATA "[THAI CHARACTER THANTHAKHAT]"> <!ENTITY U0E4D SDATA "[THAI CHARACTER NIKHAHIT]"> <!ENTITY U0E4E SDATA "[THAI CHARACTER YAMAKKAN]"> <!ENTITY U0E4F SDATA "[THAI CHARACTER FONGMAN]"> <!ENTITY U0E50 SDATA "[THAI DIGIT ZERO]"> <!ENTITY U0E51 SDATA "[THAI DIGIT ONE]"> <!ENTITY U0E52 SDATA "[THAI DIGIT TWO]"> <!ENTITY U0E53 SDATA "[THAI DIGIT THREE]"> <!ENTITY U0E54 SDATA "[THAI DIGIT FOUR]"> <!ENTITY U0E55 SDATA "[THAI DIGIT FIVE]"> <!ENTITY U0E56 SDATA "[THAI DIGIT SIX]"> <!ENTITY U0E57 SDATA "[THAI DIGIT SEVEN]"> <!ENTITY U0E58 SDATA "[THAI DIGIT EIGHT]"> <!ENTITY U0E59 SDATA "[THAI DIGIT NINE]"> <!ENTITY U0E5A SDATA "[THAI CHARACTER ANGKHANKHU]"> <!ENTITY U0E5B SDATA "[THAI CHARACTER KHOMUT]"> === I hope this clarifies exactly what the SPREAD entities provide, and what XML should provide to support them. They should be used whenever using numeric character references (to the presumed ISO 10646 character set) are not appropriate, I guess. I hope SGML Open vendors will think about providing the SPREAD entities as part of their product distribution support material. In particular, they may provide a convenient way to make 8-bit products useful to Asian and Middle Eastern markets. Personally, I think there are great advantages of having the SPREAD entities as a fallback mechanism. Not the least is that the identifier only is five characters. And you don't need to resort to hashing or multistage lookups to go from identifier to Unicode value. -- Regards Rick Jelliffe email: ricko@allette.com.au _______________________________________________________________ Allette Systems (Australia) email: info@allette.com.au Level 10, 91 York Street www: http://www.allette.com.au Sydney 2000 NSW Australia phone: +61 2 262 4777 fax: +61 2 262 4774 _______________________________________________________________
Received on Wednesday, 11 December 1996 05:12:18 UTC