- From: Olle Jarnefors <ojarnef@admin.kth.se>
- Date: Thu, 11 Nov 1993 22:34:43 +0100
- To: ietf-charsets@INNOSOFT.COM
- Cc: ietf-822@dimacs.rutgers.edu, David Herron <david@twg.com>, Olle Jarnefors <ojarnef@admin.kth.se>
David writes:

> Languages do evolve. Mainly in idioms but also pronunciations &c.
> Therefore the a system for tagging such things should most definitely have a
> place for placing markings as to the "version" of the language. Calling it
> "old" versus "middle" probably is not sufficient, depending on what level of
> detail you want to support. ...

The rejected ISO draft for three-letter codes included some historical
language codes:

    dum  Dutch, Middle (ca. 1050-1350)
    egy  Egyptian (Ancient)
    enm  English, Middle (1100-1500)
    ang  English, Old (ca. 450-1100)
    frm  French, Middle (ca. 1400-1600)
    fro  French, Old (842-ca. 1400)
    gmh  German, Middle High (ca. 1050-1500)
    goh  German, Old High (ca. 750-1050)
    grc  Greek, Ancient (to 1453)
    sga  Irish, Old (to 900)
    mga  Irish, Middle (900-1200)
    lat  Latin
    non  Norse, Old
    peo  Persian, Old (ca. 600-400 B.C.)
    phn  Phoenician
    pro  Provençal, Old (to 1500)
    san  Sanskrit
    ota  Turkish, Ottoman (1500-1928)

If somebody feels a strong need for making such distinctions, it should
be possible to register language variant codes or, in some cases, new
language codes with IANA.

> To my knowledge choosing the right glyphs is driven by the character set.

Not always ...

> So what we need is a sufficient quantity of character sets so we can discuss
> old high germanic names in one paragraph, old english in the next, and
> russian after that. Where does the need for marking the languages come from?

No, you will not find any coded character set capable of distinguishing
between an Old High German "A", an Old English "A", and a modern "A".
This is _not_ a result of the imperfect level of development of coded
character sets, however. Your mistake is a confusion of text
representation levels:

1) In all existing coded character sets only the content of a text is
   encoded. This is what you get if you use _plain text_: only those
   distinctions necessary to make the text legible are coded.

2) To also keep such _rich text properties_ as italicization, boldness,
   smaller or bigger character size, language-correct choice of glyphs,
   and correct hyphenation behavior, you can't remain on the basic plain
   text level, but must move to a higher rich text level.

3) In most existing rich text formats these text properties are
   represented by some kind of mark-up of the plain text. This is
   certainly the case for the SGML-based TEI encoding system developed
   to meet the needs of linguists.

4) There are also very sound technical reasons for not including a bit
   for each binary rich text property in the bit sequence representing
   a character in a coded character set. These properties, including
   language, very seldom vary from one character to the next. They are
   constant for a chunk of the text, sometimes of considerable length.

> >I _would_ support making the country code into something that should be used
> >only if it is absolutely necessary to disambiguate different usages of the
> >same language. e.g. French and French-Canadian which have different
> >capitalisation rules I believe. ...
> - - -
> Hmmm... I don't see this.
>
> Isn't capitalization done within the text? `a' is a different character
> code than `A' after all...

Not capitalization perhaps, but hyphenation rules may be different, and
they are important when text is displayed with a different window width
or font than that used originally.

/Olle
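
Point 4 of the list above can be illustrated with a small sketch. The
Python below is only an illustration under stated assumptions: the `Run`
type and both helper functions are hypothetical names, and the codes
"goh" and "ang" are borrowed from the draft list quoted earlier. It
stores language as a property of a run of characters, as mark-up does,
instead of attaching a tag to every single character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A chunk of text that shares one language (hypothetical type)."""
    start: int  # index of the first character in the run
    end: int    # index one past the last character in the run
    lang: str   # language code, e.g. "goh" for Old High German


def runs_from_per_char(tags):
    """Collapse a per-character list of language tags into runs."""
    runs = []
    for i, tag in enumerate(tags):
        if runs and runs[-1].lang == tag and runs[-1].end == i:
            # Same language as the previous character: extend the run.
            runs[-1] = Run(runs[-1].start, i + 1, tag)
        else:
            # Language changed (or first character): start a new run.
            runs.append(Run(i, i + 1, tag))
    return runs


def lang_at(runs, index):
    """Return the language code covering the character at index, or None."""
    for run in runs:
        if run.start <= index < run.end:
            return run.lang
    return None


# Five characters tagged one by one collapse into just two runs.
per_char = ["goh", "goh", "goh", "ang", "ang"]
runs = runs_from_per_char(per_char)
```

For a text of considerable length in one language, the per-character
list grows with every character while the run list stays at a single
entry, which is the practical argument for mark-up at the rich text
level.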
Received on Thursday, 11 November 1993 13:35:38 UTC