- From: John Cowan <cowan@ccil.org>
- Date: Wed, 12 Jul 2006 15:59:02 -0400
- To: ietf-languages@iana.org, www-international@w3.org, ltru@ietf.org
I have just sent the following email to Neel Smith with respect to his page "Developing standards for encoding languages and writing systems in the editing of Greek and Latin texts".

Dear Dr. Smith:

I write to you as a member of LTRU, the IETF working group responsible for RFC 3066bis, and as a long-term member of ietf-languages, the mailing list that actually registers language tags. I do not, however, speak officially for either of these.

I have read with interest your page on language and script encoding at http://chs75.harvard.edu/projects/diginc/techpub/language-script after Chris Lilley of W3C drew attention to it on the www-international mailing list. I'll send all three lists copies of this email and (with your permission) any reply you send me.

I wish to bring to your attention a variety of minor difficulties with the statements and suggestions on that page, in the hope that we can work together to bring about a satisfactory resolution which provides for all the concerns of the classics community.

1. The xml:lang attribute technically does not yet support RFC 3066bis, and RFC 3066bis does not incorporate all the codes of ISO 639-3 (that must wait until ISO finalizes ISO 639-3 and IETF issues RFC 3066ter). However, these are mere matters of timing, and in substance there is no reason why such codes cannot be used immediately.

2. The distinction between standard Greek and the epichoric alphabets is not one of "script" as that term is defined in ISO 15924. The Estrangelo, Western, and Eastern varieties of Syriac are distinguished there because they use fundamentally different letter shapes, on a par with the difference between Carolingian/Antiqua, insular, and Fraktur varieties of Latin script. Epichoric alphabets, on the other hand, differ in orthography rather than in script: they use different conventions for assigning sounds to Greek letters, and in some cases use additional letters, just as is the case for English, German, and Icelandic, all of which share the Latin script. The same remarks apply to 23-letter and 26-letter varieties of the Latin language: these are different orthographies rather than different scripts.

3. Likewise, the dialects of Ancient Greek were not (as far as I know) mutually unintelligible, and therefore should not be given separate language codes in ISO 639-3 according to the principles of that standard. (The line drawn between Ancient and Modern Greek there is obviously arbitrary, and is inherited from earlier parts of ISO 639.)

4. Treating beta code and UTF-8 on a par with each other is a confusion of levels. UTF-8 (and other kinds of Unicode), like ASCII or the various ISO 8859 standards, are encodings representing a mapping from characters to bits. Beta code, on the other hand, is a transliteration standard for Ancient Greek, representing a mapping from the Greek character repertoire to the ASCII repertoire. There is nothing preventing a document in beta code from being represented in an encoding other than ASCII, as long as that encoding supports the ASCII repertoire (as in practice all encodings do). There is no need to represent the encoding of an XML document using xml:lang (you cannot even parse the document until you have determined its encoding), but there is need to represent any transliteration standard that is in use.

5. Fortunately, the variant subtag mechanism of RFC 3066bis provides a solution for all three of these problems. By registering variant subtags (a fairly quick and easy process), it is possible to create tags that specify epichoric orthographies, dialects, and transliteration methods to any desired degree of detail. In order to do this, it would be necessary to fix an order in which these variants should appear (with the understanding that any or all may be omitted) and then propose the variants themselves, each with an associated 5-letter to 8-letter subtag. We already have variant subtags for the old and new German orthographies, for Slovenian dialects (as you note), and have discussed transliteration subtags, though without coming to definite conclusions.

6. The rules of RFC 3066bis require that "la" rather than "lat" be used to represent the Latin language (always use 2-letter ISO 639-1 rather than 3-letter ISO 639 tags when available), and strongly recommend that "Grek" be omitted from tags beginning with "grc", since Greek is the normal and usual script for Ancient Greek (grc-Linb would be suitable for Linear B texts, though).

I would urge you to join ietf-languages using the web page at http://www.alvestrand.no/mailman/listinfo/ietf-languages and discuss the matter further in a public forum.

-- 
John Cowan  cowan@ccil.org  http://www.ccil.org/~cowan
Is it not written, "That which is written, is written"?
Received on Wednesday, 12 July 2006 19:59:13 UTC
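
As an illustration of points 5 and 6 in the message above, here is a minimal Python sketch of how a tag might be composed under those rules: prefer the 2-letter code when one exists, drop the script subtag when it is the usual script for the language, and append variant subtags in an order agreed beforehand. The function name `build_tag`, both lookup tables, and the variant subtags `attic` and `betacode` are hypothetical and chosen for illustration only; none of them is taken from the registry, and a real implementation would consult the IANA Language Subtag Registry rather than hard-coded values.

```python
# Illustrative tables only (assumptions, not registry data).
SHORT_CODE = {"lat": "la"}        # 3-letter code -> 2-letter ISO 639-1 code
DEFAULT_SCRIPT = {"grc": "Grek"}  # script to omit, following point 6 above


def build_tag(language, script=None, variants=()):
    """Compose a tag of the form language[-Script][-variant...]."""
    language = language.lower()
    parts = [SHORT_CODE.get(language, language)]

    if script:
        script = script.title()  # script subtags are written like "Grek"
        # Omit the script when it is the usual one for this language.
        if DEFAULT_SCRIPT.get(parts[0]) != script:
            parts.append(script)

    # Variants are appended in the order given; the caller is assumed to
    # supply them in whatever order has been agreed (point 5 above).
    for variant in variants:
        variant = variant.lower()
        # Only the 5-to-8-character form mentioned in the message is handled.
        if not (variant.isascii() and variant.isalnum()
                and 5 <= len(variant) <= 8):
            raise ValueError(f"not a 5-to-8-character variant: {variant!r}")
        parts.append(variant)

    return "-".join(parts)


print(build_tag("lat"))                                  # la
print(build_tag("grc", "Grek"))                          # grc
print(build_tag("grc", "Linb"))                          # grc-Linb
print(build_tag("grc", "Grek", ["attic", "betacode"]))   # grc-attic-betacode
```

The sketch does not check whether a subtag is actually registered; it only shows the shape of the resulting tags.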