Language and script encoding standards

From: John Cowan <cowan@ccil.org> · Date: Wed, 12 Jul 2006 15:59:02 -0400

I have just sent the following email to Neel Smith with respect to his
page "Developing standards for encoding languages and writing systems
in the editing of Greek and Latin texts".

Dear Dr. Smith:

I write to you as a member of LTRU, the IETF working group responsible
for RFC 3066bis, and as a long-term member of ietf-languages, the mailing
list that actually registers language tags.  I do not, however, speak
officially for either of these.

I have read with interest your page on language and script encoding
at http://chs75.harvard.edu/projects/diginc/techpub/language-script after
Chris Lilley of W3C drew attention to it on the www-international
mailing list.  I'll send all three lists copies of this email and (with
your permission) any reply you send me.

I wish to bring to your attention a variety of minor difficulties with
the statements and suggestions on that page, in the hope that we can
work together to bring about a satisfactory resolution which provides
for all the concerns of the classics community.

1.  The xml:lang attribute technically does not yet support RFC 3066bis,
and RFC 3066bis does not incorporate all the codes of ISO 639-3 (that
must wait until ISO finalizes ISO 639-3 and IETF issues RFC 3066ter).
However, these are mere matters of timing, and in substance there is no
reason why such codes cannot be used immediately.

2.  The distinction between standard Greek and the epichoric alphabets is
not one of "script" as that term is defined in ISO 15924.  The Estrangelo,
Western, and Eastern varieties of Syriac are distinguished there because
they use fundamentally different letter shapes, on a par with the
difference between Carolingian/Antiqua, insular, and Fraktur varieties
of Latin script.  Epichoric alphabets, on the other hand, differ in
orthography rather than in script: they use different conventions for
assigning sounds to Greek letters, and in some cases use additional
letters, just as is the case for English, German, and Icelandic, all
of which share the Latin script.  The same remarks apply to 23-letter
and 26-letter varieties of the Latin language: these are different
orthographies rather than different scripts.

3.  Likewise, the dialects of Ancient Greek were not (as far as I know)
mutually unintelligible, and therefore should not be given separate
language codes in ISO 639-3 according to the principles of that standard.
(The line drawn between Ancient and Modern Greek there is obviously
arbitrary, and is inherited from earlier parts of ISO 639.)

4.  Treating beta code and UTF-8 on a par with each other is a confusion
of levels.  UTF-8 (and other kinds of Unicode), like ASCII or the various
ISO 8859 standards, are encodings representing a mapping from characters
to bits.  Beta code, on the other hand, is a transliteration standard
for Ancient Greek, representing a mapping from the Greek character
repertoire to the ASCII repertoire.  There is nothing preventing a
document in beta code from being represented in an encoding other than
ASCII, as long as that encoding supports the ASCII repertoire (as in
practice all encodings do).  There is no need to represent the encoding
of an XML document using xml:lang (you cannot even parse the document
until you have determined its encoding), but there is a need to represent
any transliteration standard that is in use.
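
The two levels can be illustrated with a minimal Python sketch.  The
mapping table and the beta_to_greek helper are assumptions made for
illustration only: they cover unaccented lowercase letters and ignore
the diacritic and capitalization conventions of real beta code.

```python
# Illustrative subset of beta code: unaccented lowercase letters only.
BETA_TO_GREEK = dict(zip("abgdezhqiklmncoprstufxyw",
                         "αβγδεζηθικλμνξοπρστυφχψω"))


def beta_to_greek(text):
    """Transliterate a beta code string to Greek letters (hypothetical helper)."""
    out = []
    for i, ch in enumerate(text):
        following = text[i + 1:i + 2]  # empty string at end of input
        if ch == "s" and not following.isalpha():
            out.append("ς")  # word-final sigma
        else:
            out.append(BETA_TO_GREEK.get(ch, ch))
    return "".join(out)


beta = "logos"
greek = beta_to_greek(beta)

# Transliteration level: Greek repertoire <-> ASCII repertoire.
print(greek)

# Encoding level: the beta code text has identical bytes in each of
# these ASCII-compatible encodings...
print(beta.encode("ascii") == beta.encode("utf-8") == beta.encode("iso-8859-7"))

# ...while the same Greek characters get different bytes in different
# Unicode encoding forms.
print(greek.encode("utf-8") != greek.encode("utf-16-be"))
```

The choice of encoding changes only the bytes; the choice between beta
code and Greek characters changes the characters themselves, which is
why only the latter belongs in a language tag.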

5.  Fortunately, the variant subtag mechanism of RFC 3066bis provides a
solution for all three of these problems.  By registering variant subtags
(a fairly quick and easy process), it is possible to create tags that
specify epichoric orthographies, dialects, and transliteration methods
to any desired degree of detail.  In order to do this, it would be
necessary to fix an order in which these variants should appear (with
the understanding that any or all may be omitted) and then propose the
variants themselves, each with an associated 5-letter to 8-letter subtag.
We already have variant subtags for the old and new German orthographies,
for Slovenian dialects (as you note), and have discussed transliteration
subtags, though without coming to definite conclusions.
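
As a rough sketch of what a fixed variant order could look like in
practice, the Python below assembles a tag from optional slots.  The
slot names, their order, and the subtags "attic" and "betacode" are
hypothetical: none of them has been proposed or registered.  The syntax
check follows the variant production of RFC 3066bis, which also allows
4-character variants that begin with a digit (as in "1901").

```python
import re

# Variant subtag syntax: 5 to 8 alphanumerics, or 4 characters
# when the first is a digit (e.g. "1901").
VARIANT_RE = re.compile(r"(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})")

# Assumed fixed order of the variant slots; any slot may be omitted.
VARIANT_ORDER = ("orthography", "dialect", "transliteration")


def build_tag(language, script=None, **variants):
    """Join language, optional script, and variants in the fixed order."""
    unknown = set(variants) - set(VARIANT_ORDER)
    if unknown:
        raise ValueError(f"unknown variant slots: {sorted(unknown)}")
    parts = [language]
    if script:
        parts.append(script)
    for slot in VARIANT_ORDER:
        subtag = variants.get(slot)
        if subtag is None:
            continue  # omitted slots are simply skipped
        if not VARIANT_RE.fullmatch(subtag):
            raise ValueError(f"not a well-formed variant subtag: {subtag!r}")
        parts.append(subtag)
    return "-".join(parts)


# Hypothetical subtags, for illustration only.
print(build_tag("grc", dialect="attic", transliteration="betacode"))
# An already registered variant fits the same scheme.
print(build_tag("de", orthography="1901"))
```

Because the order is fixed by the function rather than by the caller,
two editors describing the same text would arrive at the same tag.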

6.  The rules of RFC 3066bis require that "la" rather than "lat" be used
to represent the Latin language (always use the 2-letter ISO 639-1 code
rather than the 3-letter ISO 639-2 code when one is available), and
strongly recommend that "Grek" be omitted from tags beginning with "grc",
since Greek is the normal and usual script for Ancient Greek (grc-Linb
would be suitable for Linear B texts, though).
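
These two rules can be sketched as a small normalization step.  The
lookup tables below contain only the examples from this point and are
assumptions for illustration, not an extract of the subtag registry;
normalize_tag is likewise a hypothetical helper.

```python
# Illustrative tables built from the examples in point 6 only.
SHORT_CODE = {"lat": "la"}        # 3-letter code -> 2-letter code
DEFAULT_SCRIPT = {"grc": "Grek"}  # script that should be left implicit


def normalize_tag(tag):
    """Apply the shortest-code and default-script rules to a tag."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    language = SHORT_CODE.get(language, language)
    rest = subtags[1:]
    default = DEFAULT_SCRIPT.get(language)
    # Drop the script subtag only when it is the language's usual script.
    if rest and default is not None and rest[0].lower() == default.lower():
        rest = rest[1:]
    return "-".join([language] + rest)


for tag in ("lat", "grc-Grek", "grc-Linb"):
    print(tag, "->", normalize_tag(tag))
```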

I would urge you to join ietf-languages using the web page at
http://www.alvestrand.no/mailman/listinfo/ietf-languages and discuss
the matter further in a public forum.

-- 
John Cowan      cowan@ccil.org        http://www.ccil.org/~cowan
        Is it not written, "That which is written, is written"?