- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Mon, 02 Nov 2009 18:49:59 +0900
- To: Peter Constable <petercon@microsoft.com>
- CC: Andrew Cunningham <lang.support@gmail.com>, LTRU Working Group <ltru@ietf.org>, www-font <www-font@w3.org>, Håkon Wium Lie <howcome@opera.com>, www-style <www-style@w3.org>, Jonathan Kew <jonathan@jfkew.plus.com>, Stephen Zilles <szilles@adobe.com>, "Adam Twardoch (List)" <list.adam@twardoch.com>
Hello Peter, others,
On 2009/10/31 8:02, Peter Constable wrote:
> Btw, Martin: it’s not clear to me what outcomes you have in mind for bringing the CSS thread into the IETF Languages list.
See below.
> You made statements about HTML lang and xml:lang, and those statements seem to be correct.
Yes. Adam's mail seemed to suggest that ISO 639 3-letter codes for
well-known languages are or should be used as IETF BCP 47 language tags
in HTML lang and xml:lang. Going back and rereading Adam's mail, I'm
sure there was no intent for such a suggestion, but I still think it was
easy for a third party to read it that way. So I wanted to make sure
people understood the difference between ISO codes and IETF BCP 47
language tags.
Also, there was a comment that there was no one-to-one correspondence
between OpenType "language systems" (put in quotes on purpose) and ISO
codes, and I wanted to point out that the correspondence should be
better for IETF BCP 47 language tags than for ISO language codes.
> You suggested changes to the data in the OT tag registry; that registry is only tangentially relevant for this list, I think, though I have commented on your suggestions.
I didn't want to suggest any changes to the OT data, at least not for now.
> Sent: Saturday, October 31, 2009 7:54 AM
> To: Andrew Cunningham; Martin J. Dürst
> Cc: LTRU Working Group; www-font; Håkon Wium Lie; www-style; Jonathan Kew; Stephen Zilles; Adam Twardoch (List)
> Subject: Re: [Ltru] font features in CSS
>
> Indeed, the OT language system tags are about typographic conventions.
Yes. In that sense, they are a misnomer, as John Hudson has mentioned.
Now I have to say that in the language area, we are used to misnomers,
examples being the LANG environment variable on ***ix systems that
denotes a locale, or the talk about languages in ICANN settings when
mostly or exclusively, they mean script, and so on.
The other experience we have made in the work on BCP 47 is that it is
often possible, and often desirable, to look at some phenomena as not
necessarily language-determined but nevertheless language-related.
Issues such as sort order, number and monetary formatting, quoting
conventions, and so on, at first sight seem to be separate from the
question of whether the text is German or French. But then a German
reader expects some sort orders and not others, and a French reader
expects some quoting conventions and not others. And it becomes feasible
to tag different typical German sort orders as variants of 'de'.
> Now, many languages have a single conventional writing system and a single set of conventions for typography for that writing system, though conventions may differ for other languages with writing systems based on the same script.
It is definitely clear that there isn't a one-to-one correspondence. But
my observation was that the correspondence got better when moving from
ISO tags to IETF BCP 47 language tags. So one question where LTRU could
help www-font is whether or to what extent it may make sense to use BCP
47 language tags to identify typographic conventions as expressed by OT
"language systems".
> A familiar example is Serbian, which has distinctive italic forms for certain letters making its typographic conventions different from, say, Russian.
Yes. The question is how to get from HTML lang or xml:lang to the OT
"language systems". For cases such as Serbian, I think the common
assumption would be that browsers supporting OT would automatically
activate the "SRB" (Serbian) "language system" for pieces of text that
are tagged with something like lang='sr' (or sr-*).
The more interesting question is how to deal with Macedonian, assuming
that Macedonian uses the same typographic conventions and variants as
Serbian. There are several possibilities:
a) Browsers automatically activate the "SRB" "language system" for texts
labeled lang='mk' (or mk-*)
b) There's a way in CSS to say: use "SRB" for this text. The selector
part already exists, the question is just how to create the property
part, and where to allow it (@font-face and/or general rules). This
could look something like:
    :lang(mk) { opentype-language-system: 'SRB' }
c) Same as b), but without exposing OT tags, essentially saying: Use the
same conventions as for Serbian. This could look something like:
    :lang(mk) { typographic-convention-same-as: 'sr' }
(property names are overly long and descriptive on purpose; instead of
'sr' in the above example, 'sr-Cyrl' may be more appropriate)
The advantages of c) over b) would be that it is less dependent on a
specific font technology, and it may be easier on the users, who don't
have to learn yet one more kind of tag.
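To make options a) to c) above a little more concrete, here is a minimal Python sketch of how a browser might get from a language tag to an OT "language system". Everything in it is an assumption for illustration: the function names, the one-entry table (only the 'sr' to "SRB" pairing comes from this discussion), and the same_as argument, which stands in for a style rule of the kind shown in c). The fallback by dropping trailing subtags is modeled on the lookup scheme of RFC 4647; no existing browser or CSS API is implied.

```python
from typing import Dict, Iterator, Optional

# Illustrative table only. 'sr' -> 'SRB' is the case discussed in this thread;
# a real table would be derived from the OT tag registry.
DEFAULT_OT_SYSTEM: Dict[str, str] = {"sr": "SRB"}


def truncations(tag: str) -> Iterator[str]:
    """Yield the tag, then ever shorter prefixes: sr-cyrl-rs, sr-cyrl, sr."""
    parts = tag.lower().split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()
        # As in RFC 4647 lookup, do not leave a single-letter subtag at the end.
        if parts and len(parts[-1]) == 1:
            parts.pop()


def resolve_ot_system(
    lang: str,
    same_as: Optional[Dict[str, str]] = None,
    table: Dict[str, str] = DEFAULT_OT_SYSTEM,
) -> Optional[str]:
    """Return an OT language system tag for a language tag, or None.

    same_as is a hypothetical author override in the spirit of option c):
    keys are lowercase language tags, values are the tags whose typographic
    conventions should be borrowed.
    """
    same_as = same_as or {}
    for candidate in truncations(lang):
        if candidate in same_as:
            # Author said: treat this language like another one.
            for borrowed in truncations(same_as[candidate]):
                if borrowed in table:
                    return table[borrowed]
        elif candidate in table:
            return table[candidate]
    return None  # caller falls back to the font's default language system
```

With this sketch, resolve_ot_system("sr-Cyrl-RS") finds "SRB" by truncation, resolve_ot_system("mk") finds nothing, and resolve_ot_system("mk-MK", same_as={"mk": "sr-Cyrl"}) finds "SRB" through the author's rule. Option a) would amount to adding an 'mk' entry to the built-in table instead.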
> There may also be cases within a single language and writing system of multiple typographic conventions. For instance, Malayalam is taught today using fewer conjunct forms than were used in the past.
Yes. The question here would be whether we can register a variant subtag
for this distinction at IANA. (Treating this for the moment as a generic
question, not as a particular request.) In my view, this should
definitely be possible, as such a distinction may be relevant for
libraries tagging books, for tagging subtitles in movies, and so on.
For electronic text in Unicode, it may be more a matter of CSS than of
language tagging for the text itself, at least if these are simply glyph
differences not expressed in Unicode.
> In principle, it may be reasonable to say that two or more languages can be described as having a common set of typographic conventions.
Yes indeed. There are several examples of this in the OT registry, such as:
  Athapaskan  ATH  apk, apj, apl, apm, apw, nav, bea, sek, bcr, caf, crx,
  clc, gwi, haa, chp, dgr, scs, xsl, srs, ing, hoi, koy, hup, ktw, mvb,
  wlk, coq, ctc, gce, tol, tuu, kkz, tgx, tht, aht, tfn, taa, tau, tcb,
  kuu, tce, ttm, txc
The question here would be whether this can be covered by a collection
code, whether the list is an artifact of the fact that Peter wasn't able
to reconstruct the original intent (and as Peter said, maybe there never
was much of a clear 'original intent' for some of the entries), whether
this OT "language system" is actually in use or not, and so on.
> For instance, I don’t know of any particular reason why font rendering should differ according to whether the text is English or Spanish.
I don't know about English and Spanish. But I know that French likes
lots of ligatures, but Italian doesn't like ligatures at all. My
understanding is that this could be reflected with an OT "language
system" feature. And I seem to remember that in old times (meaning
before PostScript printers), the same "Times" font might have been cut
slightly differently for English, French, and German, to take into
account the frequencies of different letter combinations and to improve
readability. (I'm not an expert in typography, and I'm sure many of the
participants in this discussion from the www-font list can provide more
information or correct me if necessary.)
> In practice, though, it would be very difficult to manage a system that organized the data that way: it’s far easier to allow for a default OpenType language system tag for every language, keeping in mind that OpenType fonts have a default language system and that an explicit language system only needs to be incorporated into font data and invoked when rendering if there are distinctive typographic behaviours.
Yes. What I'm trying to get at is to what extent we can make use of BCP
47 language tags (which Web developers already should be familiar with)
to express the typographic conventions and distinctions that OT
"language systems" represent.
> Martin suggested that the info Adam Twardoch reported (on some other list than this) should be revised.
My mail may read that way, but it wasn't intended that way.
> In principle, changes such as Martin suggested to use IETF Language Tags with region or script subtags might make sense,
Yes. One reason to include LTRU in this discussion was to see to what
extent it makes sense.
> but in practice that would be attempting to do something that goes beyond the intent of the OpenType tag registry and that would not be highly feasible: to equate every language system tag with a _specific_ and equivalent IETF language tag.
I agree that there are cases of "language systems" where it would be
quite difficult. But then some of these may also be cases that may not
actually be in use.
> The simple explanation is the one Andrew gave: these things are not comparable – unless we want to introduce variant subtags for typographic conventions, and I’m not sure that makes sense.
I'm also not sure it makes sense. But we have distinctions such as
Fraktur vs. Roman (that one via script subtags rather than variant
subtags), which may come very close in usage to "Malayalam with many
ligatures" vs. "Malayalam with few ligatures".
> My add’l explanation is that it would not be a simple task, and I don’t think it would be worth the effort. Thus, such changes will *not* be made.
The page in question is at www.microsoft.com, and so it's Microsoft's
business whether they want to change it or not. But the question of
how to get from HTML/XML and CSS to these features involves BCP 47
language tags anyway, so the question of correspondence (whether
one-to-one or not) is a separate one from the question of whether the
table itself changes or not.
> As for “Chinese Phonetic”, I mentioned above that most of the tags were registered years ago without documentation. Thus, it’s not clear what was meant when ZHP was first submitted. It probably was Pinyin, though that’s not certain. Now, I could go and revise the data in the OT tag registry to make the intent explicit, describing ZHP as being for “Chinese Pinyin”. But I’ve got to ask: are there really different typographic conventions for Pinyin than for any other Latin-based writing system for Chinese? My guess is probably not.
I seem to remember that the Chinese, for Pinyin, rather strongly insist
on particular shapes for the lowercase 'a' and 'g': I think they want the
forms that one would in general only see in the Italic style to be used
for the Roman variant as well. That would explain the need for a separate
typographic convention.
As for verifying whether it's Pinyin or not, one way would be to examine
a wide range of fonts and see if and how it's used.
> Peter
>
> From: ltru-bounces@ietf.org [mailto:ltru-bounces@ietf.org] On Behalf Of Andrew Cunningham
> Sent: Saturday, October 31, 2009 7:02 AM
> To: Martin J. Dürst
> Cc: Jonathan Kew; www-font; Håkon Wium Lie; www-style; Stephen Zilles; LTRU Working Group; Adam Twardoch (List)
> Subject: Re: [Ltru] font features in CSS
>
>
> 2009/10/30 "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
>
>                        (OT)   (ISO)
> Chinese Hong Kong      ZHH    zho
> Chinese Phonetic       ZHP    zho
> Chinese Simplified     ZHS    zho
> Chinese Traditional    ZHT    zho
>
>
>
> you are comparing apples and oranges, as the expression goes. The ISO language codes and BCP47 are about languages.
Well, it's more like oranges and mandarins, as far as I understand.
> The table from the OT spec is NOT a language identifier.
Yes. It's not the same. But it's close, and the question is how close,
and whether we can and/or should make good use of that closeness.
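As a rough way to see "how close", the four Chinese rows quoted above can be grouped once by ISO code and once by a BCP 47 tag. The following Python sketch does that. The BCP 47 column is my own illustrative guess and not part of the OT registry; in particular, the tag for ZHP assumes that "Chinese Phonetic" means Pinyin, which is exactly what is uncertain.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (OT language system, ISO 639 code from the quoted table, guessed BCP 47 tag)
# The third column is an assumption made for this example only.
ROWS: List[Tuple[str, str, str]] = [
    ("ZHH", "zho", "zh-HK"),
    ("ZHP", "zho", "zh-Latn-pinyin"),
    ("ZHS", "zho", "zh-Hans"),
    ("ZHT", "zho", "zh-Hant"),
]


def group_ot_tags(rows: List[Tuple[str, str, str]], column: int) -> Dict[str, List[str]]:
    """Group OT tags by the identifier found in the given column (1 or 2)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)


by_iso = group_ot_tags(ROWS, 1)      # one ISO code shared by four OT tags
by_bcp47 = group_ot_tags(ROWS, 2)    # one guessed BCP 47 tag per OT tag
```

Under these assumptions the ISO code cannot tell the four "language systems" apart, while script, region, and variant subtags can. Whether the guessed tags are the right ones is of course the open question.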
> It identifies what the OT spec refers to as a language system. According to the spec:
>
> "Language system tags identify the language systems supported in a OpenType Layout font. What is meant by a “language system” in this context is a set of typographic conventions for how text in a given script should be presented. Such conventions may be associated with particular languages, with particular genres of usage, with different publications, and other such factors."
>
> The language system tag could map to one language, to multiple languages, and in unexpected ways. Grouping commonalities in orthographic representation and typesetting traditions are the core aspect as far as I can tell, rather than language identification.
We definitely have examples of variant tags distinguishing orthographic
representations. There's a difference between orthographic variations
(many if not most of which are observable in the encoded characters) and
typographic variations (many if not most of which are not observable in
the encoded characters). Nevertheless, there's a fuzzy boundary,
depending among other things on personal viewpoint and technology.
Also, it should be pointed out that the main original motivation when
introducing HTML lang (http://tools.ietf.org/html/rfc2070) and xml:lang
was to distinguish typographic variations that were not expressible in
Unicode. The first and foremost example that always came up was the
distinction between Chinese (simplified and traditional), Japanese, and
Korean glyphs of one and the same character. The example second in line,
e.g. in talks at the Unicode Conference, is Serbian.
So in some sense, we have come full circle: We are looking at whether
the tags used in HTML lang and xml:lang, which were mainly introduced to
distinguish typographic/glyph variants, can be used to do what OT
"language systems" are doing, namely distinguish typographic variants.
Of course, language tags have many other uses, too, such as selecting
text-to-speech engines, selecting spelling checkers, and so on. And
there have been discussions in the past about what to do for cases where
one wants a French spell checker, German typographic conventions, and
English text-to-speech conversion (to use a simplistic example).
These may very well in the future lead to something like the example in
c) above (:lang(mk) { typographic-convention-same-as: 'sr' }). In the
past, they led to nothing because there was just not enough of a need
for such cases.
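To illustrate what such a separation could mean for an implementation, here is a small Python sketch in which each use of the language information can be overridden separately and otherwise falls back to the tag on the text itself. The class and field names are hypothetical; nothing like this is defined in HTML, XML, or CSS today.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageContext:
    """Language information for a run of text, split by purpose.

    content is the ordinary lang / xml:lang value. The other fields are
    hypothetical per-purpose overrides; None means "same as content".
    """
    content: str
    spelling: Optional[str] = None
    typography: Optional[str] = None
    speech: Optional[str] = None

    def tag_for(self, purpose: str) -> str:
        """Return the language tag to use for one purpose."""
        if purpose not in ("spelling", "typography", "speech"):
            raise ValueError(f"unknown purpose: {purpose}")
        override = getattr(self, purpose)
        return override if override is not None else self.content


# The simplistic example from the text: French spell checking, German
# typographic conventions, English text-to-speech.
ctx = LanguageContext(content="en", spelling="fr", typography="de")
```

Here ctx.tag_for("typography") gives 'de', while ctx.tag_for("speech") falls back to 'en' because no override was given for speech.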
Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
Received on Monday, 2 November 2009 09:50:59 UTC