Re: [Ltru] font features in CSS from Martin J. Dürst on 2009-11-02 (www-font@w3.org from October to December 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 02 Nov 2009 18:49:59 +0900
To: Peter Constable <petercon@microsoft.com>
CC: Andrew Cunningham <lang.support@gmail.com>, LTRU Working Group <ltru@ietf.org>, www-font <www-font@w3.org>, Håkon Wium Lie <howcome@opera.com>, www-style <www-style@w3.org>, Jonathan Kew <jonathan@jfkew.plus.com>, Stephen Zilles <szilles@adobe.com>, "Adam Twardoch (List)" <list.adam@twardoch.com>
Message-ID: <4AEEAB47.7060708@it.aoyama.ac.jp>

Hello Peter, others,

On 2009/10/31 8:02, Peter Constable wrote:
> Btw, Martin: it’s not clear to me what outcomes you have in mind for bringing the CSS thread into the IETF Languages list.

See below.


> You made statements about HTML lang and xml:lang, and those statements seem to be correct.

Yes. Adam's mail seemed to suggest that ISO 639 3-letter codes for 
well-known languages are or should be used as IETF BCP 47 language tags 
in HTML lang and xml:lang. Going back and rereading Adam's mail, I'm 
sure there was no intent for such a suggestion, but I still think it was 
easy for a third party to read it that way. So I wanted to make sure 
people understood the difference between ISO codes and IETF BCP 47 
language tags.

Also, there was a comment that there was no one-to-one correspondence 
between OpenType "language systems" (put in quotes on purpose) and ISO 
codes, and I wanted to point out that the correspondence should be 
better for IETF BCP 47 language tags than for ISO language codes.

> You suggested changes to the data in the OT tag registry; that registry is only tangentially relevant for this list, I think, though I have commented on your suggestions.

I didn't want to suggest any changes to the OT data, at least not for now.

> Sent: Saturday, October 31, 2009 7:54 AM
> To: Andrew Cunningham; Martin J. Dürst
> Cc: LTRU Working Group; www-font; Håkon Wium Lie; www-style; Jonathan Kew; Stephen Zilles; Adam Twardoch (List)
> Subject: Re: [Ltru] font features in CSS
>
> Indeed, the OT language system tags are about typographic conventions.

Yes. In that sense, they are a misnomer, as John Hudson has mentioned.

Now I have to say that in the language area, we are used to misnomers, 
examples being the LANG environment variable on ***ix systems that 
denotes a locale, or the talk about languages in ICANN settings when 
mostly or exclusively, they mean script, and so on.

The other experience we have had in the work on BCP 47 is that it is
often possible, and often desirable, to look at some phenomena as not
necessarily language-determined but nevertheless language-related.
Issues such as sort order, number and monetary formatting, quoting 
conventions, and so on, at first sight seem to be separate from the 
question of whether the text is German or French. But then a German 
reader expects some sort orders and not others, and a French reader 
expects some quoting conventions and not others. And it becomes feasible 
to tag different typical German sort orders as variants of 'de'.

> Now, many languages have a single conventional writing system and a single set of conventions for typography for that writing system, though conventions may differ for other languages with writing systems based on the same script.

It is definitely clear that there isn't a one-to-one correspondence. But 
my observation was that the correspondence got better when moving from
ISO tags to IETF BCP 47 language tags. So one question where LTRU could 
help www-font is whether or to what extent it may make sense to use BCP 
47 language tags to identify typographic conventions as expressed by OT 
"language systems".


> A familiar example is Serbian, which has distinctive italic forms for certain letters making its typographic conventions different from, say, Russian.

Yes. The question is how to get from HTML lang or xml:lang to the OT 
"language systems". For cases such as Serbian, I think the common 
assumption would be that browsers supporting OT would automatically 
activate the "SRB" (Serbian) "language system" for pieces of text that 
are tagged with something like lang='sr' (or sr-*).

The more interesting question is how to deal with Macedonian, assuming 
that Macedonian uses the same typographic conventions and variants as 
Serbian. There are several possibilities:

a) Browsers automatically activate the "SRB" "language system" for texts 
labeled lang='mk' (or mk-*)

b) There's a way in CSS to say: use "SRB" for this text. The selector 
part already exists; the question is just how to create the property
part, and where to allow it (@font-face and/or general rules). This 
could look something like:
:lang(mk) { opentype-language-system: 'SRB' }

c) Same as b), but without exposing OT tags, essentially saying: Use the 
same conventions as for Serbian. This could look something like:
:lang(mk) { typographic-convention-same-as: 'sr' }
(property names are overly long and descriptive on purpose; instead of 
'sr' in the above example, 'sr-Cyrl' may be more appropriate)

The advantages of c) over b) would be that it is less dependent on a 
specific font technology, and it may be easier on the users, who don't 
have to learn yet one more kind of tag.
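
To make options a) and c) more concrete, here is a minimal Python sketch of
how a browser might get from a lang value to an OT "language system" tag.
Everything in it is an assumption for illustration: the table contents, the
names OT_LANGUAGE_SYSTEMS, SAME_CONVENTIONS_AS and ot_language_system, and
the choice of prefix matching on whole subtags (in the style of RFC 4647
basic filtering). Only the "SRB" tag and the sr/mk example come from the
discussion above.

```python
# Hypothetical table: BCP 47 language range -> OT "language system" tag.
# Only the Serbian entry is taken from the discussion; a real table would
# be much larger and would need to be agreed on somewhere.
OT_LANGUAGE_SYSTEMS = {
    "sr": "SRB",
}

# Hypothetical author-supplied rules in the spirit of option c):
# "use the same typographic conventions as for this other language".
SAME_CONVENTIONS_AS = {
    "mk": "sr",
}


def matches(language_range, tag):
    """Case-insensitive prefix match on whole subtags: 'sr' matches 'sr-Cyrl-RS'."""
    language_range, tag = language_range.lower(), tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def lookup(table, tag):
    """Return the value for the longest range in table that matches tag, or None."""
    best = None
    for language_range in table:
        if matches(language_range, tag):
            if best is None or len(language_range) > len(best):
                best = language_range
    return table[best] if best is not None else None


def ot_language_system(tag):
    """Resolve a lang value to an OT tag; None means the font's default language system."""
    alias = lookup(SAME_CONVENTIONS_AS, tag)
    effective = alias if alias is not None else tag
    return lookup(OT_LANGUAGE_SYSTEMS, effective)
```

With these tables, both "sr-Cyrl-RS" and "mk" resolve to "SRB", while "en"
resolves to None, i.e. the font's default language system. Whether the alias
table is built into the browser (option a) or supplied by the author through
CSS (option c) does not change the lookup itself.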

> There may also be cases within a single language and writing system of multiple typographic conventions. For instance, Malayalam is taught today using fewer conjunct forms than were used in the past.

Yes. The question here would be whether we can register a variant subtag 
for this distinction at IANA. (Treating this for the moment as a generic 
question, not as a particular request.) In my view, this should 
definitely be possible, as such a distinction may be relevant for 
libraries tagging books, for tagging movie subtitles, and so on.
For electronic text in Unicode, it may be more a matter of CSS than of 
language tagging for the text itself, at least if these are simply glyph 
differences not expressed in Unicode.

> In principle, it may be reasonable to say that two or more languages can be described as having a common set of typographic conventions.

Yes indeed. There are several examples of this in the OT registry, such as
Athapaskan ATH apk, apj, apl, apm, apw, nav, bea, sek, bcr, caf, crx, 
clc, gwi, haa, chp, dgr, scs, xsl, srs, ing, hoi, koy, hup, ktw, mvb, 
wlk, coq, ctc, gce, tol, tuu, kkz, tgx, tht, aht, tfn, taa, tau, tcb, 
kuu, tce, ttm, txc

The question here would be whether this can be covered by a collection 
code, whether the list is an artifact of the fact that Peter wasn't able 
to reconstruct the original intent (and as Peter said, maybe there never 
was much of a clear 'original intent' for some of the entries), whether 
this OT "language system" is actually in use or not, and so on.

> For instance, I don’t know of any particular reason why font rendering should differ according to whether the text is English or Spanish.

I don't know about English and Spanish. But I know that French likes
lots of ligatures, whereas Italian doesn't like ligatures at all. My
understanding is that this could be reflected with an OT "language 
system" feature. And I seem to remember that in old times (meaning 
before PostScript printers), the same "Times" font might have been cut 
slightly differently for English, French, and German, to take into 
account the frequencies of different letter combinations and to improve 
readability. (I'm not an expert in typography, and I'm sure many of the 
participants in this discussion from the www-font list can provide more 
information or correct me if necessary.)

> In practice, though, it would be very difficult to manage a system that organized the data that way: it’s far easier to allow for a default OpenType language system tag for every language, keeping in mind that OpenType fonts have a default language system and that an explicit language system only needs to be incorporated into font data and invoked when rendering if there are distinctive typographic behaviours.

Yes. What I'm trying to get at is to what extent we can make use of BCP 
47 language tags (which Web developers already should be familiar with) 
to express the typographic conventions and distinctions that OT 
"language systems" represent.

> Martin suggested that the info Adam Twardoch reported (on some other list than this) should be revised.

My mail may read that way, but it wasn't intended that way.


> In principle, changes such as Martin suggested to use IETF Language Tags with region or script subtags might make sense,

Yes. One reason to include LTRU in this discussion was to see to what 
extent it makes sense.

> but in practice that would be attempting to do something that goes beyond the intent of the OpenType tag registry and that would not be highly feasible: to equate every language system tag with a _specific_ and equivalent IETF language tag.

I agree that there are cases of "language systems" where it would be 
quite difficult. But then some of these may also be cases that may not 
actually be in use.

> The simple explanation is the one Andrew gave: these things are not comparable – unless we want to introduce variant subtags for typographic conventions, and I’m not sure that makes sense.

I'm also not sure it makes sense. But we have distinctions such as 
Fraktur vs. Roman (that one via script subtags rather than variant 
subtags), which may come very close in usage to "Malayalam with many 
ligatures" vs. "Malayalam with few ligatures".

> My add’l explanation is that it would not be a simple task, and I don’t think it would be worth the effort. Thus, such changes will *not* be made.

The page in question is at www.microsoft.com, and so it's Microsoft's 
business whether they want to change it or not. But the question of
how to get from HTML/XML and CSS to these features involves BCP 47 
language tags anyway, so the question of correspondence (whether 
one-to-one or not) is a separate one from the question of whether the 
table itself changes or not.

> As for “Chinese Phonetic”, I mentioned above that most of the tags were registered years ago without documentation. Thus, it’s not clear what was meant when ZHP was first submitted. It probably was Pinyin, though that’s not certain. Now, I could go and revise the data in the OT tag registry to make the intent explicit, describing ZHP as being for “Chinese Pinyin”. But I’ve got to ask: are there really different typographic conventions for Pinyin than for any other Latin-based writing system for Chinese? My guess is probably not.

I seem to remember that, for Pinyin, the Chinese rather strongly insist
on particular shapes for the lowercase 'a' and 'g'. I think they want the
shapes that one would in general only see in the Italic style to be used
for the Roman variant as well. That would explain the need for a separate
typographic convention.

As for verifying whether it's Pinyin or not, one way would be to examine 
a wide range of fonts and see if and how it's used.

> Peter
>
> From: ltru-bounces@ietf.org [mailto:ltru-bounces@ietf.org] On Behalf Of Andrew Cunningham
> Sent: Saturday, October 31, 2009 7:02 AM
> To: Martin J. Dürst
> Cc: Jonathan Kew; www-font; Håkon Wium Lie; www-style; Stephen Zilles; LTRU Working Group; Adam Twardoch (List)
> Subject: Re: [Ltru] font features in CSS
>
>
> 2009/10/30 "Martin J. Dürst"<duerst@it.aoyama.ac.jp<mailto:duerst@it.aoyama.ac.jp>>
>
>                          (OT)    (ISO)
> Chinese Hong Kong       ZHH     zho
> Chinese Phonetic        ZHP     zho
> Chinese Simplified      ZHS     zho
> Chinese Traditional     ZHT     zho
>
>
>
> you are comparing apples and oranges, as the expression goes. The ISO language codes and BCP47 are about languages.

Well, it's more like oranges and mandarins, as far as I understand.

> The table from the OT spec is NOT a language identifier.

Yes. It's not the same. But it's close, and the question is how close, 
and whether we can and/or should make good use of that closeness.

> It identifies what the OT spec refers to as a language system. According to the spec:
>
> "Language system tags identify the language systems supported in a OpenType Layout font. What is meant by a “language system” in this context is a set of typographic conventions for how text in a given script should be presented. Such conventions may be associated with particular languages, with particular genres of usage, with different publications, and other such factors."
>
> The language system tag could map to one language, to multiple languages, and in unexpected ways. Grouping commonalities in orthographic representation and typesetting traditions are the core aspect as far as I can tell, rather than language identification.

We definitely have examples of variant tags distinguishing orthographic 
representations. There's a difference between orthographic variations 
(many if not most of which are observable in the encoded characters) and 
typographic variations (many if not most of which are not observable in 
the encoded characters). Nevertheless, there's a fuzzy boundary,
depending, among other things, on personal viewpoint and technology.

Also, it should be pointed out that the main original motivation when 
introducing HTML lang (http://tools.ietf.org/html/rfc2070) and xml:lang 
was to distinguish typographic variations that were not expressible in 
Unicode. The first and foremost example that always came up was the 
distinction between Chinese (simplified and traditional), Japanese, and 
Korean glyphs of one and the same character. The example second in line, 
e.g. in talks at the Unicode Conference, is Serbian.

So in some sense, we have come full circle: We are looking at whether 
the tags used in HTML lang and xml:lang, which were mainly introduced to 
distinguish typographic/glyph variants, can be used to do what OT 
"language systems" are doing, namely distinguish typographic variants.

Of course, language tags have many other uses, too, such as selecting 
text-to-speech engines, selecting spelling checkers, and so on. And 
there have been discussions in the past about what to do for cases where 
one wants a French spell checker, German typographic conventions, and 
English text-to-speech conversion (to use a simplistic example).

These may very well in the future lead to something like the example in
c) above (:lang(mk) { typographic-convention-same-as: 'sr' }). In the 
past, they led to nothing because there was just not enough of a need 
for such cases.
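
As a rough illustration of what such per-purpose overrides could look like
on the processing side, here is a small Python sketch. The class name
LanguageSettings, its fields, and the method tag_for are all hypothetical;
nothing like this is specified in HTML, XML, or CSS. The idea is only that
each purpose falls back to the content language unless it is overridden.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of purposes a language tag might be consulted for.
PURPOSES = ("spelling", "typography", "speech")


@dataclass(frozen=True)
class LanguageSettings:
    """Content language plus optional per-purpose overrides (all BCP 47 tags)."""
    content: str                      # value of HTML lang / xml:lang
    spelling: Optional[str] = None    # spell checker to use
    typography: Optional[str] = None  # typographic conventions to follow
    speech: Optional[str] = None      # text-to-speech engine to use

    def tag_for(self, purpose: str) -> str:
        """Return the override for purpose, or the content language if none is set."""
        if purpose not in PURPOSES:
            raise ValueError(f"unknown purpose: {purpose!r}")
        override = getattr(self, purpose)
        return override if override is not None else self.content
```

For the Macedonian example, LanguageSettings(content="mk", typography="sr")
would yield 'sr' for typography and 'mk' for everything else. The simplistic
French/German/English case above would simply set all three overrides.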

Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Monday, 2 November 2009 09:50:59 UTC