Re: Markup of scientific (biological) names, linking to multilingual pages, etc.

Thanks all for the great replies and sorry for the late reaction: just when I was enjoying the useful comments and suggestions and wanted to start putting them into practice, my laptop decided it needed a vacation. That was Thursday afternoon - by now it's early Monday evening and with a new video card (on-site service) I can work again. (Yes, that's quick - but I still had a pretty frustrating weekend... :( )

Jukka Korpela:
>> My aim is to mark up languages correctly
>
>That's a noble goal, and I personally try to use language markup within
>reasonable limits, but I think it needs to be said that current browser or
>other support is _very_ limited. That is, you shouldn't expect much
>practical gain now or in the near future. The only _accessibility_ impact
>that I know of is that IBM Home Page Reader recognizes lang attributes and
>can switch its reading mode accordingly - for a few languages. It's very
>nice when it works, but it really applies to a rather small set of
>browsing situations.

Well, it's good to know there is at least one system that does something "practical" with language changes in a document; for me that means it's useful to mark up languages even if only some of them can be handled for now - I hope that number will increase.

>> Most of the site will be bilingual (with mostly separate English and
>> Dutch pages), but other languages pop up a lot in the text, too, since
>> it's a travel journal.
>
>As a rule, bilingual or multilingual _pages_ should be avoided. _Sites_
>should have each page in one language. There are several reasons to this.


I agree that as a rule bilingual pages should be avoided - however in this case I decided to have one single home page for the site which is bilingual (just a very short descriptive text) which links to further quite separate Dutch and English versions.
In addition, each page (except this home page) will have a <link rel="alternate" ...> link to the version in the other language. (The Dutch version is mostly done, the English isn't yet).
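
A minimal sketch of what I mean, as it might appear in the head of a Dutch page. The file name and the title text are made up for the example; they are not my actual URLs:

```html
<!-- In the <head> of a Dutch page. "journal.en.html" is a hypothetical file name. -->
<link rel="alternate" hreflang="en" title="English version"
      href="journal.en.html" />

<!-- The matching English page would point back the other way
     ("journal.nl.html" is equally hypothetical). -->
<link rel="alternate" hreflang="nl" title="Nederlandse versie"
      href="journal.nl.html" />
```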

>> But what language is an English transliteration of a Russian
>> transliteration (or version!) of an Uzbek name?
>
>At the theoretical level, transliteration is between writing system, not
>languages. Thus, the Russian name for Moscow is still in Russian when
>transliterated into Latin alphabet, Moskva. But when a name is adapted
>into another language, so that the pronunciation and/or spelling is
>clearly changed, it would be adequate to consider this as a language
>change. Thus, I would use <span lang="en">Moscow</span>, <span
>lang="ru">Moskva</span>, <span lang="fi">Moskova</span>, etc.

Ah yes, if only it were as simple as always a transliteration. I didn't add "(or version!)" for nothing. :)
Take the capital of Xinjiang - on most maps you'll probably find this as Ürümqi, and you should pronounce the "q" as "ch" as you would for other Chinese transliterations (if you see Urumqi - that's definitely incorrect). However, the name is actually Uyghur, not Chinese, and a more proper transliteration of that would probably be Ürümchi (I'm not 100% sure though - I can't read Arabic characters). The Han Chinese cannot really pronounce this name though, so if you take an internal flight to that city you'll find it announced in Chinese characters (which I don't know enough yet to recognize) and in Latin characters as "Wu Lu Mu Chi" - which would be an (English? UN?) transliteration of the Chinese _version_ of the name. It took some hard thinking before I realized that over there was indeed the check-in desk for the flight to Ürümqi. So, where I mention this Chinese version I'd indeed mark it up as <span lang="zh">Wu Lu Mu Chi</span>: I agree it's still Chinese even if it's a transliteration of Chinese. But "Ürümqi"? Latin transliteration of a supposedly Chinese name which actually is Uyghur? Probably <span lang="ug">Ürümqi</span> but I am less sure of that one. (So, no markup for that name.) And that's just one example :)
Your Finnish "Moskova" is easier to recognize, I think, if only because at least it starts the same, even if the number of syllables doesn't match and an extra vowel is inserted (as with Wu Lu Mu Chi).

Jon Hanna:
>> For one class of names I do have a real problem though: how does
>> one mark up scientific names for plants, birds, animals, etc?
>> It's certainly a kind of language (though not _really_ Latin - so
>> although there is an ISO language code for Latin (la) I cannot
>> use that, I think).
>
>I passed a query on about this to the IETF Languages list. So far opinion
>seems to say that "la" is indeed the correct RFC1766/RFC3066 code to use for
>Latin as used in biological binominals.

Thanks for passing this on! I really wouldn't have guessed that "Latin" would apply here (I know, some Latin grammar applies in naming rules, but it's artificial, highly formal, and much more limited than real Latin). This nicely solves my problem of marking up the *language* for these names. (Especially in combination with Nick's and Jukka's suggestions.)

Jukka Korpela:
>> For now, I've chosen to mark up as in this example: <span
>> class="sci">Citellus fulvus</span> (that's a Yellow Souslik, in case
>> you're interested), with my stylesheet taking care of properly
>> italicizing such names (as required by the rules for scientific names).
>
>I agree with Nick's suggestion to use <i> rather than <span> here. After
>all, we very much like to have the names italicized; this should not
>depend on style sheets. We would like to have structured markup like
><taxon>, but we haven't. Using <i> is the best shot. But I wouldn't call
>it really _semantic_.
>
>(And I'd use class="taxon", but that's fairly irrelevant - it's just a
>name, except that it might evolve into some kind of convention that might
>be marginally useful.)

Nick Kew:
>IMO you'd be better using <i class="sci"> - more semantics.

Excellent! I have actually been avoiding <i> for years (and even convinced an authoring tool manufacturer to use <em></em> as default for Ctrl+i instead of <i></i>) because it's "not semantic" - so much so that I had simply overlooked or forgotten the fact that <i> can actually be useful (although I agree with Jukka it's not really semantic - it's certainly more direct and nicely avoids the need for a stylesheet while maintaining the required italics).

I also like Jukka's suggestion to use "taxon" (the thing) instead of "sci" (a made-up language code); of course this doesn't apply only to binomials - Laridae (a family) is a taxon, too.

It was just a matter of minutes (a search and replace function that supports regular expressions) to change them all. So, <span class="sci">Citellus fulvus</span> has now become <i class="taxon" xml:lang="la">Citellus fulvus</i>, etc. I'm much happier with this solution. Thanks for these ideas.
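
In context the result looks roughly like the sketch below. It assumes the pages are XHTML 1.0 served as text/html, in which case the compatibility guidelines suggest giving both lang and xml:lang; the surrounding sentence is invented for the example:

```html
<!-- Sketch only: the sentence is made up; the markup pattern is what matters. -->
<p>We saw a Yellow Souslik
(<i class="taxon" lang="la" xml:lang="la">Citellus fulvus</i>)
near the road.</p>
```

The class name is just a hook for a stylesheet or a later search and replace; the italics come from <i> itself.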

Jukka Korpela:
>> Another problem can be hreflang for the target language of a page I'm
>> linking to: what to use if *that* is a multilingual page?
>
>The theoretical answer is hreflang="mul". The "mul" code is the ISO 639-2
>code for 'multiple languages': "The language code mul (for multiple
>languages) should be applied when several languages are used and it is not
>practical to specify all the appropriate language codes." And by HTML
>definition, we _cannot_ specify more than one language code in an hreflang
>attribute. (This might be an oversight. The HTTP header Content-Language,
>which is what hreflang logically corresponds to, allows for a list of
>language codes.)

Ah, I didn't know about "mul". I have copies of several (unofficial) pages I downloaded, organized by language code or language name, but I never thought of looking for "multiple languages" as a keyword. I just found one of these actually does mention that! (Duh.)

But then the question becomes when to use it. I think it's useful for hreflang in an anchor or link tag, but for my own bilingual home page I've chosen to simply (and only) mark up the separate sections as Dutch and English. Should I add "mul" to the <html> tag? (And -not related to accessibility directly- how would a search engine treat "mul"?)

I did double-check the (X)HTML standard before asking this question: I would have expected to be able to define a list here (as you can do for font names, or class names for instance) - but alas, no, I remembered correctly that for language this wasn't possible. hreflang="mul" will do (sort of) - but is much less precise (and useful!) than hreflang="ca es en" would have been. With the latter form a visitor would be able to decide beforehand whether or not it's useful to follow a link at all; with "mul" you'd pretty much have to follow it in order to find out whether there's anything you can read. I hope this is an oversight which will be corrected eventually.
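
To make the difference concrete, here is a small sketch. The URL and link text are placeholders, and listing the languages in the title attribute is only one possible workaround of my own, not something the specification defines:

```html
<!-- Allowed: a single code, here "mul" for a multilingual target.
     www.example.org is a placeholder URL. -->
<a href="http://www.example.org/guide/" hreflang="mul"
   title="Available in Catalan, Spanish and English">Regional travel guide</a>

<!-- What I would have liked to write, but hreflang takes one
     language code only, not a space-separated list:
     <a href="http://www.example.org/guide/" hreflang="ca es en">...</a> -->
```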

>In practical terms, there's hardly any browser that uses hreflang in any
>way. Well, it could be used in an attribute selector in CSS, but that's
>not very relevant here.

I've found that Mozilla actually does use it; for link tags the hreflang language code is "translated" to a language name and displayed along with title text as a tooltip; for anchor links it's displayed as part of the element properties (when you ask for them). At least for link tags it only recognizes a two-letter code though: "en" is displayed as "English" but "en-US" is not recognized at all and neither is "mul" as I just found. (I still need to do a more systematic check which language codes it actually recognizes.)

For this particular site I'm working on I've chosen to use hreflang for all external links - even if the target language is the same as the source language, for consistency's sake. Now I just need to go through my code to find where I left it out (so far) because I didn't know what to do for a multilingual target. So many links are to "foreign" language pages here it seemed simpler and clearer to just indicate the target language for all. (On a site where that would be an exception I'd probably choose to mark up only the exceptions.)
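
The CSS attribute selector Jukka mentions could be put to use once hreflang is present on all links. A hedged sketch, which only has an effect in browsers that support CSS2 attribute selectors and generated content:

```html
<style type="text/css">
  /* Show the declared target language code after each link that has one. */
  a[hreflang]:after { content: " [" attr(hreflang) "]"; }

  /* Multilingual targets get a more readable label; this rule comes
     later with equal specificity, so it overrides the one above. */
  a[hreflang="mul"]:after { content: " [multilingual]"; }
</style>
```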

Jukka Korpela:
>> Things like alt text and table summaries on a multilingual page can be
>> fun, too. When there are only two languages on a page, you could just
>> use both - but what if there are many more?
>
>If you ask me, they should use the language of the content of the element.
>Of course, a table might contain several languages. But it's "user
>interface", like headings, are probably in one language only.

Well, yes, but again things aren't always that simple. For my single bilingual home page I wanted something visual to indicate what part of the world it's about; I could have used (and considered using) one image, in which case I would have needed an alt attribute in two languages - but the trip was through several countries; I ended up using a table with a lot of portraits (not quite a layout table, not quite a data table...) and decided I'd use a summary because the image content is just a little more than decorative. Each image has as its alt text only a country code; and the summary explains it's a set of portraits and how they indicate the different countries - in two languages. So in this case there _is_ no "language of the content of the element":

summary="A collage of 6 x 6 portraits of people in Central Asia; alt texts give ISO country code. Een collage van 6 x 6 portretten van mensen in Centraal-Azi&euml;; alt tekst bevat de ISO landcode"

Similarly, the page's title tag contains a short title in two languages. I'm not quite convinced my solution is ideal (I don't think there _is_ an "ideal" solution) - but I tried a lot of approaches and this is the best I could come up with.
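
For what it's worth, the overall shape of the bilingual home page is roughly as sketched below. All wording, file names and the title are invented for the example; only the pattern (no language on the html element, one block per language) reflects what I described:

```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <!-- Hypothetical bilingual title; a title cannot contain markup,
       so its two languages cannot be marked up separately. -->
  <title>Reisverslag Centraal-Azi&euml; / Central Asia travel journal</title>
</head>
<body>
  <!-- Dutch section -->
  <div lang="nl" xml:lang="nl">
    <p>Korte beschrijving.
    <a href="nl/index.html" hreflang="nl">Nederlandse versie</a></p>
  </div>
  <!-- English section -->
  <div lang="en" xml:lang="en">
    <p>Short description.
    <a href="en/index.html" hreflang="en">English version</a></p>
  </div>
</body>
</html>
```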


Cheers,
-- 
Marjolein Katsma
HomeSite Help - http://hshelp.com/ - Extensions, Tips and Tools
The Bookstore - http://books.hshelp.com/ - Books for webmasters and webrookies

Received on Monday, 3 February 2003 15:45:19 UTC