RE: xml:lang question, markup for things like 'kursee', 'arigato'? from Richard Ishida on 2004-06-16 (www-international@w3.org from April to June 2004)

From: Richard Ishida <ishida@w3.org>
Date: Wed, 16 Jun 2004 16:28:53 +0100
To: "'Misha Wolf'" <Misha.Wolf@reuters.com>, <www-international@w3.org>
Cc: "'Dan Brickley'" <danbri@w3.org>
Message-Id: <20040616152852.BBC454EF60@homer.w3.org>
Yes, I think language tagging is sometimes not as straightforward as it may
seem.

Web Accessibility folks require use of lang (or xml:lang) tagging to help
voice browsers work out how to pronounce or otherwise deal with the text.
This is unlikely to be relevant here, especially if a non-standard
transcription is used, because different orthographic rules would be
expected.  [In fact, if you were to use a clever enough transcription you
might actually want the voice browser to think it was English, e.g. if you
are approximating the sound as 'cur-sea' (for 'kursi' - ok, it's not
ideal).]  The point is that the tagging is used for a practical reason here
which may not align with semantics.

In other cases you might apply language information to assist in formatting.
Suppose you had a document containing Arabic vocabulary in Arabic and
transcribed form, though the introduction and instructions, etc., were in
English. If you wanted to increase the size of the text in an Arabic font,
but not the transcription, you'd want to distinguish the two using :lang.  But
you'd need to think about how to mark up the transcription. If the Arabic
text was marked as 'ar' and the Latin transcription was marked as 'ar-latn',
the transcription would still pick up a :lang(ar) rule, because :lang matches
on the leading part of the language tag.
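
To make the 'ar' / 'ar-latn' point concrete, here is a rough sketch in Python
of the prefix matching that CSS2 describes for :lang(). The function name
lang_matches is just something I made up for illustration, and it ignores how
a real browser works out an element's language in the first place.

```python
def lang_matches(element_lang: str, selector_lang: str) -> bool:
    """Approximate CSS2 :lang(C) matching (hypothetical helper).

    The element matches if its language tag equals C, or starts with C
    immediately followed by a hyphen. Comparison is case-insensitive.
    """
    tag = element_lang.lower()
    wanted = selector_lang.lower()
    return tag == wanted or tag.startswith(wanted + "-")


# Which of these tags would a :lang(ar) rule apply to?
for tag in ("ar", "ar-latn", "en"):
    print(tag, lang_matches(tag, "ar"))
```

So a font-size rule written against :lang(ar) reaches both the Arabic-script
text and the transcription; to treat them differently you would need a second,
more specific rule for the transcription's tag, or a class.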

Then there's the question of when is a word or phrase in a different
language or not.  Should 'resume' be marked as French?  What about 'a
certain "je ne sais quoi"'?  I think it's a judgement call for many such
things, and sometimes decisions might be based on considerations such as 'do
I think other people's voice browsers will handle this correctly, or should
I call it out?'.

RI




============
Richard Ishida
W3C

contact info:
http://www.w3.org/People/Ishida/ 

W3C Internationalization:
http://www.w3.org/International/ 
 
 

> -----Original Message-----
> From: www-international-request@w3.org 
> [mailto:www-international-request@w3.org] On Behalf Of Misha Wolf
> Sent: 16 June 2004 15:12
> To: www-international@w3.org
> Cc: Dan Brickley
> Subject: RE: xml:lang question, markup for things like 
> 'kursee', 'arigato'?
> 
> 
> I'm not at all sure about John's and Jon's answers.  As it 
> happens, I was pondering the very same question just 20 mins 
> before Dan's mail arrived.  In my case, I was trying to 
> decide what xml:lang values to use for brief Turkish phrases 
> which have been degraded to the Latin alphabet as used for English.
> Both the Turkish writing system and the English writing 
> system use the Latin script.  It would surely not be helpful 
> to mark both the original phrase and the degraded version as 
> "tr-Latn"?
> 
> Misha
> 
> 
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Jon Hanna
> Sent: 16 June 2004 14:44
> To: Dan Brickley
> Cc: www-international@w3.org
> Subject: Re: xml:lang question, markup for things like 'kursee',
> 'arigato'?
> 
> 
> 
> Quoting Dan Brickley <danbri@w3.org>:
> 
> > An xml:lang question... If I have a string that's the
> > transliteration of something in, say, Arabic or Japanese, do I use
> > xml:lang="ja" the same way as if it'd been in Japanese characters? Or is
> > there an idiom to indicate transliteration?
> > 
> > eg 'kursee' is an anglo-friendly transliteration of the Arabic
> > for 'chair'... what xml:lang to wrap around it?
> 
> Currently you would mark them as Japanese or Arabic respectively.
> It seems likely (i.e. almost definite) that RFC3066's replacement will
> encode script information (in the meantime there are a handful of
> registered tags with script information: sr-Cyrl, sr-Latn, uz-Cyrl,
> uz-Latn, az-Arab, az-Cyrl, az-Latn).
> 
> > (BTW what's the correct way to refer to these terms? 'phonetic spellings
> > in roman alphabet'? Or, er, latin? I get confused embarrassingly easily by
> > this stuff.)
> 
> "The Latin script" seems the most common expression these days, but I've
> never seen "Roman Alphabet" get flames. I don't think "Roman" is applied to
> Latin variants like Fraktur, Gaelic or Carolingian scripts.
> 
> > It might well be that what I'm asking goes beyond the limited reach of
> > xml:lang, and a higher level representation is needed to capture
> > everything I'm trying to say. But still, I'd like to know what if
> > anything I ought to be saying at the xml:lang level...
> 
> In the meantime use xml:lang="ja", xml:lang="ar" etc..
> 
> -- 
> Jon Hanna
> <http://www.hackcraft.net/>
> "...it has been truly said that hackers have even more words for
> equipment failures than Yiddish has for obnoxious people." - 
> jargon.txt
> 
> 
> 
Received on Wednesday, 16 June 2004 11:28:54 UTC