Re: authoring @lang and @dir (was 3.6. The root element)

Hi Richard,

Thanks for the information on this.

On Aug 2, 2007, at 7:16 AM, Richard Ishida wrote:

>
>> From: public-html-request@w3.org
>> [mailto:public-html-request@w3.org] On Behalf Of Robert Burns
>> Sent: 01 August 2007 07:18
>
>> That is a good example. However, the RFC 3066 language codes
>> allow one to specify both language and different script
>
> Note in passing that RFC 3066 didn't allow this, and it is now an obsolete
> specification. It was replaced by RFC 4646, which does allow for scripts to
> be specified, though only when absolutely necessary to distinguish usage,
> not as a matter of course.
> (See http://www.w3.org/International/articles/language-tags/ for more
> details.)

I guess I knew that. I pulled RFC 3066 out of the current HTML5
draft. I thought I noticed some things missing that I expected to see
in the language codes RFC. Also, I understand that scripts do not need
to be specified; however, I would say there are some cases where it is
absolutely necessary to distinguish usage.

>
>
>> variants. So Hebrew written with the Latin script could be
>> designated by lang='iw-LATN' (dir='LTR'); standard Hebrew as
>> lang='iw' (dir='RTL'); Turkish as lang='tr-LATN'; and
>> Turkish in Arabic as lang='tr-Arab' (dir='RTL').
>
> Note in passing that iw is an obsolete code for Hebrew, you should now use
> 'he'. See
> http://people.w3.org/rishida/utils/subtags/index.php?searchtext=hebrew&submit=Search&searchtype=2

I did not know the subtag registry was up and running. That's great
news. We should definitely link to that from the HTML5 recommendation.

>> With these RFC 3066 language codes everything necessary to
>> designate directionality is already there. I think the reason
>> we have both @dir and @lang is so that authors have more
>> flexibility in how much language detail to provide. Also UAs
>> do not have to hard-wire the mappings of RFC 3066
>> script codes to directionality and extract script
>> information from the language codes. That's just my
>> speculation on this but perhaps someone else knows more of
>> the history behind this.
>
> See http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.093646470
>
> In fact, the use of these two attributes doesn't always coincide.
>
> In a document that is generally in English you may have a small table that
> contains only Hebrew or Arabic text.  Although it would make sense to use
> @lang once on the <table> element, so that it signifies that all the text in
> the table is in a given language and you don't have to repeat it, you would
> probably *not* want the table columns to flow from right to left (as would
> usually be the case when using dir="rtl" on the table), since this is an
> English document. If xml:lang was associated with direction, you would
> probably have no control over that.  Same goes for list items.
>
> Basically, the two attributes do different jobs. Better reduce confusion and
> scope for error by having simple, clear semantics to the attributes.

Thanks. I hadn't considered that use-case.

> It is also perfectly acceptable for people to have been labelling legacy
> Azerbaijani content as 'az' until now, and to continue to do so in the
> future, but that carries no information about whether they used the Cyrillic
> (LTR) or Arabic (RTL) script, since Azerbaijani uses both.

Here I would disagree. I'm not familiar with which script would be
considered the default (as Hebrew would for Hebrew); however, it
strikes me that if an author is trying to demarcate written language
in a document and there's ambiguity over what script a language code
implies, then it would be best practice to include a script code. So
I would not agree that it is perfectly acceptable to omit a script
where ambiguity exists.
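
To make the ambiguity concrete, here is a minimal sketch (in Python) of
how a tool might try to infer direction from a language tag. Everything
in it is my own assumption for illustration: the names script_of and
infer_direction are hypothetical, the two lookup tables are
deliberately tiny and incomplete, and the tag parsing is much simpler
than what RFC 4646 actually specifies. The point is only that a bare
'az' gives such a tool nothing to work with, while 'az-Arab' or
'az-Cyrl' does.

```python
# Illustrative, incomplete mapping of four-letter script codes to direction.
SCRIPT_DIRECTION = {
    "Latn": "ltr",
    "Cyrl": "ltr",
    "Arab": "rtl",
    "Hebr": "rtl",
}

# Assumed default scripts for a few languages (illustration only).
# 'az' is left out on purpose: no single script can be assumed for it.
DEFAULT_SCRIPT = {
    "he": "Hebr",
    "en": "Latn",
}


def script_of(tag):
    """Return the script subtag of a language tag (e.g. 'Latn'), or None.

    Simplified: looks for the first four-letter alphabetic subtag after
    the primary language and stops at a singleton such as 'x'.
    """
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            break  # extension or private-use section starts here
        if len(subtag) == 4 and subtag.isalpha():
            return subtag.title()  # normalise 'LATN' or 'latn' to 'Latn'
    return None


def infer_direction(tag):
    """Return 'ltr', 'rtl', or None when the tag does not settle it."""
    language = tag.split("-")[0].lower()
    script = script_of(tag) or DEFAULT_SCRIPT.get(language)
    return SCRIPT_DIRECTION.get(script)


for tag in ("he", "he-LATN", "az", "az-Arab", "az-Cyrl"):
    print(tag, infer_direction(tag))
```

With these tables, 'he' resolves to rtl and 'he-LATN' to ltr, while
plain 'az' resolves to nothing at all until a script subtag is added.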

> An IPA (International Phonetic Alphabet) transcription of Hebrew could well
> be marked as 'he', but it would be incorrect to assume that the
> directionality was RTL.

Phonetics in Unicode needs much attention. However, phonetics is
currently handled as an extension to the Latin script, so writing
Hebrew phonetically using IPA amounts to writing Hebrew in a Latin
script (implying left-to-right). It would be poor practice to mark up
such text with just lang='he' and not lang='he-LATN'. Personally, I'd
rather see phonetics introduced as its own script in Unicode with its
own RFC 4646 IANA-registered script code (I see there are some
phonetic-related script codes in IANA). With a single unified phonetic
script, the various glyphs used by different phonetic alphabets could
be handled simply by changing fonts, or even by a smart font that
substitutes glyphs according to phonetic alphabet metadata.

> Hope that helps,

Yes, Thanks.

> PS: Note also that @dir used in DITA, XHTML2, ITS, etc has additional values
> of lro (left-right-override) and rlo (right-left-override), which cannot be
> expressed by @lang. In fact we could consider making that the case for
> HTML5, and deprecating the <bdo> tag, though that is a separate thread.  If
> we did, then it would be clearer that @dir has a different role than @lang.

I must admit I do not understand the justification for including
'rlo' and 'lro' as enumerated values for the @dir attribute. The @dir
attribute is very much about structural-level directionality, while
the Unicode bidirectional algorithm is completely about phrase-level
directionality (as your table and list item examples illustrate). It
makes a lot of sense to me to keep these things distinct. That means
it would be better to maintain the phrase element BDO in HTML5 or to
use the related Unicode bidi control characters. I imagine a lot of
discussion has gone on around this issue, but I haven't seen it or
been a part of it.
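
For anyone unfamiliar with the control characters I mean, here is a
small sketch (in Python) of a phrase-level override done purely in the
text, with no markup. The function name override is hypothetical and
mine; the code points themselves are the Unicode override controls:
U+202D (LEFT-TO-RIGHT OVERRIDE), U+202E (RIGHT-TO-LEFT OVERRIDE) and
U+202C (POP DIRECTIONAL FORMATTING), which ends the override.

```python
import unicodedata

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING (closes the override)


def override(text, direction):
    """Wrap text in a bidi override, the character-level analogue of BDO."""
    if direction not in ("ltr", "rtl"):
        raise ValueError("direction must be 'ltr' or 'rtl'")
    opener = LRO if direction == "ltr" else RLO
    return opener + text + PDF


wrapped = override("abc", "rtl")
# Show the code point and Unicode name of each character in the result.
for ch in wrapped:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```

This only affects the run of text between the opening control and the
closing one, which is why I think of it as phrase-level rather than
structural.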

Take care,
Rob

Received on Thursday, 2 August 2007 19:52:33 UTC