Re: authoring @lang and @dir (was 3.6. The root element) from Robert Burns on 2007-08-07 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Tue, 7 Aug 2007 02:40:52 -0500
To: Robert Burns <rob@robburns.com>
Cc: HTML WG <public-html@w3.org>, Richard Ishida <ishida@w3.org>
Message-Id: <30507EA5-2D54-466D-B4A8-B3F949D4F09A@robburns.com>

On Aug 2, 2007, at 2:51 PM, Robert Burns wrote:

>
> Hi Richard,
>
> Thanks for the information on this.
>
> On Aug 2, 2007, at 7:16 AM, Richard Ishida wrote:
>
>>
>>> From: public-html-request@w3.org
>>> [mailto:public-html-request@w3.org] On Behalf Of Robert Burns
>>> Sent: 01 August 2007 07:18
>>
>>> That is a good example. However, the RFC 3066 language codes
>>> allow one to specify both language and different script
>>
>> Note in passing that RFC 3066 didn't allow this, and it is now an  
>> obsolete
>> specification. It was replaced by RFC 4646, which does allow for  
>> scripts to
>> be specified, though only when absolutely necessary to distinguish  
>> usage,
>> not as a matter of course.
>> (See http://www.w3.org/International/articles/language-tags/ for more
>> details.)
>
> I guess I knew that. I pulled RFC 3066 out of the current HTML5  
> draft. I thought I noticed some things missing that I expected to  
> see in the language codes RFC. Also, I understand scripts do not  
> need to be specified, however I would say there are some cases  
> where it is absolutely necessary to distinguish usage.
>
>>
>>
>>> variants. So Hebrew written with the Latin script could be
>>> designated by lang='iw-LATN' (dir='LTR'); standard Hebrew as
>>> lang='iw' (dir='RTL'); Turkish as lang='tr-LATN'; and
>>> Turkish in Arabic as lang='tr-Arab' (dir='RTL').
>>
>> Note in passing that iw is an obsolete code for Hebrew, you should  
>> now use
>> 'he'. See
>> http://people.w3.org/rishida/utils/subtags/index.php?searchtext=hebrew&submit=Search&searchtype=2
>
> I did not know the subtag registry was up and running. That's great  
> news. We should definitely link to that from the HTML5 recommendation.
>
>>> With these RFC 3066 language codes everything necessary to
>>> designate directionality is already there. I think the reason
>>> we have both @dir and @lang is so that authors have more
>>> flexibility in how much language detail to provide. Also UAs
>>> do not have to hard-wire the mappings of RFC 3066
>>> script codes to directionality and extract script
>>> information from the language codes. That's just my
>>> speculation on this but perhaps someone else knows more of
>>> the history behind this.
>>
>> See http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.093646470
>>
>> In fact, the use of these two attributes doesn't always coincide.
>>
>> In a document that is generally in English you may have a small  
>> table that
>> contains only Hebrew or Arabic text.  Although it would make sense  
>> to use
>> @lang once on the <table> element, so that it signifies that all  
>> the text in
>> the table is in a given language and you don't have to repeat it,  
>> you would
>> probably *not* want the table columns to flow from right to left  
>> (as would
>> usually be the case when using dir="rtl" on the table), since this  
>> is an
>> English document. If xml:lang was associated with direction, you  
>> would
>> probably have no control over that.  Same goes for list items.
>>
>> Basically, the two attributes do different jobs. Better reduce  
>> confusion and
>> scope for error by having simple, clear semantics to the attributes.
>
> Thanks. I hadn't considered that use-case.
>
>> It is also perfectly acceptable for people to have been labelling  
>> legacy
>> Azerbaijani content as 'az' until now, and to continue to do so in  
>> the
>> future, but that carries no information about whether they used  
>> the cyrillic
>> (LTR) or arabic (RTL) script, since Azerbaijani uses both.
>
> Here I would disagree. I'm not familiar with which script would be  
> considered the default (as Hebrew would for Hebrew), however, it  
> strikes me that if an author is trying to demarcate written  
> language in a document and there's ambiguity over what script a  
> language code implies, then it would be best practice to include a  
> script code. So I would not agree that it is perfectly acceptable  
> to omit a script where ambiguity exists.
>
>> An IPA (International Phonetic Alphabet) transcription of Hebrew  
>> could well
>> be marked as 'he', but it would be incorrect to assume that the
>> directionality was RTL.
>
> Phonetics in Unicode needs much attention. However, the way  
> phonetics is handled now is as an extension to the Latin script. So  
> phonetically writing Hebrew using IPA would be to write Hebrew in a  
> Latin script (implying left-to-right). It would be poor practice to  
> mark up such text with just lang='he' and not lang='he-LATN'.  
> Personally, I'd rather see phonetics introduced as its own script  
> in Unicode with its own RFC 4646 IANA registered script code (I see  
> there are some phonetic related script codes in IANA). By using a  
> single unified phonetic script the various glyphs used by different  
> phonetic alphabets could simply be handled by changing fonts or  
> even a smart font that substitutes glyphs according to phonetic  
> alphabet metadata.
>
>> Hope that helps,
>
> Yes, Thanks.
>
>> PS: Note also that @dir used in DITA, XHTML2, ITS, etc has  
>> additional values
>> of lro (left-right-override) and rlo (right-left-override), which  
>> cannot be
>> expressed by @lang. In fact we could consider making that the case  
>> for
>> HTML5, and deprecating the <bdo> tag, though that is a separate  
>> thread.  If
>> we did, then it would be clearer that @dir has a different role  
>> than @lang.
>
> I must admit I do not understand the justification for including  
> 'rlo' and 'lro' as enumerated values for the @dir attribute. The  
> @dir attribute is very much about structural level directionality.  
> While the Unicode bidirectional algorithm is completely about  
> phrase level directionality (as your table and list item examples  
> illustrate). It makes a lot of sense to me to keep these things  
> distinct. That means it would be better to maintain the phrase  
> element BDO in HTML5 or to use the related Unicode bidi control  
> characters. I imagine a lot of discussion has gone on around  
> this issue, but I haven't seen it or been a part of it.

Just to follow up on the distinction between the @dir attribute and
the BDO element, I put together an example where @dir is significant
at the phrase level. A couple of things are apparent from this
example. First, the suggestion Maciej made that @dir only applies to
neutral Unicode characters is not correct. In the attached example,
the only neutrals are the space characters. The @dir attribute
basically specifies whether this is a run that should be treated as a
Latin run with some Arabic (ltr) or as an Arabic run with some Latin
(rtl). The rtl version is probably the way it really should be marked
up, except that the Unicode page[1] I borrowed this from wanted to
present several languages, one after the other, all ending with
"in <<Language x>>".
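
As a rough way to check the claim about neutrals, here is a minimal
Python sketch that looks up the Unicode bidi class of each character
in a mixed Arabic/Latin phrase. The sample string is made up for
illustration and is not the contents of the attached test file; the
STRONG and NEUTRAL groupings and the helper names are likewise
assumptions of this sketch.

```python
import unicodedata

# Hypothetical sample (NOT the attached test file): an Arabic word
# (U+0639 U+0631 U+0628 U+064A) followed by a Latin phrase.
SAMPLE = "\u0639\u0631\u0628\u064a in Arabic"

# Groupings of bidi classes as used by the Unicode bidi algorithm.
STRONG = {"L", "R", "AL"}
NEUTRAL = {"B", "S", "WS", "ON"}


def classify(text):
    """Return a (char, bidi_class, kind) tuple for each character."""
    rows = []
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            kind = "strong"
        elif cls in NEUTRAL:
            kind = "neutral"
        else:
            kind = "weak/other"
        rows.append((ch, cls, kind))
    return rows


def neutral_chars(text):
    """Return the set of characters in text with a neutral bidi class."""
    return {ch for ch, _, kind in classify(text) if kind == "neutral"}


if __name__ == "__main__":
    for ch, cls, kind in classify(SAMPLE):
        print(f"U+{ord(ch):04X} {cls:>2} {kind}")
```

For this sample the only character reported as neutral is the space
(class WS). The Arabic letters are class AL and the Latin letters are
class L, so every other character is a strong type, yet the direction
given for the run still decides how the two halves are ordered.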

Second, the distinction I made between @dir applying at the
structural/block level and BDO applying at the phrase/inline level is
the wrong way to put it. Instead, I would say that the @dir attribute
applies to an outer level of embedding for a phrase (though not
necessarily the outermost level of embedding at the paragraph
line-break level), whereas I think the BDO element is only needed in
very rare cases. I think adding 'rlo' and 'lro' as values of a global
attribute makes them available in all sorts of places where they're
just not needed. You might say that having an entire element just for
bidirectional overrides also places a big importance on it, but it's
an element that most people can ignore and need never concern
themselves with. Even for Hebrew and Arabic, the instances where BDO
is needed have got to be quite scarce indeed (the examples I've seen
relate to part numbers that mix Arabic and other scripts in peculiar
ways). I think we should make it clear how rare the need for the BDO
element is, so authors don't needlessly try to use it.
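
To make the embedding/override distinction concrete, here is a
minimal Python sketch that uses the Unicode bidi control characters
as plain-text stand-ins. It assumes the usual correspondence in which
@dir on an inline element behaves like an embedding (LRE/RLE ... PDF)
and BDO behaves like an override (LRO/RLO ... PDF). The function
names and the part number are hypothetical.

```python
# Unicode bidi control characters.
LRE, RLE = "\u202a", "\u202b"  # left-to-right / right-to-left embedding
LRO, RLO = "\u202d", "\u202e"  # left-to-right / right-to-left override
PDF = "\u202c"                 # pop directional formatting


def embed(text, direction):
    """Rough plain-text analogue of dir=... on an inline element.

    Opens a new embedding level; characters inside keep their own
    bidi classes and are still reordered by the bidi algorithm.
    """
    start = {"ltr": LRE, "rtl": RLE}[direction]
    return start + text + PDF


def override(text, direction):
    """Rough plain-text analogue of <bdo dir=...>.

    Forces every character inside to be treated as strongly
    left-to-right or right-to-left, ignoring its own bidi class.
    """
    start = {"ltr": LRO, "rtl": RLO}[direction]
    return start + text + PDF


if __name__ == "__main__":
    # Hypothetical part number mixing an Arabic letter with Latin/digits.
    part_number = "AB-\u0639-123"
    print(repr(embed(part_number, "rtl")))
    print(repr(override(part_number, "ltr")))
```

The embedding form is what most mixed-direction phrases need. The
override form only matters for cases like the part number, where the
stored character order must be displayed exactly as is.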

Take care,
Rob

[1]: <http://www.unicode.org/standard/WhatIsUnicode.html>
Attachments

text/html attachment: languageScriptTest.html
Received on Tuesday, 7 August 2007 07:41:24 UTC