RE: authoring @lang and @dir (was 3.6. The root element) from Richard Ishida on 2007-08-07 (public-html@w3.org from August 2007)

From: Richard Ishida <ishida@w3.org>
Date: Tue, 7 Aug 2007 09:18:32 +0100
To: "'Robert Burns'" <rob@robburns.com>
Cc: "'HTML WG'" <public-html@w3.org>
Message-ID: <00e501c7d8cb$8b95a7b0$6501a8c0@rishida>
Just skimming my email because I'm on vacation and have to dash off to other
things in a moment...

Robert, you may find the following useful:

What you need to know about the bidi algorithm and inline markup 
http://www.w3.org/International/articles/inline-bidi-markup/

Creating (X)HTML Pages in Arabic & Hebrew 
http://www.w3.org/International/tutorials/bidi-xhtml/

I18n tests 
http://www.w3.org/International/tests/sec-dir-0.html 
http://www.w3.org/International/tests/test-rtl-chrome-0 
http://www.w3.org/International/tests/sec-inline-bidi-0 


Hope that's of some help until I have more time to respond.

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/
 
 

> -----Original Message-----
> From: Robert Burns [mailto:rob@robburns.com] 
> Sent: 07 August 2007 08:41
> To: Robert Burns
> Cc: HTML WG; Richard Ishida
> Subject: Re: authoring @lang and @dir (was 3.6. The root element)
> 
> 
> On Aug 2, 2007, at 2:51 PM, Robert Burns wrote:
> 
> >
> > Hi Richard,
> >
> > Thanks for the information on this.
> >
> > On Aug 2, 2007, at 7:16 AM, Richard Ishida wrote:
> >
> >>
> >>> From: public-html-request@w3.org
> >>> [mailto:public-html-request@w3.org] On Behalf Of Robert Burns
> >>> Sent: 01 August 2007 07:18
> >>
> >>> That is a good example. However, the RFC 3066 language codes allow
> >>> one to specify both language and different script
> >>
> >> Note in passing that RFC 3066 didn't allow this, and it is now an
> >> obsolete specification. It was replaced by RFC 4646, which does allow
> >> for scripts to be specified, though only when absolutely necessary to
> >> distinguish usage, not as a matter of course.
> >> (See http://www.w3.org/International/articles/language-tags/ for more
> >> details.)
> >
> > I guess I knew that. I pulled RFC 3066 out of the current HTML5 draft.
> > I thought I noticed some things missing that I expected to see in the
> > language codes RFC. Also, I understand scripts do not need to be
> > specified; however, I would say there are some cases where it is
> > absolutely necessary to distinguish usage.
> >
> >>
> >>
> >>> variants. So Hebrew written with the Latin script could be
> >>> designated by lang='iw-LATN' (dir='LTR'); standard Hebrew as
> >>> lang='iw' (dir='RTL'); Turkish as lang='tr-LATN'; and Turkish in
> >>> the Arabic script as lang='tr-Arab' (dir='RTL').
> >>
> >> Note in passing that iw is an obsolete code for Hebrew; you should
> >> now use 'he'. See
> >> http://people.w3.org/rishida/utils/subtags/index.php?searchtext=hebrew&submit=Search&searchtype=2
> >
> > I did not know the subtag registry was up and running. That's great
> > news. We should definitely link to that from the HTML5 recommendation.
> >
> >>> With these RFC 3066 language codes everything necessary to designate
> >>> directionality is already there. I think the reason we have both
> >>> @dir and @lang is so that authors have more flexibility in how much
> >>> language detail to provide. Also UAs do not have to hard-wire the
> >>> mappings of RFC 3066 script codes to directionality and
> >>> extract script information from the language codes. That's just my
> >>> speculation on this but perhaps someone else knows more of the
> >>> history behind this.
> >>
> >> See http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.093646470
> >>
> >> In fact, the use of these two attributes doesn't always coincide.
> >>
> >> In a document that is generally in English you may have a small table
> >> that contains only Hebrew or Arabic text.  Although it would make
> >> sense to use @lang once on the <table> element, so that it signifies
> >> that all the text in the table is in a given language and you don't
> >> have to repeat it, you would probably *not* want the table columns to
> >> flow from right to left (as would usually be the case when using
> >> dir="rtl" on the table), since this is an English document. If
> >> xml:lang was associated with direction, you would probably have no
> >> control over that.  Same goes for list items.
> >>
> >> Basically, the two attributes do different jobs. Better to reduce
> >> confusion and scope for error by giving the attributes simple,
> >> clear semantics.
> >
> > Thanks. I hadn't considered that use-case.
> >
> >> It is also perfectly acceptable for people to have been labelling
> >> legacy Azerbaijani content as 'az' until now, and to continue to do
> >> so in the future, but that carries no information about whether they
> >> used the Cyrillic (LTR) or Arabic (RTL) script, since Azerbaijani
> >> uses both.
> >
> > Here I would disagree. I'm not familiar with which script would be
> > considered the default (as Hebrew would for Hebrew); however, it
> > strikes me that if an author is trying to demarcate written language
> > in a document and there's ambiguity over what script a language code
> > implies, then it would be best practice to include a script code. So I
> > would not agree that it is perfectly acceptable to omit a script where
> > ambiguity exists.
> >
> >> An IPA (International Phonetic Alphabet) transcription of Hebrew 
> >> could well be marked as 'he', but it would be incorrect to assume 
> >> that the directionality was RTL.
> >
> > Phonetics in Unicode needs much attention. However, the way phonetics
> > is handled now is as an extension to the Latin script. So phonetically
> > writing Hebrew using IPA would be to write Hebrew in a Latin script
> > (implying left-to-right). It would be poor practice to mark up such
> > text with just lang='he' and not lang='he-LATN'.
> > Personally, I'd rather see phonetics introduced as its own script in
> > Unicode with its own RFC 4646 IANA registered script code (I see there
> > are some phonetic related script codes in IANA). By using a single
> > unified phonetic script the various glyphs used by different phonetic
> > alphabets could simply be handled by changing fonts or even a smart
> > font that substitutes glyphs according to phonetic alphabet metadata.
> >
> >> Hope that helps,
> >
> > Yes, Thanks.
> >
> >> PS: Note also that @dir used in DITA, XHTML2, ITS, etc has additional
> >> values of lro (left-right-override) and rlo (right-left-override),
> >> which cannot be expressed by @lang. In fact we could consider making
> >> that the case for HTML5, and deprecating the <bdo> tag, though that
> >> is a separate thread.  If we did, then it would be clearer that @dir
> >> has a different role than @lang.
> >
> > I must admit I do not understand the justification for including 'rlo'
> > and 'lro' as enumerated values for the @dir attribute. The @dir
> > attribute is very much about structural level directionality,
> > while the Unicode bidirectional algorithm is completely about phrase
> > level directionality (as your table and list item examples
> > illustrate). It makes a lot of sense to me to keep these things
> > distinct. That means it would be better to maintain the phrase element
> > BDO in HTML5 or to use the related Unicode bidi control characters.
> > I imagine a lot of discussion has gone on around this issue, but I
> > haven't seen it or been a part of it.
> 
> Just to follow up on this issue of the distinction between 
> the @dir attribute and the BDO element, I put together an 
> example where the @dir is significant at the phrase level. A 
> couple of things are apparent from this example. First, the 
> suggestion Maciej made that @dir only applies to neutral 
> Unicode characters is not correct. In the attached example, 
> the only neutrals are the space characters. The @dir
> attribute basically specifies whether this is a run that
> should be treated as a Latin run with some Arabic (ltr) or
> whether this is an Arabic run with some Latin
> (rtl).  The rtl is really the way it should probably be
> marked up, except that the Unicode page[1] that I borrowed this
> from wanted to present several languages, one after the
> other, and all ending with "in <<Language x>>".
> 
> Second, the distinction I made between @dir that should apply 
> at the structural/block level, and BDO that should apply at 
> the phrase/inline level is the wrong way to put it. Instead,
> I would say that the @dir attribute applies to an outer level 
> of embedding for a phrase (though not necessarily the 
> outermost level of embedding at the paragraph line-break 
> level), whereas I think the BDO element is only needed in 
> very rare cases. I think adding 'rlo' and 'lro' as values of a
> global attribute makes them available in all sorts of places
> where they're just not needed. You might say that having an entire
> element just for bidirectional overrides also places a big
> importance on it, but it's an element that most people can
> ignore and need never concern themselves with. Even for
> Hebrew and Arabic, the instances where the BDO is needed
> have got to be quite scarce indeed (the examples I've seen
> related to part numbers that mix Arabic and other scripts in
> peculiar ways). I think we should make it clear how rare the
> need for the BDO element is, so authors don't needlessly try to use it.
> 
> Take care,
> Rob
> 
> [1]: <http://www.unicode.org/standard/WhatIsUnicode.html>
>
Received on Tuesday, 7 August 2007 08:16:53 UTC
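
The point made in the thread, that a language tag only implies a direction when it carries a script subtag, can be illustrated with a rough Python sketch. Everything here is an assumption for illustration: the helper names `script_subtag` and `direction_from_tag` are hypothetical, the script-to-direction table is deliberately tiny, and the parsing is a simplification of RFC 4646 rather than a conforming implementation.

```python
from typing import Optional

# Illustrative, deliberately incomplete mapping from ISO 15924 script
# codes to base direction (an assumption, not a registry).
SCRIPT_DIRECTION = {
    "Latn": "ltr",
    "Cyrl": "ltr",
    "Arab": "rtl",
    "Hebr": "rtl",
}


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a language tag, or None if absent.

    Simplification: after the primary language subtag, the first
    four-letter ASCII alphabetic subtag is taken to be the script.
    """
    parts = tag.split("-")
    for part in parts[1:]:
        if len(part) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if len(part) == 4 and part.isascii() and part.isalpha():
            return part.title()  # normalise 'LATN' or 'latn' to 'Latn'
    return None


def direction_from_tag(tag: str) -> Optional[str]:
    """Guess a base direction from the script subtag alone.

    Returns None when the tag has no script subtag (or an unknown one),
    which is exactly the case where @dir is still needed.
    """
    script = script_subtag(tag)
    if script is None:
        return None
    return SCRIPT_DIRECTION.get(script)


if __name__ == "__main__":
    for tag in ("he-Latn", "tr-Arab", "az", "he"):
        print(tag, direction_from_tag(tag))
```

With this sketch, 'az' and 'he' yield no direction at all, which matches the Azerbaijani and IPA examples above: the tag by itself does not say which script was used. Even where a script subtag is present, the English-document-with-Hebrew-table case shows that the desired layout direction may differ from what the script suggests.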