- From: Richard Ishida <ishida@w3.org>
- Date: Tue, 7 Aug 2007 09:18:32 +0100
- To: "'Robert Burns'" <rob@robburns.com>
- Cc: "'HTML WG'" <public-html@w3.org>
Just skimming my email because I'm on vacation and have to dash off to
other things in a moment...

Robert, you may find the following useful:

What you need to know about the bidi algorithm and inline markup
http://www.w3.org/International/articles/inline-bidi-markup/

Creating (X)HTML Pages in Arabic & Hebrew
http://www.w3.org/International/tutorials/bidi-xhtml/

I18n tests
http://www.w3.org/International/tests/sec-dir-0.html
http://www.w3.org/International/tests/test-rtl-chrome-0
http://www.w3.org/International/tests/sec-inline-bidi-0

Hope that's of some help until I have more time to respond.

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/People/Ishida/
http://www.w3.org/International/
http://people.w3.org/rishida/blog/
http://www.flickr.com/photos/ishida/

> -----Original Message-----
> From: Robert Burns [mailto:rob@robburns.com]
> Sent: 07 August 2007 08:41
> To: Robert Burns
> Cc: HTML WG; Richard Ishida
> Subject: Re: authoring @lang and @dir (was 3.6. The root element)
>
> On Aug 2, 2007, at 2:51 PM, Robert Burns wrote:
>
> > Hi Richard,
> >
> > Thanks for the information on this.
> >
> > On Aug 2, 2007, at 7:16 AM, Richard Ishida wrote:
> >
> >>> From: public-html-request@w3.org
> >>> [mailto:public-html-request@w3.org] On Behalf Of Robert Burns
> >>> Sent: 01 August 2007 07:18
> >>
> >>> That is a good example. However, the RFC 3066 language codes allow
> >>> one to specify both language and different script
> >>
> >> Note in passing that RFC 3066 didn't allow this, and it is now an
> >> obsolete specification. It was replaced by RFC 4646, which does allow
> >> for scripts to be specified, though only when absolutely necessary to
> >> distinguish usage, not as a matter of course.
> >> (See http://www.w3.org/International/articles/language-tags/ for more
> >> details.)
> >
> > I guess I knew that. I pulled RFC 3066 out of the current HTML5 draft.
> > I thought I noticed some things missing that I expected to see in the
> > language codes RFC. Also, I understand scripts do not need to be
> > specified; however, I would say there are some cases where it is
> > absolutely necessary to distinguish usage.
> >
> >>> variants. So Hebrew written with the Latin script could be
> >>> designated by lang='iw-LATN' (dir='LTR'); standard Hebrew as
> >>> lang='iw' (dir='RTL'); Turkish as lang='tr-LATN'; and Turkish in
> >>> Arabic as lang='tr-Arab' (dir='RTL').
> >>
> >> Note in passing that iw is an obsolete code for Hebrew; you should
> >> now use 'he'. See
> >> http://people.w3.org/rishida/utils/subtags/index.php?searchtext=hebrew&submit=Search&searchtype=2
> >
> > I did not know the subtag registry was up and running. That's great
> > news. We should definitely link to that from the HTML5 recommendation.
> >
> >>> With these RFC 3066 language codes everything necessary to designate
> >>> directionality is already there. I think the reason we have both
> >>> @dir and @lang is so that authors have more flexibility in how much
> >>> language detail to provide. Also UAs do not have to hard-wire the
> >>> mappings of RFC 3066 script codes to directionality and
> >>> extract script information from the language codes. That's just my
> >>> speculation on this but perhaps someone else knows more of the
> >>> history behind this.
> >>
> >> See http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.093646470
> >>
> >> In fact, the use of these two attributes doesn't always coincide.
> >>
> >> In a document that is generally in English you may have a small table
> >> that contains only Hebrew or Arabic text.
> >> Although it would make
> >> sense to use @lang once on the <table> element, so that it signifies
> >> that all the text in the table is in a given language and you don't
> >> have to repeat it, you would probably *not* want the table columns to
> >> flow from right to left (as would usually be the case when using
> >> dir="rtl" on the table), since this is an English document. If
> >> xml:lang was associated with direction, you would probably have no
> >> control over that. Same goes for list items.
> >>
> >> Basically, the two attributes do different jobs. Better reduce
> >> confusion and scope for error by having simple, clear semantics to
> >> the attributes.
> >
> > Thanks. I hadn't considered that use-case.
> >
> >> It is also perfectly acceptable for people to have been labelling
> >> legacy Azerbaijani content as 'az' until now, and to continue to do
> >> so in the future, but that carries no information about whether they
> >> used the Cyrillic (LTR) or Arabic (RTL) script, since Azerbaijani
> >> uses both.
> >
> > Here I would disagree. I'm not familiar with which script would be
> > considered the default (as Hebrew would for Hebrew); however, it
> > strikes me that if an author is trying to demarcate written language
> > in a document and there's ambiguity over what script a language code
> > implies, then it would be best practice to include a script code. So I
> > would not agree that it is perfectly acceptable to omit a script where
> > ambiguity exists.
> >
> >> An IPA (International Phonetic Alphabet) transcription of Hebrew
> >> could well be marked as 'he', but it would be incorrect to assume
> >> that the directionality was RTL.
> >
> > Phonetics in Unicode needs much attention. However, the way phonetics
> > is handled now is as an extension to the Latin script. So phonetically
> > writing Hebrew using IPA would be to write Hebrew in a Latin script
> > (implying left-to-right).
> > It would be poor practice to mark up such
> > text with just lang='he' and not lang='he-LATN'.
> > Personally, I'd rather see phonetics introduced as its own script in
> > Unicode with its own RFC 4646 IANA registered script code (I see there
> > are some phonetic related script codes in IANA). By using a single
> > unified phonetic script the various glyphs used by different phonetic
> > alphabets could simply be handled by changing fonts or even a smart
> > font that substitutes glyphs according to phonetic alphabet metadata.
> >
> >> Hope that helps,
> >
> > Yes, Thanks.
> >
> >> PS: Note also that @dir used in DITA, XHTML2, ITS, etc has additional
> >> values of lro (left-right-override) and rlo (right-left-override),
> >> which cannot be expressed by @lang. In fact we could consider making
> >> that the case for HTML5, and deprecating the <bdo> tag, though that
> >> is a separate thread. If we did, then it would be clearer that @dir
> >> has a different role than @lang.
> >
> > I must admit I do not understand the justification for including 'rlo'
> > and 'lro' as enumerated values for the @dir attribute. The @dir
> > attribute is very much about structural level directionality,
> > while the Unicode bidirectional algorithm is completely about phrase
> > level directionality (as your table and list item examples
> > illustrate). It makes a lot of sense to me to keep these things
> > distinct. That means it would be better to maintain the phrase element
> > BDO in HTML5 or to use the related Unicode bidi control characters.
> > I imagine a lot of discussion has gone on around this issue, but I
> > haven't seen it or been a part of it.
>
> Just to follow up on this issue of the distinction between
> the @dir attribute and the BDO element, I put together an
> example where the @dir is significant at the phrase level. A
> couple of things are apparent from this example.
> First, the
> suggestion Maciej made that @dir only applies to neutral
> Unicode characters is not correct. In the attached example,
> the only neutrals are the space characters. The @dir
> attribute basically specifies whether this is a run that
> should be treated as a Latin run with some Arabic (ltr) or
> whether this is an Arabic run that should be treated as Latin
> (rtl). The rtl is really the way it should probably be
> marked up, except that the Unicode page [1] that I borrowed this
> from wanted to present several languages, one after the
> other, and all ending with "in <<Language x>>".
>
> Second, the distinction I made between @dir that should apply
> at the structural/block level, and BDO that should apply at
> the phrase/inline level is the wrong way to put it. Instead,
> I would say that the @dir attribute applies to an outer level
> of embedding for a phrase (though not necessarily the
> outermost level of embedding at the paragraph line-break
> level), whereas I think the BDO element is only needed in
> very rare cases. I think adding the values 'rlo' and 'lro' as
> global attributes makes them available in all sorts of places
> they're just not needed. You might say that having an entire
> element just for bidirectional overrides also places a big
> importance on it, but it's an element that most people can
> ignore and need never concern themselves with. Even for
> Hebrew and Arabic the instances where the BDO is needed
> have got to be quite scarce indeed (the examples I've seen
> related to part numbers that mix Arabic and other scripts in
> peculiar ways). I think we should make it clear how rare the
> need for the BDO element is, so authors don't needlessly try to use it.
>
> Take care,
> Rob
>
> [1]: <http://www.unicode.org/standard/WhatIsUnicode.html>
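
The thread above turns on whether a language tag alone can determine direction. As a minimal sketch only, the Python below shows that a direction can be guessed from a tag only when an RFC 4646 script subtag is present. The helper names (script_subtag, direction_from_tag) are hypothetical, and RTL_SCRIPTS is a small illustrative subset of right-to-left script codes, not a complete registry. A bare tag such as 'az' or 'he' yields no answer, and layout choices such as table column order would still need @dir.

```python
# Illustrative subset of right-to-left script codes (assumption, not a registry).
RTL_SCRIPTS = {"Arab", "Hebr", "Syrc", "Thaa"}


def script_subtag(tag):
    """Return the script subtag of a language tag (title-cased), or None.

    In RFC 4646 a script subtag is exactly four ASCII letters and follows
    the primary language subtag. Scanning stops at a singleton (such as
    'x'), because extension and private-use subtags follow it.
    """
    subtags = tag.split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            break
        if len(sub) == 4 and sub.isascii() and sub.isalpha():
            return sub.title()
    return None


def direction_from_tag(tag):
    """Guess 'rtl' or 'ltr' from the script subtag; None if no script is given."""
    script = script_subtag(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"


if __name__ == "__main__":
    for tag in ("he", "he-Latn", "tr-LATN", "tr-Arab", "az", "az-Cyrl"):
        print(tag, direction_from_tag(tag))
```

The None result for 'he' and 'az' reflects Richard's point that such tags carry no script information, so any default direction would have to come from elsewhere.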
Received on Tuesday, 7 August 2007 08:16:53 UTC