- From: Robert Burns <rob@robburns.com>
- Date: Thu, 2 Aug 2007 14:51:49 -0500
- To: HTML WG <public-html@w3.org>
- Cc: Richard Ishida <ishida@w3.org>
Hi Richard,

Thanks for the information on this.

On Aug 2, 2007, at 7:16 AM, Richard Ishida wrote:

>> From: public-html-request@w3.org
>> [mailto:public-html-request@w3.org] On Behalf Of Robert Burns
>> Sent: 01 August 2007 07:18
>
>> That is a good example. However, the RFC 3066 language codes
>> allow one to specify both language and different script
>
> Note in passing that RFC 3066 didn't allow this, and it is now an obsolete
> specification. It was replaced by RFC 4646, which does allow for scripts to
> be specified, though only when absolutely necessary to distinguish usage,
> not as a matter of course.
> (See http://www.w3.org/International/articles/language-tags/ for more
> details.)

I guess I knew that. I pulled RFC 3066 out of the current HTML5 draft. I
thought I noticed some things missing that I expected to see in the
language codes RFC. Also, I understand that scripts do not need to be
specified; however, I would say there are some cases where it is
absolutely necessary to distinguish usage.

>> variants. So Hebrew written with the Latin script could be
>> designated by lang='iw-LATN' (dir='LTR'); standard Hebrew as
>> lang='iw' (dir='RTL'); Turkish as lang='tr-LATN'; and
>> Turkish in Arabic as lang='tr-Arab' (dir='RTL').
>
> Note in passing that iw is an obsolete code for Hebrew, you should now use
> 'he'. See
> http://people.w3.org/rishida/utils/subtags/index.php?searchtext=hebrew&submit=Search&searchtype=2

I did not know the subtag registry was up and running. That's great news.
We should definitely link to that from the HTML5 recommendation.

>> With these RFC 3066 language codes everything necessary to
>> designate directionality is already there. I think the reason
>> we have both @dir and @lang is so that authors have more
>> flexibility in how much language detail to provide.
>> Also UAs do not have to hard-wire the mappings of about RFC 3066
>> scripts codes to directionality and extract script
>> information from the language codes. That's just my
>> speculation on this but perhaps someone else knows more of
>> the history behind this.
>
> See http://www.w3.org/TR/i18n-html-tech-lang/#ri20050208.093646470
>
> In fact, the use of these two attributes doesn't always coincide.
>
> In a document that is generally in English you may have a small table that
> contains only Hebrew or Arabic text. Although it would make sense to use
> @lang once on the <table> element, so that it signifies that all the text in
> the table is in a given language and you don't have to repeat it, you would
> probably *not* want the table columns to flow from right to left (as would
> usually be the case when using dir="rtl" on the table), since this is an
> English document. If xml:lang was associated with direction, you would
> probably have no control over that. Same goes for list items.
>
> Basically, the two attributes do different jobs. Better reduce confusion and
> scope for error by having simple, clear semantics to the attributes.

Thanks. I hadn't considered that use-case.

> It is also perfectly acceptable for people to have been labelling legacy
> Azerbaijani content as 'az' until now, and to continue to do so in the
> future, but that carries no information about whether they used the cyrillic
> (LTR) or arabic (RTL) script, since Azerbaijani uses both.

Here I would disagree. I'm not familiar with which script would be
considered the default (as Hebrew would for Hebrew). However, it strikes
me that if an author is trying to demarcate written language in a
document and there's ambiguity over what script a language code implies,
then it would be best practice to include a script code. So I would not
agree that it is perfectly acceptable to omit a script where ambiguity
exists.
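
To make the ambiguity concrete, here is a minimal Python sketch of what "extracting script information from the language codes" might look like. Everything in it is an illustrative assumption rather than anything proposed in this thread: the function names `script_subtag` and `direction_hint` are hypothetical, the script table is a tiny subset of ISO 15924, and the subtag scan is a simplification of the RFC 4646 grammar.

```python
from typing import Optional

# Illustrative subset only (assumption): a real table would need to
# cover every ISO 15924 script code, not just these four.
SCRIPT_DIRECTION = {
    "arab": "rtl",
    "hebr": "rtl",
    "latn": "ltr",
    "cyrl": "ltr",
}


def script_subtag(tag: str) -> Optional[str]:
    """Return the lowercased script subtag of a language tag, if any.

    Simplified reading of RFC 4646: the script subtag is a four-letter
    alphabetic subtag that follows the primary language subtag.
    """
    subtags = tag.lower().split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if len(sub) == 4 and sub.isalpha():
            return sub
    return None


def direction_hint(tag: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' when the tag names a known script, else None."""
    script = script_subtag(tag)
    if script is None:
        return None
    return SCRIPT_DIRECTION.get(script)


for tag in ("az", "az-Arab", "az-Cyrl", "he-LATN"):
    print(tag, direction_hint(tag))
```

With this sketch a bare 'az' yields no hint at all, while 'az-Arab' and 'az-Cyrl' do. Even where a hint exists, it says nothing about whether table columns or list items should flow right to left, which is the job left to @dir in the table example above.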
> An IPA (International Phonetic Alphabet) transcription of Hebrew could well
> be marked as 'he', but it would be incorrect to assume that the
> directionality was RTL.

Phonetics in Unicode needs much attention. However, the way phonetics is
handled now is as an extension to the Latin script. So phonetically
writing Hebrew using IPA would be to write Hebrew in a Latin script
(implying left-to-right). It would be poor practice to mark up such text
with just lang='he' and not lang='he-LATN'. Personally, I'd rather see
phonetics introduced as its own script in Unicode with its own RFC 4646
IANA-registered script code (I see there are some phonetic-related script
codes in IANA). By using a single unified phonetic script, the various
glyphs used by different phonetic alphabets could simply be handled by
changing fonts, or even by a smart font that substitutes glyphs according
to phonetic alphabet metadata.

> Hope that helps,

Yes, thanks.

> PS: Note also that @dir used in DITA, XHTML2, ITS, etc has additional values
> of lro (left-right-override) and rlo (right-left-override), which cannot be
> expressed by @lang. In fact we could consider making that the case for
> HTML5, and deprecating the <bdo> tag, though that is a separate thread. If
> we did, then it would be clearer that @dir has a different role than @lang.

I must admit I do not understand the justification for including 'rlo'
and 'lro' as enumerated values for the @dir attribute. The @dir attribute
is very much about structural-level directionality, while the Unicode
bidirectional algorithm is completely about phrase-level directionality
(as your table and list item examples illustrate). It makes a lot of
sense to me to keep these things distinct. That means it would be better
to maintain the phrase element BDO in HTML5 or to use the related Unicode
bidi control characters. I imagine a lot of discussion has gone on around
this issue, but I haven't seen it or been a part of it.

Take care,
Rob
Received on Thursday, 2 August 2007 19:52:33 UTC