Re: Arabic or Hebrew languages (Right to Left Languages) and SKOS, XML,RDF,etc.

From: Richard Ishida <ishida@w3.org>
Date: Thu, 26 May 2011 17:21:37 +0100
Message-ID: <4DDE7E11.1060007@w3.org>
To: Christophe Dupriez <christophe.dupriez@destin.be>
CC: public-esw-thes@w3.org

I'm a bit pushed for time, so here are some very quick notes...

On 26/05/2011 16:12, Christophe Dupriez wrote:
> Hi again to all of you: thank you for the hints!
>
> What exactly happens:
> 1) xml:lang attribute declares the user language targeted by a given XML
> literal (this in XML, RDF or SKOS)
> 2) Unicode characters carry in themselves (in their definition)
> the "script" and the direction they must be written in.
> 3) You can find Latin words (written Left to Right) in Arabic texts (or
> Chinese texts or Hebrew texts or Thai texts...) and vice-versa
> 4) It is a practical issue I have: the browsers (and the text editors
> like Notepad) do not pick the right direction unless they are told to
> change direction.

The issue typically only lies with applying the correct 'base direction' 
to the appropriate range of content in bidirectional text. When using 
plain text, you need to use stateful Unicode control characters to 
achieve this, but we strongly recommend that you use markup where 
possible, since this avoids a bunch of potential issues.
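
As a minimal sketch of the difference (Python is used purely for 
illustration; the function names and the sample label are hypothetical), 
the plain-text route wraps the label in U+202B RIGHT-TO-LEFT EMBEDDING 
and U+202C POP DIRECTIONAL FORMATTING, while the markup route leaves the 
data untouched and declares the base direction on an element:

```python
from xml.sax.saxutils import escape

RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING: opens an embedded RTL range
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING: closes the most recent embedding


def wrap_with_controls(label: str) -> str:
    """Plain-text approach: the direction travels inside the data itself."""
    return f"{RLE}{label}{PDF}"


def wrap_with_markup(label: str, direction: str = "rtl") -> str:
    """Markup approach: the data is unchanged; an element carries the direction."""
    return f'<span dir="{direction}">{escape(label)}</span>'


# Hypothetical label: an Arabic word followed by a Latin acronym.
label = "\u0645\u0643\u062a\u0628\u0629 W3C"

controlled = wrap_with_controls(label)
marked_up = wrap_with_markup(label)

# The control characters are invisible but become part of the string,
# so the stored label no longer compares equal to the original.
print(len(label), len(controlled), controlled == label)
print(marked_up)
```

Note that the control-character version changes the stored value (it is 
two characters longer), which is one of the potential issues that markup 
avoids when labels are later compared, searched or sorted.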

For a quick review of the concepts here, read "What you need to know 
about the bidi algorithm and inline markup" at 
http://www.w3.org/International/articles/inline-bidi-markup/ 
Understanding the effect of base direction on directional runs is the 
critical bit, but you'll also see examples of where the Unicode 
bidirectional algorithm (ie. character-based directional info) alone is 
(naturally) insufficient.

We recommend that you use markup with the same semantics as the dir 
attribute in HTML, including a default direction of LTR, and inheritance 
as described.  The ITS specification uses its:dir in its examples, but 
the spec describes 'data categories', ie. conceptual and semantic 
behaviour rather than specific vocabulary, so you don't need to use a 
namespace as long as you implement conformant behaviour.  On the other 
hand, using an attribute called 'dir' makes sense because it is 
recognisable and avoids potential confusion.
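
The inheritance behaviour can be sketched roughly as follows. This is 
only an illustration under stated assumptions: the element names 
(concepts, group, label) are made up for the example, and only the 
values 'ltr' and 'rtl' are considered. An element that has no dir 
attribute takes the direction of its parent, and the root falls back to 
LTR:

```python
import xml.etree.ElementTree as ET


def resolve_directions(root, default="ltr"):
    """Return (tag, direction) pairs in document order.

    An element without a 'dir' attribute inherits its parent's direction;
    the root inherits the default, which is LTR.
    """
    resolved = []

    def walk(element, inherited):
        direction = element.get("dir", inherited).lower()
        resolved.append((element.tag, direction))
        for child in element:
            walk(child, direction)

    walk(root, default)
    return resolved


# Hypothetical document: element names are illustrative, not from any spec.
doc = ET.fromstring(
    "<concepts>"
    '<label xml:lang="en">Library</label>'
    '<group dir="rtl">'
    '<label xml:lang="ar">\u0645\u0643\u062a\u0628\u0629</label>'
    '<label xml:lang="en" dir="ltr">W3C</label>'
    "</group>"
    "</concepts>"
)

for tag, direction in resolve_directions(doc):
    print(tag, direction)
```

Here the Arabic label picks up RTL from the enclosing group without 
carrying its own attribute, and the English label inside the same group 
overrides it explicitly.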

>
> I consider (4) a browser "bug": sooner or later, browsers will adapt
> the default direction and default alignment (left or right align) by
> themselves depending on the Unicode characters encountered in the text
> written inside a block.

Actually, that would lead to chaos. There are some situations where that 
may be appropriate, but still not totally adequate, as described in the 
Additional Requirements for Bidi in HTML at 
http://www.w3.org/International/docs/html-bidi-requirements/, but in 
that document such an approach is carefully controlled and marked up.
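
To illustrate why guessing from the characters alone is fragile, here is 
a rough sketch of one common heuristic: take the direction of the first 
character with strong directionality. The function name and sample 
strings are hypothetical, and this is not presented as what any browser 
actually does:

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess a base direction from the first strongly directional character.

    Bidi classes 'L' (left-to-right), 'R' and 'AL' (right-to-left) are
    strong; digits, spaces and punctuation are skipped because they are
    weak or neutral.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


# Digits are weak, so the Hebrew word decides.
print(first_strong_direction("123 \u05e9\u05dc\u05d5\u05dd"))
# A Latin acronym at the start wins, even if the rest is Arabic.
print(first_strong_direction("W3C \u0645\u0643\u062a\u0628\u0629"))
# No strong character at all: only the default is left.
print(first_strong_direction("123 ..."))
```

The second case is the problematic one: an Arabic phrase that happens to 
begin with a Latin acronym is guessed as LTR, so the guess still has to 
be overridable by explicit markup.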

Btw, we see directional markup as pseudo-semantic, ie. it is *not* purely 
presentational.  For example, you should be able to remove CSS styling 
and still read the content, but you are likely to lose significant 
information if you omit the directional markup (which is why we say that 
you should always use dedicated markup rather than just apply styling).

>
> The short term solution ("browser adaptation") may be to check all
> characters (first characters may have only "weak" directionality and
> Arabic words can be hidden in a Latin text) to see whether there is any
> Arabic or Hebrew inside. Then add Unicode markup to signal RTL text
> within the literal.

See above.
>
> Left or right alignment? I am wondering if this should not be decided
> based on the target user language rather than on the characters' script.

Alignment is related to the base direction.  For example, in a series of 
short Arabic lines containing one line in English, a reader may be 
unhappy to see the English line way over on the other side of the page 
from the others.

HTH,
RI


>
> Do you agree with this approach (pure data, character sniffing before
> output to add RTL where necessary for current browsers, left/right
> alignment based on xml:lang) ?
>
> Have a nice day!
>
> Christophe
>
>
>
> On 26/05/2011 16:05, Thad Guidry wrote:
>> Oops, forgot to include the good tutorial that I have used in the
>> past: http://www.w3.org/International/tutorials/bidi-xhtml/
>>
>> On Thu, May 26, 2011 at 9:01 AM, Thad Guidry <thadguidry@gmail.com
>> <mailto:thadguidry@gmail.com>> wrote:
>>
>>     Christophe,
>>
>>     I personally do not think SKOS or any other structured format
>>     should concern itself with display and presentation, especially
>>     adding control chars within the data itself [1]. Display and
>>     presentation of data should be left to the browser application
>>     itself, and the markup handling.
>>
>>     1. http://www.w3.org/TR/i18n-html-tech-bidi/
>>
>>
>>     On Thu, May 26, 2011 at 4:38 AM, Christophe Dupriez
>>     <christophe.dupriez@destin.be
>>     <mailto:christophe.dupriez@destin.be>> wrote:
>>
>>         Hi!
>>
>>         I would like to know if some best practices have been set up to
>>         support RTL (right to left) languages in XML, RDF or SKOS.
>>
>>         The problem: when displaying Arabic or Hebrew, the browsers
>>         must be told to write from right to left and (ideally) the
>>         text is better displayed aligned on the right rather than the
>>         left.
>>
>>         One may wish that applications not be obliged to make explicit
>>         tests like "if language is Arabic or Hebrew then
>>         RTL+align:right else then LTR+align:left".
>>
>>         What has been done about this? What does the community think
>>         should be done?
>>
>>         I made a test by hand to prepare addition of Arabic to JITA:
>>         http://www.askosi.org/JITA-ar.htm
>>
>>         Other languages of the JITA thesaurus, as used to access E-LIS
>>         (click on concepts in schemas):
>>         http://www.askosi.org/jita
>>
>>         For now, my "feeling" is to add Unicode character x202B before
>>         Arabic and Hebrew labels and Unicode character x202C at the
>>         end (i.e. within the data).
>>         Character x202C is Pop Direction Format: return to the
>>         direction (LTR or RTL) in use when x202B (switch to RTL) was
>>         encountered.
>>
>>         But what do others do???
>>
>>         I will be happy to learn about your thoughts on this topic!
>>
>>         Christophe
>>
>>
>>
>>
>>     --
>>     -Thad
>>     http://www.freebase.com/view/en/thad_guidry
>>
>>
>>
>>
>> --
>> -Thad
>> http://www.freebase.com/view/en/thad_guidry
>

-- 
Richard Ishida
Internationalization Activity Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/
Received on Thursday, 26 May 2011 16:22:01 GMT
