Re: Specifing language direction in RDF from Lars Marius Garshol on 2002-03-01 (www-rdf-interest@w3.org from March 2002)

From: Lars Marius Garshol <larsga@garshol.priv.no>
Date: 01 Mar 2002 23:51:50 +0100
To: <www-rdf-interest@w3.org>
Message-ID: <m3henzhpfd.fsf@pc36.avidiaasen.online.no>

* Jeremy Carroll
|
| RDF simply does not support xhtml:dir and trying to get it to do so
| is mistaken.

If you want to know the base direction of an RDF literal you will need
*some* way to assert the base direction of the literal. Whether you
use an RDF property or something else is not really all that
important, so long as you choose one way to do it. I read Croome as
trying to do it with an RDF property (for text which did not need that
information at all).
 
| Another way of putting bidi in RDF is to put unicode bidi markers
| into the literal. Anyone know how that works? 

Well, actually xhtml:dir performs two different tasks: specifying the
base embedding level of a paragraph, and providing additional
information about the internal structure of a paragraph. In addition
there is the <bdo> element. These three functions (two for xhtml:dir,
one for <bdo>) have counterparts in the control codes. I'll try to
explain.

To take an example, suppose you were to have an RDF literal stored as
follows (lower-case is Latin, uppercase Arabic, according to
convention):

  "car is THE CAR in arabic"

This text would be correctly displayed by bidi-aware code, which would
just follow the rules in the bidi algorithm and analyze the text as
follows (numbers are bidi embedding levels): 

  "car is THE CAR in arabic"
   000000011111110000000000

This text would therefore be displayed as follows:

  "car is RAC EHT in arabic"

The logic is that you read LTR while you're reading English, switch to
RTL when reading Arabic, then back again when you continue. So this is
entirely correct, and no codes are needed.
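
For those who would rather poke at this in code, here is a rough
sketch of the example above. It is a toy model only, not the real
bidi algorithm: it borrows the convention that uppercase stands in
for Arabic, and the names toy_levels and toy_display_ltr are made up
for illustration.

```python
import re

# Toy convention from the example: uppercase letters stand in for Arabic.
# A run is one or more uppercase words separated by single spaces.
ARABIC_RUN = re.compile(r"[A-Z]+(?: [A-Z]+)*")


def toy_levels(text):
    """Return one embedding-level digit per character, assuming an LTR base."""
    levels = ["0"] * len(text)
    for match in ARABIC_RUN.finditer(text):
        for i in range(match.start(), match.end()):
            levels[i] = "1"
    return "".join(levels)


def toy_display_ltr(text):
    """Reverse each stand-in Arabic run, leaving the Latin text in place."""
    return ARABIC_RUN.sub(lambda m: m.group(0)[::-1], text)


literal = "car is THE CAR in arabic"
print(toy_levels(literal))       # 000000011111110000000000
print(toy_display_ltr(literal))  # car is RAC EHT in arabic
```

The space inside "THE CAR" gets level 1 because it sits between two
RTL words; spaces next to Latin words stay at level 0.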

Now suppose that you have the following literal:

  "'car is THE CAR in arabic,' THE ENGLISHMAN SAID"

Here the bidi algorithm would get into trouble unless extra
information were provided somehow: it assumes that the base direction
of this paragraph is LTR, because the first character with hard
directionality (the 'c') is LTR. (Base direction, and embedding
direction, do *not* refer to the order in which the characters that
make up a word are written, but to the order in which the *words* are
written. (Well, roughly, anyway.))

It would therefore end up displaying this as:

  "'car is RAC EHT in arabic,' DIAS NAMHSILGNE EHT"

which is screwed up. You'd read this as "THE ENGLISHMAN SAID ', in
arabic THE CAR car is'". If you tell the system that the overall
direction of text in this paragraph is actually RTL the algorithm will
do better (I'll return to the remaining problem below):

  "DIAS NAMHSILGNE EHT ',in arabic RAC EHT car is'"

This corresponds to having:

  <p xhtml:dir="rtl">'car is THE CAR in arabic,' THE ENGLISHMAN
  SAID</p>

in the XHTML. Providing the information in this way corresponds to
using what the bidi algorithm calls "a higher-level protocol", which
basically means that something outside the text itself is telling the
algorithm what the base direction is.
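
The effect of an RTL base direction on this literal can be imitated
with another toy sketch. Again, this is not the real algorithm: it
assumes the uppercase-is-Arabic convention, treats punctuation as
following the base direction, and toy_display_rtl is a name invented
here.

```python
import re

# Toy convention: lowercase letters are Latin (LTR), uppercase stand in
# for Arabic (RTL). A Latin run is lowercase words joined by single spaces.
LATIN_RUN = re.compile(r"[a-z]+(?: [a-z]+)*")


def toy_display_rtl(text):
    """Lay out text with an RTL base direction (toy model only)."""
    # Step 1: with an RTL base, the paragraph as a whole runs right to left,
    # so mirror everything.
    mirrored = text[::-1]
    # Step 2: Latin runs are still read left to right, so mirror each of
    # them back into readable order.
    return LATIN_RUN.sub(lambda m: m.group(0)[::-1], mirrored)


literal = "'car is THE CAR in arabic,' THE ENGLISHMAN SAID"
print(toy_display_rtl(literal))
# DIAS NAMHSILGNE EHT ',in arabic RAC EHT car is'
```

Note that "car is" and "in arabic" each come out readable, but in the
wrong order relative to each other, because the toy (like the real
algorithm without further hints) knows nothing about the quotation.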

If no higher-level protocol provides any such information the bidi
algorithm will analyze the text and use the direction of the first
character with strong directionality as the base direction. If your
RDF implementation does not support this you could achieve the same
result by putting U+200F (right-to-left mark) first in the literal.
This character is strongly RTL (and would thus cause base direction to
be RTL), but is ignored during display.
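
The first-strong-character rule is easy to sketch with Python's
standard unicodedata module. This is a simplified illustration, not a
full implementation of the rule (it ignores any special cases the
standard defines), and base_direction is a name made up here.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def base_direction(text):
    """Guess the base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return "rtl"
    return "ltr"  # no strong character found: fall back to LTR


quote = "'car is ... in arabic,'"
print(base_direction(quote))        # ltr, since 'c' is the first strong character
print(base_direction(RLM + quote))  # rtl, since the mark itself is strongly RTL
```

The leading apostrophe is skipped in both cases because it has no
strong direction of its own.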

The problem with displaying the sentence as:

  "DIAS NAMHSILGNE EHT ',in arabic RAC EHT car is'"

is that it ignores that the part in '...' is English, and that its
overall direction is LTR. That is, "car is" should appear *first*, not
last, since this whole stretch is an English sentence. It should just
appear in normal English order, with the Arabic phrase in the middle
reversed. 

To achieve this we have to inform the bidi algorithm that the English
quote actually introduces a new embedding level (as it's called) which
has an overall direction (from word to word) that's different from the
base direction. With control codes this is done as follows:

  "'<U+202A>car is THE CAR in arabic<U+202C>,' THE ENGLISHMAN SAID"  

The first code is the left-to-right embedding code (LRE), which says:
here starts an LTR stretch of text. The second code is pop directional
formatting (also known as PDF), which is basically an end tag with no
element type name. (It also ends a right-to-left embedding, as well as
the overrides.)
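
Building such a literal programmatically might look like the sketch
below. The helper embed_ltr is hypothetical, and the leading
right-to-left mark is an assumption on my part: it is one way of
getting the RTL base direction without a higher-level protocol. The
printout merely confirms what Python's unicodedata module knows about
the three characters.

```python
import unicodedata

RLM = "\u200f"  # right-to-left mark
LRE = "\u202a"  # left-to-right embedding
PDF = "\u202c"  # pop directional formatting


def embed_ltr(text):
    """Wrap text in an LTR embedding, closed by the matching pop code."""
    return LRE + text + PDF


quote = "car is THE CAR in arabic"  # uppercase stands in for Arabic
literal = RLM + "'" + embed_ltr(quote) + ",' THE ENGLISHMAN SAID"

# Show code point, character name, and bidi class for each control code.
for ch in (RLM, LRE, PDF):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

The control codes are ordinary characters in the string, so they
count towards its length and take part in string comparison.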

In XHTML you'd do it like this:

  <p ...>'<span xhtml:dir="ltr">car is THE CAR in arabic</span>,' THE
  ENGLISHMAN SAID</p>

In both cases the end result would be the correct display:

  "DIAS NAMHSILGNE EHT ', car is RAC EHT in arabic'"

The <bdo> element (and the corresponding U+202D and U+202E) is used to
indicate text that is stored in visual order, rather than logical
order. This could happen because of conversion from legacy character
encodings, or it could be necessary to format multi-language codes
that would be screwed up if presented with the bidi algorithm. So they
basically turn off bidi for a stretch of text.

If you never get so far as to read this sentence because your brain
exploded while you were reading the stuff higher up I can only say
that I sympathize entirely. It took me a *long* time to grasp this
stuff, and when working with it I always have the feeling that
something still eludes me. So please take everything above with a
pinch of salt. I am not a lawyer, nor a bidi expert, so it might not
be 100% accurate.

Personally, I think we should all write the same way as it makes life
a *lot* easier. What I think about the people who write top-down
left-to-right, not to mention various creative combinations of
bottom-up with complex page turning instructions I reserve for private
mail. 

| Does it interact with unicode normal form c?

It does not. Unicode normalization was created to deal with a problem
Unicode inherited from legacy character sets. The correct way to write
the Norwegian place name "Ås" in Unicode is "U+0041 U+030A U+0073".
That is, "A combining-ring-above s". However, Unicode inherited U+00C5
(that is, "Å") from ISO 8859-1, which means that you have two ways of
writing the same character. The effect this has on searching and
string matching should be obvious.

So the normalization allows you to turn both ways of writing "Ås" into
the same string. NFC does this by first breaking down precomposed
characters, then composing them. The result would be:

  "U+0041 U+030A U+0073" -> "U+0041 U+030A U+0073" -> "U+00C5 U+0073"
  "U+00C5 U+0073"        -> "U+0041 U+030A U+0073" -> "U+00C5 U+0073"

There's more to it than this, but this is the basic idea. The bidi
control codes do not relate to this in any way, nor does the issue of
bidi layout.
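
To see this in practice, here is a small sketch using
unicodedata.normalize from the Python standard library. The variable
names are only illustrative.

```python
import unicodedata

decomposed = "A\u030as"   # "A", combining ring above, "s"
precomposed = "\u00c5s"   # precomposed A-with-ring, "s"

# The two spellings look the same but are different strings.
print(decomposed == precomposed)  # False

# After NFC both become the precomposed form.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)
print(nfc_decomposed == nfc_precomposed == "\u00c5s")  # True

# The bidi control codes pass through NFC untouched.
with_controls = "\u202acar\u202c"
print(unicodedata.normalize("NFC", with_controls) == with_controls)  # True
```

So normalizing a literal before comparing it neither removes nor
reorders any bidi control codes it contains.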

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >
Received on Friday, 1 March 2002 18:40:11 UTC