- From: Lars Marius Garshol <larsga@garshol.priv.no>
- Date: 01 Mar 2002 23:51:50 +0100
- To: <www-rdf-interest@w3.org>
* Jeremy Carroll
|
| RDF simply does not support xhtml:dir and trying to get it to do so
| is mistaken.

If you want to know the base direction of an RDF literal you will need *some* way to assert the base direction of the literal. Whether you use an RDF property or something else is not really all that important, so long as you choose one way to do it. I read Croome as trying to do it with an RDF property (for text which did not need that information at all).

| Another way of putting bidi in RDF is to put unicode bidi markers
| into the literal. Anyone know how that works?

Well, actually xhtml:dir performs two different tasks: specifying the base embedding level of a paragraph, and providing additional information about the internal structure of a paragraph. In addition comes the <bdo> element. These three functions (2 for xhtml:dir, one for <bdo>) have counterparts in the control codes. I'll try to explain.

To take an example, suppose you were to have an RDF literal stored as follows (lower-case is Latin, uppercase Arabic, according to convention):

  "car is THE CAR in arabic"

This text would be correctly displayed by bidi-aware code, which would just follow the rules in the bidi algorithm and analyze the text as follows (numbers are bidi embedding levels):

  "car is THE CAR in arabic"
   000000011111110000000000

This text would therefore be displayed as follows:

  "car is RAC EHT in arabic"

The logic is that you read LTR while you're reading English, switch to RTL when reading Arabic, then back again when you continue. So this is entirely correct, and no codes are needed.

Now suppose that you have the following literal:

  "'car is THE CAR in arabic,' THE ENGLISHMAN SAID"

Here the bidi algorithm would get into trouble unless extra information were provided somehow. It assumes that the base direction of this paragraph is LTR, since the first character with strong directionality (the 'c') is LTR.

(Base direction and embedding direction do *not* refer to the order in which the characters that make up a word are written, but to the order in which the *words* are written. (Well, roughly, anyway.))

The algorithm would therefore end up displaying the literal as:

  "'car is RAC EHT in arabic,' DIAS NAMHSILGNE EHT"

which is screwed up. You'd read this as "THE ENGLISHMAN SAID ', in arabic THE CAR car is'".

If you tell the system that the overall direction of text in this paragraph is actually RTL the algorithm will do better (I'll return to the remaining problem below):

  "DIAS NAMHSILGNE EHT ',in arabic RAC EHT car is'"

This corresponds to having:

  <p xhtml:dir="rtl">'car is THE CAR in arabic,' THE ENGLISHMAN SAID</p>

in the XHTML. Providing the information in this way corresponds to using what the bidi algorithm calls "a higher-level protocol", which basically means that something outside the text itself is telling the algorithm what the base direction is. If no higher-level protocol provides any such information the bidi algorithm will analyze the text and use the direction of the first character with strong directionality as the base direction.

If your RDF implementation does not support this you could achieve the same result by putting U+200F (right-to-left mark) first in the literal. This character is strongly RTL (and would thus cause base direction to be RTL), but is invisible when displayed.

The problem with displaying the sentence as:

  "DIAS NAMHSILGNE EHT ',in arabic RAC EHT car is'"

is that it ignores that the part in '...' is English, and that its overall direction is LTR. That is, "car is" should appear *first*, not last, since this whole stretch is an English sentence. It should just appear in normal English order, with the Arabic phrase in the middle reversed.
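
The first-strong rule and the U+200F trick described above can be sketched in Python. This is only an illustration, not a bidi implementation: the function name base_direction is made up for the example, it ignores the isolate handling in the full algorithm, and real Arabic text stands in for the upper-case convention used in this message.

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK: strongly RTL, invisible when rendered


def base_direction(text, default="ltr"):
    """Guess the paragraph direction from the first strong character.

    Hypothetical helper for illustration only; it looks at the Unicode
    bidi class of each character and stops at the first strong one.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (incl. Arabic letters)
            return "rtl"
    return default                       # no strong character found


# An English quotation containing an Arabic word, followed by an Arabic phrase.
literal = (
    "'car is \u0627\u0644\u0633\u064a\u0627\u0631\u0629 in arabic,' "
    "\u0642\u0627\u0644 \u0627\u0644\u0625\u0646\u062c\u0644\u064a\u0632\u064a"
)

print(base_direction(literal))        # ltr, because 'c' is the first strong character
print(base_direction(RLM + literal))  # rtl, because the mark itself is strongly RTL
```

Prefixing the mark changes what a first-strong analysis sees without changing what a reader sees, which is why it can substitute for a higher-level protocol.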

To achieve this we have to inform the bidi algorithm that the English quote actually introduces a new embedding level (as it's called) which has an overall direction (from word to word) that's different from the base direction. With control codes this is done as follows:

  "'<U+202A>car is THE CAR in arabic<U+202C>,' THE ENGLISHMAN SAID"

The first code is the left-to-right embedding code, which says: here starts an LTR stretch of text. The second code is the pop directional formatting code (also known as PDF), which is basically an end tag with no element type name. (It also ends a right-to-left embedding, as well as the overrides.)

In XHTML you'd do it like this:

  <p ...>'<span xhtml:dir="ltr">car is THE CAR in arabic</span>,' THE ENGLISHMAN SAID</p>

In both cases the end result would be the correct display:

  "DIAS NAMHSILGNE EHT ', car is RAC EHT in arabic'"

The <bdo> element (and the corresponding U+202D and U+202E) is used to indicate text that is stored in visual order, rather than logical order. This could happen because of conversion from legacy character encodings, or it could be necessary to format multi-language codes that would be screwed up if run through the bidi algorithm. So they basically turn off bidi for a stretch of text.

If you never get so far as to read this sentence because your brain exploded while you were reading the stuff higher up I can only say that I sympathize entirely. It took me a *long* time to grasp this stuff, and when working with it I always have the feeling that something still eludes me. So please take everything above with a pinch of salt. I am not a lawyer, nor a bidi expert, so it might not be 100% accurate.

Personally, I think we should all write the same way as it makes life a *lot* easier. What I think about the people who write top-down left-to-right, not to mention various creative combinations of bottom-up with complex page turning instructions I reserve for private mail.

| Does it interact with unicode normal form c?

It does not.
Unicode normalization was created to deal with a problem Unicode inherited from legacy character sets. The correct way to write the Norwegian place name "Ås" in Unicode is "U+0041 U+030A U+0073". That is, "A combining-ring-above s". However, Unicode inherited U+00C5 (that is, "Å") from ISO 8859-1, which means that you have two ways of writing the same character. The effect this has on searching and string matching should be obvious.

So the normalization allows you to turn both ways of writing "Ås" into the same string. NFC does this by first breaking down precomposed characters, then composing them. The result would be:

  "U+0041 U+030A U+0073" -> "U+0041 U+030A U+0073" -> "U+00C5 U+0073"
  "U+00C5 U+0073"        -> "U+0041 U+030A U+0073" -> "U+00C5 U+0073"

There's more to it than this, but this is the basic idea. The bidi control codes do not relate to this in any way, nor does the issue of bidi layout.

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >
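
The "Ås" normalization example above can be checked with Python's standard unicodedata module. This is a small sketch rather than part of the original explanation; the helper name codepoints is invented for the example.

```python
import unicodedata

decomposed = "A\u030as"   # U+0041 U+030A U+0073
precomposed = "\u00c5s"   # U+00C5 U+0073


def codepoints(s):
    """Render a string as a list of U+XXXX code points (illustrative helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# The two spellings look the same but do not compare equal as raw strings.
print(decomposed == precomposed)  # False

# NFC maps both spellings to the same precomposed form.
print(codepoints(unicodedata.normalize("NFC", decomposed)))   # U+00C5 U+0073
print(codepoints(unicodedata.normalize("NFC", precomposed)))  # U+00C5 U+0073

# A bidi control code such as U+200F passes through NFC untouched.
marked = "\u200f" + decomposed
print(codepoints(unicodedata.normalize("NFC", marked)))       # U+200F U+00C5 U+0073
```

The last line illustrates the point that normalization and the bidi control codes do not interfere with each other: the mark is neither removed nor combined with anything.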
Received on Friday, 1 March 2002 18:40:11 UTC