A few more comments about "What you need to know about the bidi algorithm and inline markup"

The following comments apply to http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php

1) Several terms are used more or less synonymously (but are not explained anywhere):
- directional property
- directionality
- strongly typed
[For reference: Unicode UAX#9 ( http://www.unicode.org/reports/tr9/ ) uses the term "bidirectional character type" in Table 4, and also the term "bidi class" (e.g. in the text for HL6).]
Here is an example where these terms are used improperly: in "Directional runs", we find:
"When text with different directionality is mixed inline, the bidi algorithm makes a separate directional run out of each sequence of contiguous characters with the same directionality."
IMHO, the characters of a directional run don't have the same directionality; they have the same *direction*.
I suggest the following usage:
- "directionality" for the abstract property of affecting reordering for rendering.
- "bidi class" or "directional type" (but use only one throughout the document) instead of "directional property".
- "direction" for the actual course of progression of one or more characters, as determined by the bidirectional algorithm. In a bidi context, the direction can be LTR or RTL.
The text should be scanned for all occurrences of "direction*" and "type*", which should be replaced by the normalized terms.
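
To make the distinction concrete, here is a minimal sketch (assuming Python and its standard unicodedata module, neither of which the article uses) that looks up the "bidi class" of individual characters. The helper name bidi_classes is only illustrative. The point is that the bidi class is a fixed property of each character, whereas the direction is something the algorithm assigns later.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin a, Hebrew alef, Arabic beh, a space and a European digit.
for ch, cls in bidi_classes("a\u05d0\u0628 1"):
    print(f"U+{ord(ch):04X} {cls}")
```

The Hebrew and Arabic letters have different bidi classes (R and AL), yet both end up with the same direction (RTL) once the algorithm has run.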

2) In "Directional runs", the examples do not include the spaces at language boundaries in any directional run. This might be misleading. If the reason is that neutral characters are not addressed until the next section, then I suggest adding a note saying that the handling of neutral characters is ignored in this section and will be addressed in the next one.

3) In the same section, change "you will change the order of the directional runs" to "you will change the displayed order of the directional runs".

4) In section "Neutral characters", we find "They (spaces and punctuation) are therefore classed as neutral or weak characters."
If I am not mistaken, Unicode uses the term "weak" only for numbers, associated punctuation and special beasts like PDF, BN and NSM, *not* for spaces or general punctuation.
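
This can be checked with a small sketch (again assuming Python's unicodedata module; the function name and the grouping sets are mine, based on my reading of Table 4 in UAX#9). Explicit formatting characters are left in a separate "other" bucket here, since their placement in the table has varied between versions of UAX#9.

```python
import unicodedata

# Grouping of bidi classes, as I read UAX#9 Table 4 (assumption).
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}

def bidi_category(ch):
    """Classify one character as strong, weak, neutral or other."""
    cls = unicodedata.bidirectional(ch)
    if cls in STRONG:
        return "strong"
    if cls in WEAK:
        return "weak"
    if cls in NEUTRAL:
        return "neutral"
    return "other"  # explicit formatting characters and anything unlisted

for ch in [" ", "!", "1", "$", "a"]:
    print(repr(ch), unicodedata.bidirectional(ch), bidi_category(ch))
```

With this grouping, the space (WS) and the exclamation mark (ON) come out as neutral, while the digit (EN) and the dollar sign (ET) come out as weak.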

5) In the same section, the explanation of neutral resolution does not cover the space between DEF and 123 in the string "abc DEF 123" with LTR base direction. That space is at the boundary between directional runs but should not be assigned the base direction.
The current explanation is not incorrect, since it applies to neutrals between "strongly typed characters that have different directionality" and numbers are not strongly typed characters, but this case, which is not rare, is not addressed.
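
The case can be sketched as follows. This is a heavily simplified illustration of rules N1 and N2 of UAX#9, not an implementation of the bidi algorithm: it assumes no embeddings, assumes weak types have already been resolved, and borrows the article's convention that uppercase letters stand in for RTL letters. All function names are hypothetical.

```python
def classify(ch):
    """Toy bidi classes: uppercase stands in for RTL letters (assumption)."""
    if ch.isdigit():
        return "EN"
    if ch.isupper():
        return "R"
    if ch.islower():
        return "L"
    return "N"  # spaces and punctuation are treated as neutral here

def influence(cls):
    """Numbers act like R when they border a run of neutrals (rule N1)."""
    return "R" if cls in ("R", "EN", "AN") else "L"

def resolve_neutrals(classes, base="L"):
    """Resolve each run of 'N' from its neighbours, else use the base direction."""
    out = list(classes)
    i = 0
    while i < len(out):
        if out[i] != "N":
            i += 1
            continue
        j = i
        while j < len(out) and out[j] == "N":
            j += 1
        # At the edges of the text, fall back to the base direction.
        before = influence(out[i - 1]) if i > 0 else base
        after = influence(out[j]) if j < len(out) else base
        out[i:j] = [before if before == after else base] * (j - i)
        i = j
    return out

text = "abc DEF 123"
resolved = resolve_neutrals([classify(c) for c in text], base="L")
for ch, cls in zip(text, resolved):
    print(repr(ch), cls)
```

In this sketch the space between "abc" and "DEF" resolves to L (the base direction), while the space between "DEF" and "123" resolves to R, because the number influences it as if it were an RTL character.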

6) In section "Numbers", change "Note also that, alongside a number, certain otherwise neutral characters, such as currency symbols, will be treated as part of the number rather than a neutral" to "Note also that, alongside a number, certain otherwise neutral characters, such as currency symbols, will be treated as part of the number rather than neutrals".

7) In section "LRM/RLM", the names of the characters should not include a hyphen before the word MARK.
In the same paragraph, there is an extraneous right parenthesis after the RLM character (U+200F).
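
For reference, the official character names can be looked up with a short sketch like this one (assuming Python's standard unicodedata module):

```python
import unicodedata

# Print the code point, Unicode name and bidi class of LRM and RLM.
for ch in ("\u200e", "\u200f"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} {unicodedata.bidirectional(ch)}")
```

The names reported are LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK, with a space rather than a hyphen before MARK.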

8) Same section, "an left-to-right page" should be "a left-to-right page".

9) In Use Case 2, "surround the the book title" should be "surround the book title".

10) In "Additional examples", the phrase "between the last RTL letter 'ب' (on the left)" is not clear to the naïve reader, because the Arabic letter in question is not directly to the left of the exclamation mark. I suggest "the last letter (in logical order) of the Arabic text".

11) In the advanced notes for "Telephone numbers, MAC addresses, etc.", I believe that it is LRM (and not RLM) which can be inserted before the number to solve these cases.

12) In "mirrored characters", the parentheses circled in green confuse the reader. In fact, this example is too convoluted for what it intends to show (character mirroring) because it adds the complexity of two adjacent but logically separate opposite-direction phrases. It would be better to replace it with an example which shows only character mirroring.
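
When choosing characters for a simpler example, a sketch like the following (assuming Python's standard unicodedata module; the helper name is_mirrored is mine) shows which characters carry the mirrored property in the Unicode database:

```python
import unicodedata

def is_mirrored(ch):
    """True if the character is flagged as mirrored in bidirectional text."""
    return unicodedata.mirrored(ch) == 1

for ch in "()[]<>a!":
    print(repr(ch), is_mirrored(ch))
```

Parentheses, square brackets and angle brackets are flagged as mirrored, whereas a letter or an exclamation mark is not, so a single RTL phrase containing one pair of parentheses would be enough to illustrate the point.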

13) The text of the link "Best Practices for Authoring HTML: Handling Right-to-left Scripts" is not identical to the article title (extraneous "Best Practices").


Shalom (Regards), Mati


-----Original Message-----
From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com] 
Sent: Monday, May 14, 2012 11:51 PM
To: Richard Ishida; Aharon Lanin
Cc: Norbert Lindenberg; member-i18n-core@w3.org
Subject: Review of "What you need to know about the bidi algorithm and inline markup"

Hi Richard and Aharon,

I read your article "What you need to know about the bidi algorithm and inline markup" [1], and overall found it very informative and well-written. I'd be happy to see it go to wide review.

Some nits:

- It might help to start the article by telling readers why they would want to read it: It enables them to handle situations correctly where text with different writing directions is mixed within a paragraph or other HTML block. Few people are interested in the Unicode bidi algorithm by itself.

- The article contains a few references to other materials on bidi. It might be a good idea to mention them at the beginning of the article and tell readers which one helps them solve which problem.

- The "how the bidi algorithm works" section should start with a description of the purpose of the algorithm and what it actually does. "Characters and directional typing" are the starting point and should come first, before the base direction as additional input.

- It's a bit sad to see all these inline images. In cases where they're annotated with directions or sequence numbers, images are probably unavoidable, but where it's just plain text using HTML 4 markup, can we really not trust browsers to render it?

- There are a few text samples in Arabic and Hebrew without translations. Curious minds might want to know.

- In a few places there are statements such as "you'd expect". Since I don't read Arabic, Hebrew, Farsi, or Urdu, I don't actually have well-defined expectations. I'd like to know what people who speak these languages expect, or simply what's correct behavior.

- It would be nice to have expected/actual rendering examples indicated with icons, e.g., ☺/☹, as is already done for bad code. That's especially important when one of them is missing, as in the first examples of "embedding changes to the base direction" and "where the algorithm needs help".

- The section title "where the algorithm needs help" is a bit odd, as the preceding text already includes several examples where it needs (and receives) help.

- When upper case English is used instead of Arabic, I'm not quite sure whether word order is the same between the two (translating "introduction to C++" to Japanese, for example, would result in a phrase with "C++" at the beginning). A statement to that effect would help.

- In use case 2, the actual rendering includes "W3C", while source and expected rendering don't.

- The examples with punctuation only deal with trailing punctuation. What about leading punctuation, as in "SPANIARDS GREET EACH OTHER WITH '¡hola!'."?

- "handling unknown text" has the statement "you can use the new bidirectional features in HTML5 markup". That's not the only possible technique, and since this section only introduces the issues in the use cases, the statement isn't needed.

- Use case 4 starts with a rendering without the logical sequence of the text and without stating whether it's the actual or the expected rendering. It would help to provide this information right away (some of it comes later).

- Two notes say "features are not yet widely supported in browsers". Can we point to some browsers that already implement them, so authors know where to start with testing?

- When mentioning open-source code for direction estimation, can we point to some? I know of these:
http://yuilibrary.com/gallery/show/intl-bidi
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/Bidi.html#getBaseDirection(java.lang.CharSequence)
http://icu-project.org/apiref/icu4c/ubidi_8h.html#aeb1fd15743833278cc11906cd5a48aef

- I wasn't familiar with the term "worked example", and even after looking it up I'm not sure it tells readers more than just plain "example".

- For "HTML4", the "4" is a version number, and so there should be a space.

- In the HTML 4 part of "use case 4 solutions", the solution of "applying the appropriate directional information by scripting" means "so that it becomes one of the previous use cases".

- In the "advanced usage notes" of that case, "with a strong left-to-right character" really means "with a strongly directional character of the direction opposite to the intended direction of the inserted text as a whole".

- The following paragraph might make it clearer that it's offering an alternative to what was proposed before: "you can use dir='auto' on the existing element rather than wrapping the injected phrase in bdi".

- Can't the additional examples just be added to the use cases, rather than being exiled to the end of the article?

- In "punctuation at the end of an opposite-direction phrase", the actual rendering has quotes that the expected one doesn't.

- The HTML5 code for this example and the next has "dir='rtl'" while the accompanying text talks about "dir='auto'".

- The text explaining the actual rendering of the MAC address might mention that the strongly directional "aa" switches from RTL to LTR, but there's no way for the reader to tell which parts of :06:01:02 belong to the RTL section at the beginning and which to the LTR section at the end.

Best regards,
Norbert

[1] http://www.w3.org/International/tutorials/new-bidi-xhtml/Overview-inline.en.php

Received on Thursday, 17 May 2012 13:12:22 UTC