HTML 4.0 draft - dirlang.html

Following are my comments to the chapter "Language information and text
direction".

Please excuse my abrupt style. I am in a hurry to get it out. 

Most of the comments are minor, editorial in nature.

1.

>Interpretation of language codes 
>
>In the context of HTML, a language code should be interpreted by
>user agents as a hierarchy of tokens rather than a single token.
>When a user agent adjusts rendering according to language
>information (say, by comparing style sheet language codes and lang
>values), it should always favor an exact match, but should also
>consider matching primary codes to be sufficient. Thus, if the lang
>attribute value of "en-US" is set for the HTML element, a user agent
>should prefer style information that matches "en-US" first, then the
>more general value "US".

The last "US" should be "en".
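
As a rough illustration of the fallback order the corrected sentence implies (exact match first, then the primary code), here is a minimal Python sketch. The function name match_language and the list of available codes are hypothetical, not taken from the draft.

```python
def match_language(lang, available):
    """Return the entry of `available` that best matches `lang`, or None.

    Hypothetical helper: tries the full code first, then drops trailing
    subtags one at a time, so "en-US" falls back to "en" and never to "US".
    """
    tokens = lang.lower().split("-")
    by_lower = {code.lower(): code for code in available}
    while tokens:
        candidate = "-".join(tokens)
        if candidate in by_lower:
            return by_lower[candidate]
        tokens.pop()  # drop the most specific remaining subtag
    return None


print(match_language("en-US", ["en", "en-US"]))  # en-US
print(match_language("en-US", ["en", "fr"]))     # en
print(match_language("en-US", ["US"]))           # None
```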

2.
>Specifying the direction of text: the dir attribute 
>
>Attribute definitions
>
>dir = LTR | RTL
>
>Specifies the default direction for directionally weak or neutral
>text in the element's content (left-to-right or right-to-left) in
>this document. Possible values:
>
>•LTR: Left-to-right text.
>•RTL: Right-to-left text.

This document uses various terms to describe the same thing, namely
the base direction (aka global or block direction). See Unicode 2.0,
page 3-18, 3rd paragraph. In this case, the term "default" has been
used. Later on, the terms "initial" and "primary" are used.

I suggest using "base" throughout.

3. The base direction is not the "default direction for directionally
weak or neutral text". It is the direction to be assigned to neutral
text surrounded by text of differing directions. Neutral text
surrounded by text of the same direction takes that direction. For
example, a space between two Hebrew words becomes RTL.
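
The rule can be sketched in Python as follows. This is a deliberately simplified illustration, not the full Unicode algorithm: it ignores digits and explicit embeddings, and it treats the start and end of the text as having the base direction. The names strong_direction and resolve_neutral are hypothetical.

```python
import unicodedata


def strong_direction(ch):
    """Return "ltr" or "rtl" for a strong character, None otherwise."""
    cls = unicodedata.bidirectional(ch)
    if cls == "L":
        return "ltr"
    if cls in ("R", "AL"):
        return "rtl"
    return None


def resolve_neutral(text, index, base):
    """Direction given to the neutral character at text[index].

    Simplified: look for the nearest strong character on each side,
    falling back to the base direction at either end of the text.
    """
    before = next(
        (d for d in map(strong_direction, reversed(text[:index])) if d), base
    )
    after = next(
        (d for d in map(strong_direction, text[index + 1:]) if d), base
    )
    # Same direction on both sides wins; otherwise the base direction applies.
    return before if before == after else base


# A space between two Hebrew letters, in an LTR block:
print(resolve_neutral("\u05d0 \u05d1", 1, "ltr"))  # rtl
# A quotation mark between a Hebrew and a Latin letter:
print(resolve_neutral('\u05d0"a', 1, "ltr"))       # ltr
print(resolve_neutral('\u05d0"a', 1, "rtl"))       # rtl
```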


4.
>In addition to specifying the primary language of a document,
>authors may need to specify the default direction of pieces of text
>or the text in the entire document.

For documents whose primary language is Hebrew, Arabic, etc., the
author must specify the base direction of the document as RTL. As
noted below, the lang attribute does not imply the dir attribute.


5.
>The [UNICODE] specification assigns directionality to Unicode
>characters and defines a (complex) algorithm for determining the
>proper directionality of text. If a document does not contain a
>displayable right-to-left

the word "character" is missing here

>, a conforming user agent is not required to apply the [UNICODE]

a space is missing here

>bidirectional algorithm. If a document contains a right-to-left
>character, and if the user agent chooses to display that character,
>the user agent must use the bidirectional algorithm.

I suggest adding "on the entire document".


6.
>Introduction to the bidirectional algorithm 

>Suppose the predominant language of the document containing this
>paragraph is English (left-to-right text).

I suggest adding: The base direction is left to right.

>The correct presentation of this line would be:

>If, on the other hand, the predominant language of the document is
>Hebrew (right-to-left direction),

I suggest adding: the base direction is right to left,


7. I think it should be mentioned that the Unicode bidi algorithm is
applied to each elementary block of text.


8.
>Setting the direction of embedded text 

>The [UNICODE] bidirectional algorithm automatically reverses embedded
>character sequences according to their inherent directionality (as
>illustrated by the previous examples). However, only one level of
>embedding can be accounted for.

This isn't precise. Three levels are provided automatically in the
case of a number within RTL text within LTR text.
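
To make the three levels concrete, here is a small Python sketch that assigns implicit levels for an LTR base direction only. It is an assumption-laden simplification of the implicit rules as I understand them from the Unicode bidirectional algorithm: neutrals are left unresolved, and a European digit takes its level from the last strong character before it. The name implicit_levels_ltr is hypothetical.

```python
import unicodedata


def implicit_levels_ltr(text):
    """Implicit levels for text whose base direction is LTR (level 0).

    Simplified sketch: L -> 0, R/AL -> 1, a European number after RTL
    text -> 2, a European number after LTR text -> 0. Other characters
    (neutrals, separators) are left as None.
    """
    levels = []
    last_strong = "L"  # the start of the text counts as LTR here
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            last_strong = "L"
            levels.append(0)
        elif cls in ("R", "AL"):
            last_strong = "R"
            levels.append(1)
        elif cls == "EN":
            levels.append(0 if last_strong == "L" else 2)
        else:
            levels.append(None)
    return levels


# Latin text, then Hebrew text, then a number following the Hebrew:
print(implicit_levels_ltr("abc \u05d0\u05d1 12"))
# [0, 0, 0, None, 1, 1, None, 2, 2]
```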


9.
>Bidirectionality and character encoding According to [RFC1555] and
>[RFC1556], there are special conventions for the use of "charset"
>parameter values to indicate bidirectional treatment in MIME mail, in
>particular to distinguish between visual, implicit, and explicit
>directionality. The parameter value "iso-8859-8" (for Hebrew) denotes
>visual encoding, "iso-8859-8-i" denotes implicit bidirectionality,
>and "iso-8859-8-e" denotes explicit directionality.
>
>Because HTML uses the full Unicode bidirectionality algorithm,
>conforming documents must be labeled as "iso-8859-8-e". Implicit
>bidirectionality is part of the full Unicode algorithm, so the values
>"iso-8859-8-i" may also be accepted, but should not be used.

Explicit directionality was meant to be ISO 6429. The -i suffix
denotes the Unicode algorithm. See RFC 1556.

The value that should be used is "iso-8859-8-i".
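
For reference, the three labels and the treatment each denotes can be written as a small lookup. This is only a sketch of the mapping described in the quoted text and above; the names BIDI_MODE_BY_CHARSET, bidi_mode and uses_unicode_bidi are hypothetical.

```python
# Charset label -> bidirectional treatment, per the quoted paragraph.
BIDI_MODE_BY_CHARSET = {
    "iso-8859-8": "visual",
    "iso-8859-8-i": "implicit",
    "iso-8859-8-e": "explicit",
}


def bidi_mode(charset):
    """Return "visual", "implicit" or "explicit", or None if unknown."""
    return BIDI_MODE_BY_CHARSET.get(charset.strip().lower())


def uses_unicode_bidi(charset):
    """True for the label that denotes implicit (Unicode) bidirectionality."""
    return bidi_mode(charset) == "implicit"


print(bidi_mode("ISO-8859-8-I"))          # implicit
print(uses_unicode_bidi("iso-8859-8-e"))  # False
```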


10.
>The other characters, lrm and rlm, are used to disambiguate
>directionality of directionally neutral characters. For example, if a
>double quotation mark comes between an Arabic and a Latin letter, the
>direction of the quotation mark is not clear (is it quoting the
>Arabic text or the Latin text?). The lrm and rlm characters have a
>directional property but no width and no word/line break property.
>Please consult [UNICODE] for more details.

I suggest "force" rather than "disambiguate". It is not at all
ambiguous, it just isn't always what one may expect. In the case of
the example, the direction of the quotation mark is the base direction
of the block. If this is not the desired behavior, an lrm or an rlm
could be used to change this or to make sure it does not depend on the
base direction.
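
The properties involved can be inspected with Python's unicodedata module. The snippet below is only an illustration: it shows that the two marks are strong format characters and that placing an rlm next to the quotation mark leaves the mark with RTL characters on both sides, whatever the base direction. The helper name neighbour_classes is hypothetical.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK

print(unicodedata.bidirectional(LRM))  # L
print(unicodedata.bidirectional(RLM))  # R
print(unicodedata.category(LRM))       # Cf
print(unicodedata.bidirectional('"'))  # ON


def neighbour_classes(text, index):
    """Bidi classes of the characters on either side of text[index]."""
    return (
        unicodedata.bidirectional(text[index - 1]),
        unicodedata.bidirectional(text[index + 1]),
    )


plain = '\u0628"a'          # Arabic letter, quotation mark, Latin letter
forced = '\u0628"' + RLM + "a"  # same, with an rlm after the quotation mark

print(neighbour_classes(plain, 1))   # ('AL', 'L')
print(neighbour_classes(forced, 1))  # ('AL', 'R')
```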


11.
>Reversed character glyphs: The bidirectional algorithm reverses the
>presentation of a well-defined set of characters such as parentheses
>(see [UNICODE], table 4-7). Except for these characters,
>bidirectionality processing leaves the shape of each glyph
>unaffected. Thus, if you wanted to display the word "MURDER" as it
>would be seen in a mirror (right-to-left character order and reversed
>glyphs), you could use a BDO element with the dir attribute to set
>the text direction to right-to-left order, e.g.,
>
><BDO class="mirror" dir="rtl">MURDER</BDO>
>
>and the class value "mirror" with a matching rule in the style sheet
>to select a special font that displays characters with the reversed
>glyphs.

I don't understand the purpose of this clause or its relevance.


12.
>Undisplayable characters 
>
>User agents may not be able to render meaningfully all character
>values, for instance, because of the lack of an appropriate font, or
>because a character has a value which is inexpressible with the
>internal character encoding.
>
>Because there are many different things that can be done in such a
>case, this document does not prescribe any specific behavior.
>Depending on the implementation, this may also be handled by the
>underlying display system and not the application itself. This
>specification recommends the following behavior for user agents:
>
>1. Adopt a clearly visible, but unobtrusive mechanism to alert the
>user of missing resources.
>
>2. If the user agent provides a numeric representation of missing
>characters, the hexadecimal (not decimal) form is preferable as this
>is the form used in character set standards (see [ERCS]).

I would suggest a third behavior (and suggest it be the first in order
of priority): Use a language specific method if such is customarily
used. For instance, in English, accents in foreign words are just
dropped if the font or device does not support them. In French,
accents on capital letters are often dropped. In German, ae, oe, ue
and ss are often used when the correct letters are not available. And
in Hebrew, points and accents are simply dropped if they cannot be
rendered.

By "dropped" I imply that the user is not informed and there is no
visible indication of the unrenderable character.
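
A minimal Python sketch of such a fallback, assuming the language is known from the lang attribute. The names GERMAN_FALLBACK, drop_marks and fallback are hypothetical, and the German table covers only the letters mentioned above.

```python
import unicodedata

# Customary German replacements when the proper letters are unavailable.
GERMAN_FALLBACK = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
}


def drop_marks(text):
    """Silently drop accents and points (nonspacing marks)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so that unrelated characters are not left decomposed.
    return unicodedata.normalize("NFC", kept)


def fallback(text, lang):
    """Language-specific fallback for text that cannot be rendered as is."""
    if lang == "de":
        text = "".join(GERMAN_FALLBACK.get(ch, ch) for ch in text)
    return drop_marks(text)


print(fallback("caf\u00e9", "en"))       # cafe
print(fallback("Gr\u00f6\u00dfe", "de")) # Groesse
```

With lang set to "he", the same function drops Hebrew points and leaves the base letters untouched.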



--

Jonathan Rosenne
JR Consulting
P O Box 33641, Tel Aviv, Israel
Phone: +972 50 246 522 Fax: +972 9 956 7353
http://ourworld.compuserve.com/homepages/Jonathan_Rosenne/

Received on Saturday, 12 July 1997 20:49:29 UTC