draft-iri-bidi-guidelines, comments... from Phillips, Addison on 2011-08-17 (public-iri@w3.org from August 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 17 Aug 2011 07:48:12 -0700
To: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A953236E1@EX-SEA31-D.ant.amazon.com>

Hello Martin, Larry, et al,

I did a read-through of this important document. Generally, it looks good, although some of the examples seem scary. I'm not going to tangle with design decisions here, just offer direct comments on the text, which follow:

1. Section 1, intro, 2nd para. You say:

--
Because of the complex interaction between the logical
representation, the visual representation, and the syntax of a Bidi
IRI, a balance is needed between various requirements.
--

I think a more direct statement of the problem is needed. I would suggest something like:

--
In a non-bidi IRI, the logical and visual order of the various IRI parts and path elements is consistent across the entire string. In a Bidi IRI, the logical representation, visual representation, and syntax might not be the same, making it more difficult for users to read, enter, or evaluate the whole. Thus a balance is needed between various requirements.
--

2. Section 2, 1st para. A requirement says

"bidirectional IRIs MUST be in full logical order"

"full logical order" is not a term I'm familiar with. It suggests there is a partial logical order, for example. Just say "logical order".

3. Section 2, 2nd para. There is a requirement:

--
Bidirectional IRIs MUST be rendered in the same way as they would be if they were in a left-to-right embedding;
--

Properly, *all* IRIs must follow this rule.

4. Section 2, 3rd para. The following part of this paragraph might seem to the casual reader to be at odds with the MUSTard above:

--
Also, a bidirectional relative IRI reference that only contains
strong right-to-left characters and weak characters and that starts
and ends with a strong right-to-left character and appears in a text
with right-to-left base directionality (such as used for Arabic or
Hebrew) and is preceded and followed by whitespace and strong
characters does not need an embedding.
--

This IRI is drawn using a right-to-left visual order. Thus "../ARABIC" would be drawn as "CIBARA/..". That's true even if one embeds the string in an LRE. However, I think that the point you are illustrating is that the described IRI is not a "Bidi IRI"; it is a unidirectional one. This should be made much clearer.
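
To make the unidirectional/bidirectional distinction concrete, here is a minimal Python sketch that collects the strong directions present in a string using the standard `unicodedata` module. The function names `strong_directions` and `is_unidirectional` are hypothetical, and the rule used (at most one strong direction present) is an assumption for illustration, not the draft's definition of a Bidi IRI.

```python
import unicodedata


def strong_directions(text):
    """Return the set of strong directions ("L" and/or "R") found in text."""
    found = set()
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            found.add("L")
        elif bidi_class in ("R", "AL"):
            # Hebrew letters are class R, Arabic letters are class AL;
            # both are strong right-to-left.
            found.add("R")
    return found


def is_unidirectional(text):
    """Assumed rule: a string with at most one strong direction is unidirectional."""
    return len(strong_directions(text)) <= 1


# "../" followed by three Hebrew letters: only strong RTL plus non-strong characters.
relative_ref = "../\u05d0\u05d1\u05d2"
print(strong_directions(relative_ref), is_unidirectional(relative_ref))
```

Under this assumed rule, a relative reference such as the one above reports only the right-to-left direction, whereas an absolute IRI with an ASCII scheme and a Hebrew path segment reports both.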

5. Section 2, 4th para. I think the following paragraph is not prescriptive enough:

--
In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be
sufficient to force the correct display behavior. However, the
details of the Unicode Bidirectional algorithm are not always easy to
understand. Implementers are strongly advised to err on the side of
caution and to use embedding in all cases where they are not
completely sure that the display behavior is unaffected without the
embedding.
--

You are sending implementers off to read UAX#9 and advising caution without stating what needs doing. Perhaps instead say something like:

--
The Unicode Bidirectional Algorithm is complex and can be difficult to understand. Implementers are advised to err on the side of caution and to provide an additional level of embedding in all cases where it is unclear if the display behavior will be affected without the embedding. Adding extra levels of left-to-right embedding does not harm or change the display of an IRI. In plain text, placing U+200E, LEFT-TO-RIGHT MARK (LRM) before the first character in the IRI is usually sufficient to force the correct display behavior.
--
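
As an illustration of the two options in the suggested text, the following Python sketch wraps an IRI in a left-to-right embedding (LRE ... PDF) or prefixes it with LRM when placing it in plain text for display. The helper names are hypothetical. The controls are added only to the displayed text and are stripped again before the string is used as an IRI.

```python
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed_ltr(iri):
    """Wrap an IRI in a left-to-right embedding for display in plain text."""
    return f"{LRE}{iri}{PDF}"


def prefix_lrm(iri):
    """Lighter-weight alternative: place LRM before the first character."""
    return f"{LRM}{iri}"


def strip_display_controls(displayed):
    """Remove the surrounding display controls to recover the IRI itself."""
    return displayed.strip(LRM + LRE + PDF)


iri = "http://example.com/\u05d0\u05d1\u05d2"
shown = embed_ltr(embed_ltr(iri))  # an extra level of embedding is tolerated
assert strip_display_controls(shown) == iri
```

The embedding form is the cautious default in this sketch; the LRM form is the "usually sufficient" shortcut mentioned above.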

6. Section 2, 5th para. This paragraph reads:

--
The Unicode Bidirectional Algorithm ([UNI9], section 4.3) permits
higher-level protocols to influence bidirectional rendering. Such
changes by higher-level protocols MUST NOT be used if they change the
rendering of IRIs.
--

I think this isn't quite right. You've already given the example of using HTML's @dir. The point of that would be to change the rendering (appropriately). You are trying to prohibit changing the directionality in a "bad" way. Perhaps:

--
Such changes by higher-level protocols MUST NOT be applied to sub-sections of an IRI or be used to change the base direction of an IRI containing multiple levels of embedding to right-to-left.
--

7. Section 2, last para. This sentence suggests that the bidi controls might sometimes be transmitted, even though the gist of the sentence is that they are not part of the IRI:

--
The bidirectional formatting characters that may be used before or
after the IRI to ensure correct display are not themselves part of
the IRI.
--

8. Section 3. The 'iquery' component in particular seems like one that cannot reasonably be restricted in the way recommended by the SHOULDs. I think it should be called out specifically, even though it often falls under the rubric of "IRIs that are never presented visually". The various segment types can also be troublesome, especially in a RESTful world in which the path is computed.

9. Section 5. Example 8. Why isn't this IRI allowed? I know why it is a problem, but I can't see any mechanism that could prevent it---or even that it is desirable to prevent path elements from ending/starting with weakly directional characters.
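
To show how easily such path elements arise, here is a rough Python sketch that flags path segments whose first or last character is not strongly directional. The function name is hypothetical, and the check lumps weak and neutral bidi classes together as "not strong", which is an assumption made for brevity. It only detects such segments; it does not suggest a way to prevent them.

```python
import unicodedata

STRONG_CLASSES = {"L", "R", "AL"}


def segments_with_weak_edges(path):
    """Return the path segments that start or end with a non-strong character."""
    flagged = []
    for segment in path.split("/"):
        if not segment:
            continue  # skip empty segments produced by leading or doubled slashes
        first = unicodedata.bidirectional(segment[0])
        last = unicodedata.bidirectional(segment[-1])
        if first not in STRONG_CLASSES or last not in STRONG_CLASSES:
            flagged.append(segment)
    return flagged


# A numeric segment and a Hebrew segment ending in a digit are both flagged.
print(segments_with_weak_edges("/abc/123/\u05d0\u05d11"))
```

Any path that contains numeric identifiers, which is common when paths are computed, would be flagged by a check like this.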

10. Section 7 (Security). There are some obvious spoofing mechanisms that can be assembled using Bidi IRIs.

Minor editorial nits:

Section 1, intro. Replace "UCS" with "Unicode" for clarity.

Section 1, intro. The relationship between logical and display representation is only sometimes non-trivial. One might also quibble that the logical order is not used for reading (I understand what you're getting at, but it isn't necessary to present reading or spelling here).

Section 1.1, notation. "other letters" should be "other characters", or, perhaps, "other strongly left-to-right characters".

Section 1.1, notation. Arabic and Hebrew are the main bidirectional modern scripts, but others exist, especially historically. Instead of "upper case letters represent Arabic or Hebrew letters that are written right to left", say, perhaps, "upper case letters represent strongly right-to-left characters, such as those used to write Arabic and Hebrew".

Section 2, 4th para. Use of non-normative "may" in "In some other cases, using U+200E, LEFT-TO-RIGHT MARK (LRM), may be sufficient to force the correct display behavior." Consider avoiding the RFC 2119 keywords when non-normative (replace with "can", for example). I know some people are okay with non-normative use of the 2119 words: it's a question of taste. I point it out in case it is not your intention.

Section 2, last para. This prohibition should probably list the code points and character names. Perhaps a table?

--
IRIs MUST NOT contain bidirectional formatting characters
(LRM, RLM, LRE, RLE, LRO, RLO, and PDF).
--
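
As a sketch of what such a table could contain, the Python snippet below lists the seven abbreviations from the quoted sentence with their code points and looks up each character name with the standard `unicodedata` module. The helper names `formatting_table` and `find_formatting_characters` are hypothetical; the second shows how the prohibition might be checked against a candidate IRI.

```python
import unicodedata

# Abbreviation -> character, in the order used by the quoted sentence.
BIDI_FORMATTING = {
    "LRM": "\u200e",
    "RLM": "\u200f",
    "LRE": "\u202a",
    "RLE": "\u202b",
    "LRO": "\u202d",
    "RLO": "\u202e",
    "PDF": "\u202c",
}


def formatting_table():
    """Return (abbreviation, code point, character name) rows."""
    return [
        (abbr, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for abbr, ch in BIDI_FORMATTING.items()
    ]


def find_formatting_characters(iri):
    """Return (index, abbreviation) for each prohibited character in the IRI."""
    by_char = {ch: abbr for abbr, ch in BIDI_FORMATTING.items()}
    return [(i, by_char[ch]) for i, ch in enumerate(iri) if ch in by_char]


for abbr, code_point, name in formatting_table():
    print(f"{abbr:4} {code_point}  {name}")
```

An IRI for which `find_formatting_characters` returns an empty list contains none of the listed characters.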

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

Received on Wednesday, 17 August 2011 14:48:37 UTC