RE: Unicode controls vs markup for text direction from Tex Texin on 2008-02-28 (www-international@w3.org from January to March 2008)

From: Tex Texin <tex@yahoo-inc.com>
Date: Thu, 28 Feb 2008 01:17:20 -0800
To: "Richard Ishida" <ishida@w3.org>
Cc: "WWW International" <www-international@w3.org>
Message-ID: <012AB2B223CB3F4BB846962876F47217EBAE5D@SNV-EXVS08.ds.corp.yahoo.com>
Hi Richard,
 
Thanks for your efforts here. I haven't had time to review in detail and prepare an equally detailed response. We should perhaps set aside some time to work this through offlist and provide some relevant examples.

1) The point about img is that we would like its placement in the flow of the source to map onto the onscreen flow. We could say the same thing about embedded spans or other elements. So the question becomes whether the Unicode direction should be allowed to influence the markup direction.
It is a fundamental question. The argument for not using Unicode bidi controls was the potential for conflict.
Any change should provide guidance for how to deal with the conflict and suggest a prioritization.

2) Some elements can provide scrollbars. In a <pre> for example with long lines of text, the browser will show scrollbars. 
If the text is all right to left, it might be nice for the element to recognize this and place the scrollbars as if the attribute dir=rtl had been specified.
<pre>TXET CIBARA</pre> would behave as <pre dir=rtl>TXET CIBARA</pre>.
It is a similar case to the img example.
(I am not proposing we should do this, just offering it for examination.)
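
For examination only, the detection step in the <pre> example could be sketched roughly as follows. This is a minimal Python sketch, not a proposal for browser behavior: the function name inferred_dir is hypothetical, and it assumes that "all right to left" means every strongly directional character has Unicode bidi class R or AL.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # strong right-to-left bidi classes
LTR_CLASSES = {"L"}         # strong left-to-right bidi class

def inferred_dir(text):
    """Return 'rtl' or 'ltr' if all strong characters agree, else None."""
    classes = [unicodedata.bidirectional(ch) for ch in text]
    strong = [c for c in classes if c in RTL_CLASSES or c in LTR_CLASSES]
    if not strong:
        return None  # only digits, spaces, punctuation: nothing to infer
    if all(c in RTL_CLASSES for c in strong):
        return "rtl"
    if all(c in LTR_CLASSES for c in strong):
        return "ltr"
    return None  # mixed content: leave the direction to markup
```

Mixed or neutral-only content deliberately returns None here, so the markup direction would still decide those cases.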

3) I am not sure why you worry about invisibility. An editor can make them visible.

4) With respect to search: if I am populating pages from a database or other sources of plain text that include control codes, and the control codes are removed on insertion into markup, it is not clear to me that the text keeps the same order. Yes, the segments are still logically written, but after embedding in markup they might need to be shuffled to give the same visual sequence. We should walk through some before-and-after examples.

5) My argument turns on the fact that
A) editors today support bidi and provide wysiwyg editing. Therefore markup directional attributes are not needed at the level of string editing, and they are a liability since they prevent editing the source in a wysiwyg fashion.

B) Having to replace the controls with markup is also needless work.

C) There are places where we simply cannot put rtl markup so controls are needed. (eg inside attributes such as title).
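
To illustrate point C, here is a small Python sketch of building a title attribute whose value carries its own direction. The helper name rtl_title_attr is made up for the example; RLE (U+202B) and PDF (U+202C) are the standard Unicode embedding controls.

```python
from html import escape

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def rtl_title_attr(text):
    """Return a title="..." attribute whose value is wrapped in RLE ... PDF."""
    # escape() handles &, <, > and quotes; it leaves the controls untouched.
    return f'title="{escape(RLE + text + PDF, quote=True)}"'
```

No element can be placed inside the attribute value, so the controls are the only way to mark the direction of that string.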

I would turn your argument about having to deal with markup in databases around.
We have to deal with bidi controls, so allow them in markup.
In fact, we should define how they behave in markup and how to resolve conflicts between markup and controls.

Best,

tex



-----Original Message-----
From: Richard Ishida [mailto:ishida@w3.org] 
Sent: Wednesday, February 20, 2008 5:40 AM
To: Tex Texin
Cc: 'WWW International'
Subject: Unicode controls vs markup for text direction

Tex, and others,

For the record, I have never felt completely comfortable with the idea of banning the use of control codes outright, and I've usually tried not to be dogmatic about it, but there are a number of ways in which I think the idea of replacing markup with control codes is not necessarily a straightforward solution. 

I also need to get a clearer idea of what exactly you are proposing, Tex.

So let me try to outline what I think you may be suggesting, and some of my concerns with some worked examples.

I'm afraid this necessarily became a little long.





I think we agree that markup is the most effective way of setting directional context at a level above the 'paragraph', or for "higher levels of page layout and flow control" as you put it, eg. on an html tag or a table tag. Not only is markup effective at this level, but control codes are defined to work only on 'paragraphs'. Part of the difficulty in dealing with marked up content is defining what constitutes a paragraph.


Using control codes at the paragraph level
-----------------------------------------
Suppose we have a div tag in a LTR document that has RTL content, ie. a different directional context from the markup above it in the hierarchy.
This could be an entry in an RSS feed, or one blog post in a list of posts on a multilingual blog, or a number of other things, but let's for the sake of discussion, start small with a div like the following in a LTR document.

	 <div lang="ar"><img/> TXET CIBARA</div>

(Note that you'd need some indication of directional context in order to make the image appear to the left of the text. Note also that this example assumes that your editor displays source text intelligently.)

If using markup, you'd put a dir='rtl' on the <div> tag to set this up.

I think you are proposing that it would be better to put an RLE immediately after <div ...ar"> and a PDF immediately before </div>, ie. at the start and end of the enclosed text. (If not, skip to the next section.)

If the div or the content of the div is extracted into a database, these codes would go with it.

If you then decided to enclose the text in a <p> element (still inside the div), you would have to make sure that the <p> tags went between the control codes and the <div> tags. Here's the first issue: that's risky, given that these codes may be invisible. There is a good possibility that people, especially if they don't know much about bidi (eg. non-Arabic speakers working on the file), might easily screw things up here.

This issue is down to the invisibility of the controls. In theory you could counter that by using escaped characters. However, remember that editors currently make life difficult for you if you use escapes in bidi text, by splitting off the punctuation involved in the escape syntax. Also, I've found that editing environments or processing tools often convert these escapes to characters automatically.

Imagine next that the text inside the div was much longer, and we decided to split it into, say, 6 separate <p>'s.  Now we'd have to introduce control codes at the beginning and end of each paragraph to be consistent.  This introduces another issue: this is a lot of work, compared to just leaving the dir on the <div> that surrounds all the content and never having to change it.
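
The difference in effort can be sketched as follows. This is only an illustration in Python under stated assumptions: the function names with_controls and with_markup are hypothetical, and the paragraph strings stand in for real Arabic text.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def with_controls(paragraphs):
    """Control-code approach: every <p> needs its own RLE ... PDF pair."""
    body = "".join(f"<p>{RLE}{p}{PDF}</p>" for p in paragraphs)
    return f'<div lang="ar">{body}</div>'

def with_markup(paragraphs):
    """Markup approach: one dir attribute on the enclosing <div>."""
    body = "".join(f"<p>{p}</p>" for p in paragraphs)
    return f'<div lang="ar" dir="rtl">{body}</div>'
```

With six paragraphs, with_controls emits six pairs of invisible characters that must each be kept inside the right tags, while with_markup emits a single visible attribute.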

Things would get even less clear if you added additional <div>s inside the original one, rather than <p>s, eg. in order to add multiple background images. You'd have to get the control codes into the <div>s that represent a paragraph but not into the others, ie. whereas <p>s usually represent paragraph boundaries, <div>s may or may not.

Same goes for list items.  You'd need to add controls to each list item, unless a list item contained other lists or p tags etc., in which case you'd need to identify what constitutes a paragraph in each case and add/move the control characters to suit.

If you were just using markup, you'd be fine leaving the dir on the original <div> we talked about. You wouldn't need to do more.

So this issue is about the difficulty of deciding what constitutes a 'paragraph' in marked up text.



Using controls for inline text
------------------------------

If we had some text that was clearly inline, such as: 

The title says "<quote dir="rtl">w3c ,YTIVITCA NOITAZILANOITANRETNI</quote>"
in Hebrew.

Then I'd probably be less worried about people using control codes instead of the markup.

There are still some issues from the point of view of the content author, however:

1. the invisibility of the codes can be a pain, partly because you may not be able to see how many of them are there, or where they are, and partly because if you wanted to increase the amount of text in the quotation above, you would need to make sure you had done so within the relevant control characters (and there may be more than one set of these).

2. there is commonly markup there already, eg. to delimit a quote or for language labeling, and it's really easy and convenient to add an attribute in that case. 

Note also that even if we recommend that people don't use markup in such a situation, we won't achieve a consistent approach that way, because we can't control user behavior and we still have to deal with legacy content.

The other thing is that you'd still have to inspect the document hierarchy to ascertain the directionality of the text outside the quote above, so I'm not sure what it would gain to use control codes for the inline stuff.
Surely you could convert the markup to control codes for insertion into the database at the same time as looking up the overall context and adding information about that?
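
A conversion of that kind might look roughly like this. It is a minimal Python sketch under stated assumptions: markup_to_controls is a hypothetical name, the regular expression only handles a flat, non-nested element with a double-quoted dir attribute, and the element itself is dropped, so any other information it carried (eg. that it was a quote) is lost.

```python
import re

LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

# Matches <name ... dir="rtl|ltr" ...>content</name> for a single, unnested element.
_DIR_ELEMENT = re.compile(
    r'<(\w+)[^>]*\bdir="(rtl|ltr)"[^>]*>(.*?)</\1>', re.DOTALL
)

def markup_to_controls(source):
    """Replace each dir-bearing inline element with embedding controls."""
    def repl(match):
        opener = RLE if match.group(2) == "rtl" else LRE
        return f"{opener}{match.group(3)}{PDF}"
    return _DIR_ELEMENT.sub(repl, source)
```

The direction inherited from elements higher in the hierarchy is not captured by this step; it would still have to be looked up and stored separately.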


Improving search
----------------

In your mail you mentioned that search would be improved.  Is that relevant?
I think you'd normally ignore control codes when searching and could just as easily ignore the markup.  I think you are searching on characters rather than on bidi information.  The characters are in logical order usually, and if not you have bigger problems than this.
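
As a rough illustration of ignoring both, the Python sketch below reduces a string to a search key. The name search_key is hypothetical, the tag stripping is deliberately naive (a real tool would use an HTML parser), and the list of controls covers only the marks, embeddings and overrides.

```python
import re

# LRM, RLM, LRE, RLE, PDF, LRO, RLO
BIDI_CONTROLS = "\u200e\u200f\u202a\u202b\u202c\u202d\u202e"

_TAG = re.compile(r"<[^>]*>")
_STRIP_CONTROLS = str.maketrans("", "", BIDI_CONTROLS)

def search_key(source):
    """Drop tags and bidi controls, keeping characters in logical order."""
    return _TAG.sub("", source).translate(_STRIP_CONTROLS)
```

Under these assumptions, the same logically ordered text yields the same key whether its direction was expressed with a dir attribute or with RLE and PDF.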


Consistency
-----------

I doubt we will ever be able to achieve complete consistency in approach, even if we change our recommendations.  If you are extracting stuff into a database, you'll still have to be prepared to deal with people who use markup and/or control codes - and there'll also be legacy content.

In fact, if we revise the Unicode Note to say that you can use control codes (which I've personally felt for some time would be appropriate), you'll end up with even more inconsistency in usage.  

There may be individual documents that are more consistent wrt attributes and inline text usage, but you'll still need to inspect the document markup hierarchy for both in order to determine the wider directional context.


Summary
-------

To summarise my current thinking: We should perhaps use a more fine-grained approach to our recommendations. I think that markup is needed for setting directional context in the document flow, but that it isn't a sin to use control codes for inline spans of text. The decision as to which is the best approach for inline text should probably be left to the author, who may find control characters either harder or easier to work with than markup.

I still can't see how using control characters makes it much easier to work with text in databases. You always have to be prepared to deal with markup, because it's out there. I don't think there's much of an impact on searching either, since you need to ignore both control characters and markup equally if you are searching marked up text.

I think it may be worth lobbying user agent implementers, to request that copy/paste operations convert bidi information to control codes for plain text targets, and editor implementers, to request that they handle markup better for editing.
	
What am I still missing?

RI


============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/

 

> -----Original Message-----
> From: Tex Texin [mailto:tex@yahoo-inc.com]
> Sent: 15 February 2008 19:50
> To: Richard Ishida
> Cc: WWW International
> Subject: RE: New Working Group Note: Best Practices for XML 
> Internationalization
> 
> Hi Richard, Yves, Felix, Jony, Jeremy,
> 
> It's a bit of chicken and egg- if the W3C recommends markup, then the 
> editor vendors are disincented to provide better control code management.
> 
> The issues you raise with managing the scope across the document 
> perhaps reflect the inadequacy of markup design for rtl languages.
> But most of the scope management issues are easily resolved with css 
> anyway.
> 
> The representation of text should be standardized. I shouldn't have a 
> string in an attribute that has to be written one way and a string in 
> a table cell that is written another. I should be able to extract a 
> string from a database and have it work equally well regardless of the 
> context. I should be able to search for the string with a standardized 
> or normalized representation.
> 
> For higher levels of page layout and flow control, markup is quite 
> appropriate.
> But at the level of simply representing a text string, the fact that 
> we have to say follow the recommendation, except in places where you 
> can't, such as attributes, highlights that the recommendation is misguided.
> 
> Using control codes would provide a standardized representation (for 
> string level) and would work in both markup and in plain text, would 
> simplify search, and would offer a wysiwyg view for anyone using an 
> RTL- capable editor.
> 
> It would simplify implementation since the need to have code that 
> exchanged control codes for markup depending on the context would be 
> eliminated.
> 
> I understand that the authors were following the W3C/Unicode 
> recommendation and I respect both that and the need for due process to 
> change things.
> 
> If we are going to include the reference in a best practices document, 
> perhaps the issues with following the recommendation should also be cited.
> 
> A separate effort to have the recommendation reconsidered should be 
> initiated.
> As for the recommendation being recently updated, it is true, but it 
> was revised to include mention of new unicode characters not as a 
> review of the entire content.
> 
> We should set the bar for a best practices document to be something 
> that implementors can trust. While our recommended support for RTL is 
> workable it isn't optimal or best and I think best practice is in fact 
> to deviate from the recommendation.
> 
> tex
>
Received on Thursday, 28 February 2008 09:18:27 UTC