Unicode controls vs markup for text direction

Tex, and others,

For the record, I have never felt completely comfortable with the idea of
banning the use of control codes outright, and I've usually tried not to be
dogmatic about it, but there are a number of ways in which I think the idea
of replacing markup with control codes is not necessarily a straightforward
solution. 

I also need to get a clearer idea of what exactly you are proposing, Tex.

So let me try to outline what I think you may be suggesting, and some of my
concerns with some worked examples.

I'm afraid this necessarily became a little long.





I think we agree that markup is the most effective way of setting
directional context at a level above the 'paragraph', or for "higher levels
of page layout and flow control" as you put it, eg. on an html tag or a
table tag. Not only is markup effective at this level, but control codes are
defined to work only within 'paragraphs'. Part of the difficulty in dealing
with marked up content is defining what constitutes a paragraph.


Using control codes at the paragraph level
-----------------------------------------
Suppose we have a div tag in an LTR document that has RTL content, ie. a
different directional context from the markup above it in the hierarchy.
This could be an entry in an RSS feed, or one blog post in a list of posts
on a multilingual blog, or a number of other things, but let's, for the sake
of discussion, start small with a div like the following in an LTR document.

	 <div lang="ar"><img/> TXET CIBARA</div>

(Note that you'd need some indication of directional context in order to
make the image appear to the left of the text. Note also that this example
assumes that your editor displays source text intelligently.)

If using markup, you'd put a dir='rtl' on the <div> tag to set this up.

I think you are proposing that it would be better to put an RLE immediately
after <div ...ar"> and a PDF immediately before </div>, ie. at the start and
end of the enclosed text. (If not, skip to the next section.)
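
To make the two alternatives concrete, here is a minimal sketch in Python.
The helper name wrap_rtl and the placeholder content are assumptions made
for illustration only; they are not part of anyone's proposal.

```python
# Sketch only: wrap_rtl and the placeholder content are illustrative names.
RLE = "\u202B"  # U+202B RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # U+202C POP DIRECTIONAL FORMATTING


def wrap_rtl(text: str) -> str:
    """Enclose text in an RLE ... PDF pair."""
    return f"{RLE}{text}{PDF}"


content = "<img/> ARABIC TEXT"  # stand-in for logical-order Arabic content

# Approach 1: direction set by markup on the div.
markup_version = f'<div lang="ar" dir="rtl">{content}</div>'

# Approach 2: direction set by control codes around the enclosed text.
control_version = f'<div lang="ar">{wrap_rtl(content)}</div>'

# Print with escapes, since the controls themselves do not display.
print(ascii(control_version))
```

Printing with ascii() is only there to make the two control characters
visible; in most editors the second version looks like a div with no
directional information at all.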

If the div or the content of the div is extracted into a database, these
codes would go with it.

If you then decided to enclose the text in a <p> element (still inside the
div), you would have to ensure that the <p> markup went between the control
codes and the <div> markup.  Here's a first issue: that's risky, given that
these codes may be invisible. There is a good possibility that people,
especially if they don't know much about bidi (eg. non-Arabic speakers
working on the file), will screw things up here.

This issue is down to the invisibility of the controls. In theory you could
counter that by using escaped characters, however remember that editors
currently make life difficult for you if you use escapes in bidi text, by
splitting off the punctuation involved in the escape syntax. Also I've found
that editing environments or processing tools often convert these escapes to
characters automatically.
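
As a small illustration of that last point, the sketch below uses Python's
standard html.unescape to stand in for any tool that resolves character
references. The sample string is made up.

```python
import html

# Escaped controls are visible in the source ...
escaped = '<div lang="ar">&#x202B;ARABIC TEXT&#x202C;</div>'

# ... until something resolves the character references, at which point
# they are back to being invisible characters.
resolved = html.unescape(escaped)

print(ascii(resolved))
print(len(escaped) - len(resolved))  # characters that disappeared from view
```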

Imagine next that the text inside the div was much longer, and we decided to
split it into, say, 6 separate <p>'s.  Now we'd have to introduce control
codes at the beginning and end of each paragraph to be consistent.  This
introduces another issue: this is a lot of work, compared to just leaving
the dir on the <div> that surrounds all the content and never having to
change it.
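
A rough sketch of the difference in effort, assuming six placeholder
paragraphs (the function names are invented for the example):

```python
RLE, PDF = "\u202B", "\u202C"  # RIGHT-TO-LEFT EMBEDDING, POP DIRECTIONAL FORMATTING


def paragraphs_with_controls(paras):
    """Every paragraph needs its own RLE ... PDF pair."""
    return "".join(f"<p>{RLE}{p}{PDF}</p>" for p in paras)


def paragraphs_with_markup(paras):
    """Paragraphs inherit direction from the enclosing element."""
    return "".join(f"<p>{p}</p>" for p in paras)


paras = [f"paragraph {i}" for i in range(1, 7)]  # placeholder content

controls_version = f'<div lang="ar">{paragraphs_with_controls(paras)}</div>'
markup_version = f'<div lang="ar" dir="rtl">{paragraphs_with_markup(paras)}</div>'

# Count how many directional annotations each version has to maintain.
control_count = sum(controls_version.count(c) for c in (RLE, PDF))
dir_count = markup_version.count("dir=")
print(control_count, dir_count)
```

The markup version carries one attribute regardless of how the content is
later split; the control-code version carries a pair per paragraph.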

Things would get even less clear if you added additional <div>s inside the
original one, rather than <p>s, eg. in order to add multiple background
images. You'd have to put the control codes into those divs that represented
a paragraph and not into the others, ie. whereas <p>s usually represent
paragraph boundaries, <div>s may or may not.

Same goes for list items.  You'd need to add controls to each list item,
unless a list item contained other lists or p tags etc., in which case you'd
need to identify what constitutes a paragraph in each case and add/move the
control characters to suit.

If you were just using markup, you'd be fine leaving the dir on the original
<div> we talked about. You wouldn't need to do more.

So this issue is about the difficulty of deciding what constitutes a
'paragraph' in marked up text.



Using controls for inline text
------------------------------

If we had some text that was clearly inline, such as: 

The title says "<quote dir="rtl">w3c ,YTIVITCA NOITAZILANOITANRETNI</quote>"
in Hebrew.

Then I'd probably be less worried about people using control codes instead
of the markup.

There are still some issues from the point of view of the content author,
however:

1. the invisibility of the codes can be a pain, partly because you may not
be able to see how many of them are there, or where they are, and partly
because if you want to increase the amount of text in the quotation above,
you need to make sure you have done so within the relevant control
characters (and there may be more than one set of these).

2. there is commonly markup there already, eg. to delimit a quote or for
language labeling, and it's really easy and convenient to add an attribute
in that case. 
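
On the first point, a tool can at least make the controls countable. The
following is a minimal sketch; the label table and function names are
assumptions for illustration, and the table covers only the embedding,
override and mark characters.

```python
# Bidi control characters mapped to their usual abbreviations.
BIDI_CONTROLS = {
    "\u200E": "LRM", "\u200F": "RLM",
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
}


def reveal(text: str) -> str:
    """Replace each invisible control with a visible [NAME] label."""
    return "".join(
        f"[{BIDI_CONTROLS[ch]}]" if ch in BIDI_CONTROLS else ch for ch in text
    )


def count_controls(text: str) -> int:
    """Count the bidi controls present in the text."""
    return sum(1 for ch in text if ch in BIDI_CONTROLS)


sample = 'The title says "\u202BHEBREW TITLE\u202C" in Hebrew.'
print(reveal(sample))
print(count_controls(sample))
```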

Note also that even if we recommend that people don't use markup in such a
situation, we won't achieve a consistent approach that way: we can't control
user behavior, and we still have to deal with legacy content.

The other thing is that you'd still have to inspect the document hierarchy
to ascertain the directionality of the text outside the quote above, so I'm
not sure what it would gain to use control codes for the inline stuff.
Surely you could convert the markup to control codes for insertion into the
database at the same time as looking up the overall context and adding
information about that?
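
As a sketch of what such a conversion step might look like, the following
uses Python's standard html.parser to turn inline dir attributes into
embedding controls while dropping the tags. The class and function names are
invented for the example, and it deliberately ignores everything a real
converter would need (block boundaries, inherited direction, bdo, and so on).

```python
from html.parser import HTMLParser

LRE, RLE, PDF = "\u202A", "\u202B", "\u202C"


class DirToControls(HTMLParser):
    """Hypothetical converter: dir="rtl"/"ltr" becomes RLE/LRE ... PDF."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # (tag, emitted_control) for each open element

    def handle_starttag(self, tag, attrs):
        direction = (dict(attrs).get("dir") or "").lower()
        emitted = direction in ("rtl", "ltr")
        if emitted:
            self.out.append(RLE if direction == "rtl" else LRE)
        self.stack.append((tag, emitted))

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray end tag: ignore it
        while True:
            open_tag, emitted = self.stack.pop()
            if emitted:
                self.out.append(PDF)  # close the embedding opened earlier
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)


def to_plain_text(fragment: str) -> str:
    parser = DirToControls()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)


source = 'The title says "<quote dir="rtl">HEBREW TITLE</quote>" in Hebrew.'
print(ascii(to_plain_text(source)))
```

Note that this only handles the inline span; the direction of the text
around it still has to be looked up from the document hierarchy.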


Improving search
----------------

In your mail you mentioned that search would be improved.  Is that relevant?
I think you'd normally ignore control codes when searching and could just as
easily ignore the markup.  I think you are searching on characters rather
than on bidi information.  The characters are in logical order usually, and
if not you have bigger problems than this.
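
To illustrate, here is a minimal sketch of normalizing text before a search.
The regular expressions are naive assumptions (in particular, the tag
pattern is not a real markup parser), but they show that both forms reduce
to the same searchable characters.

```python
import re

# Embedding, override and mark controls; a naive pattern for tags.
BIDI_CONTROLS_RE = re.compile("[\u200E\u200F\u202A-\u202E]")
TAG_RE = re.compile(r"<[^>]+>")


def searchable(text: str) -> str:
    """Strip tags and bidi controls, leaving the logical-order characters."""
    return BIDI_CONTROLS_RE.sub("", TAG_RE.sub("", text))


with_markup = 'The title says "<quote dir="rtl">TITLE</quote>" in Hebrew.'
with_controls = 'The title says "\u202BTITLE\u202C" in Hebrew.'

print(searchable(with_markup) == searchable(with_controls))
```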


Consistency
-----------

I doubt we will ever be able to achieve complete consistency in approach,
even if we change our recommendations.  If you are extracting stuff into a
database, you'll still have to be prepared to deal with people who use
markup and/or control codes - and there'll also be legacy content.

In fact, if we revise the Unicode Note to say that you can use control codes
(which I've personally felt for some time would be appropriate), you'll end
up with even more inconsistency in usage.  

There may be individual documents that are more consistent wrt attributes
and inline text usage, but you'll still need to inspect the document markup
hierarchy for both in order to determine the wider directional context.


Summary
-------

To summarise my current thinking: We should perhaps use a more fine-grained
approach to our recommendations. I think that markup is needed for setting
directional context in the document flow, but that it isn't a sin to use
control codes for inline spans of text.  The decision as to which is the
best approach for inline text should probably be left to the author, who may
find control characters either harder or easier to work with than markup.

I still can't see how using control characters makes it much easier to work
with text in databases. You always have to be prepared to deal with markup,
because it's out there. I don't think there's much of an impact on searching
either, since you need to ignore both control characters and markup equally
if you are searching marked up text.

I think it may be worth lobbying user agent implementers, to request that
copy/paste operations convert bidi information to control codes for plain
text targets, and editor implementers, to request that they handle markup
better for editing.
	
What am I still missing?

RI


============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)
 
http://www.w3.org/International/
http://rishida.net/blog/
http://rishida.net/

 

> -----Original Message-----
> From: Tex Texin [mailto:tex@yahoo-inc.com]
> Sent: 15 February 2008 19:50
> To: Richard Ishida
> Cc: WWW International
> Subject: RE: New Working Group Note: Best Practices for XML
> Internationalization
> 
> Hi Richard, Yves, Felix, Jony, Jeremy,
> 
> It's a bit of chicken and egg- if the W3C recommends markup, then the
> editor vendors are disincented to provide better control code management.
> 
> The issues you raise with managing the scope across the document perhaps
> reflect the inadequacy of markup design for rtl languages.
> But most of the scope management issues are easily resolved with css
> anyway.
> 
> The representation of text should be standardized. I shouldn't have a
> string in an attribute that has to be written one way and a string in a
> table cell that is written another. I should be able to extract a string
> from a database and have it work equally well regardless of the context. I
> should be able to search for the string with a standardized or normalized
> representation.
> 
> For higher levels of page layout and flow control, markup is quite
> appropriate.
> But at the level of simply representing a text string, the fact that we
> have to say follow the recommendation, except in places where you can't,
> such as attributes, highlights that the recommendation is misguided.
> 
> Using control codes would provide a standardized representation (for
> string level) and would work in both markup and in plain text, would
> simplify search, and would offer a wysiwyg view for anyone using an RTL-
> capable editor.
> 
> It would simplify implementation since the need to have code that
> exchanged control codes for markup depending on the context would be
> eliminated.
> 
> I understand that the authors were following the W3C/Unicode
> recommendation and I respect both that and the need for due process to
> change things.
> 
> If we are going to include the reference in a best practices document,
> perhaps the issues with following the recommendation should also be cited.
> 
> A separate effort to have the recommendation reconsidered should be
> initiated.
> As for the recommendation being recently updated, it is true, but it was
> revised to include mention of new unicode characters not as a review of
> the entire content.
> 
> We should set the bar for a best practices document to be something that
> implementors can trust. While our recommended support for RTL is workable
> it isn't optimal or best and I think best practice is in fact to deviate
> from the recommendation.
> 
> tex
> 

Received on Wednesday, 20 February 2008 13:37:09 UTC