New Internet-Draft on Bidirectionality in Identifiers from Martin Duerst on 2001-07-16 (uri@w3.org from July 2001)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 16 Jul 2001 18:32:55 +0900
To: (Recipient list suppressed)
Message-Id: <4.2.0.58.J.20010716182210.03489820@sh.w3.mag.keio.ac.jp>

Hello everybody,

The following draft was submitted last Friday to the Internet-Draft
editor and should appear soon at
http://www.ietf.org/internet-drafts/draft-duerst-iri-bidi-00.txt.

It has been separated out from
http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt,
Section 3.2, to allow more in-depth and focused discussion of the
specific problems of bidirectionality.

The draft suggests uri@w3.org as a mailing list for discussion,
but any of the lists that I have sent this mail to is also
okay (please excuse the multiple postings; I made sure that
follow-ups won't be cross-posted).

This is not a final document; in particular, Section 5, item 3)
needs more work. Any comments are welcome.

Regards,    Martin.

=========================================================================
INTERNET-DRAFT                                              Martin Duerst
                                                       W3C/Keio University
draft-duerst-iri-bidi-00.txt
Expires January 2002                                        July 13, 2001


            Internet Identifiers and Bidirectionality


Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other
groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

This document is not a product of any working group, but should be
discussed on the mailing list <uri@w3.org>. Comments of an editorial
nature should be sent directly to the author. For more information
on the topic of this Internet-Draft, please also see [W3C IRI].


Abstract

This memo describes how to deal with Internet identifiers containing
characters from scripts such as Arabic and Hebrew, which use right-to-
left or bidirectional writing. The solution proposed addresses three
different contexts: The purely graphical representation of such
identifiers, e.g. on paper, the embedding of such identifiers into
running text with established rules for bidirectionality, and the
processing and resolution of such identifiers.


0. Change history

Version 00:

This memo has been separated out from [IRI], Section 3.2 to allow
more in-depth and focused discussion of the specific problems of
bidirectionality.


1. Introduction

There is an increased tendency to allow identifiers to use a wide
range of characters from the scripts of the world. The Universal
Character Set (UCS, see [Unicode] and [ISO10646]) makes it easy to
use and exchange such identifiers digitally. With the appropriate
care (similar to the care needed to avoid confusion between '1', 'l',
and 'I' in US-ASCII-based identifiers), such identifiers can also
be exchanged non-digitally, e.g. written down visually on a medium
such as paper. Potential examples of such identifiers include
Internationalized Resource Identifiers [IRI], Internationalized
Domain Names [IDN], and internationalized email addresses.

Some characters, in particular those of the Arabic and the Hebrew
script, are written from right to left. Together with characters
written from left to right, or with digits that are written from
left to right even in these scripts, this gives rise to the
mixture of different writing directions, a phenomenon called
bidirectionality. Dealing with bidirectionality is indispensable
for the proper treatment of text written with the Arabic or Hebrew
script. But it is highly complex because user expectations may
depend on context and are often difficult to identify and express.

This memo deals with the specific problems of Internet identifiers
containing right-to-left characters, hereafter called bidirectional
identifiers.

The basic paradigm of all modern bidirectional text handling solutions
is the distinction between digital backing store, where text is stored
in logical order, and rendering (display or printing), for which the
necessary reordering is applied according to well-defined rules.
'Logical order' in this context is the order in which the characters
in the text are pronounced or spelled out.

Using logical order in the digital backing store simplifies a large
number of operations, including sorting, searching, text-to-speech
conversion, various other kinds of linguistic processing, input from
keyboards and other devices, and rendering-related operations such as
line breaking and text reflow. The alternative is to use display order
even in the backing store, but this makes some of the operations above
much more complex and others impossible.

For general text (e.g. average prose), the Unicode bidirectional
algorithm [UnicodeBidi] is the single widely accepted and used reference
for providing this reordering from logical order to rendering. The
Unicode bidirectional algorithm consists of an implicit part (producing
adequate results in most cases) and explicit formatting characters for
advanced cases.

The Unicode bidirectional algorithm also allows higher-order protocols
to override certain aspects of the algorithm. A case where this has
been done is the 'dir' attribute in [HTML4].

Bidirectional Internet identifiers primarily are used in three different
contexts:

1) In visual form: This includes display on CRTs and LCDs as well as
    more permanent visual forms such as printing. At least as far
    as the reading of individual components is concerned, the visual
    form has to use the inherent directionality of the characters used.
    Otherwise, identification, reading, transcription, and so on are
    severely affected.

2) In digital form inside running text (e.g. an IRI or an email
    address in an email or on a web page). It is not always easy
    or possible to distinguish identifiers from other text.

3) In digital form on its own (e.g. in a structured format or
    database of identifiers, or when transmitted for resolution).
    It should be possible to process bidirectional identifiers
    in the same way as other Internet identifiers.

This memo addresses all three of these cases as well as the conversion
between them. The specifics of bidirectional text and of identifier
structure make it impossible to design a solution that works without
additional effort (when compared to non-bidirectional identifiers).
However, the solution proposed in this memo is designed to make the
best out of the severe constraints.


2. Notational Conventions

Keywords in all upper-case such as MUST and SHOULD are defined
in [RFC 2119]. For examples, lower-case letters are used for
letters that flow left to right. Upper-case letters stand for
letters that flow from right to left. A left-to-right example
would be 'hello', whereas a right-to-left example would be
'OLLEH'.

For bidirectional formatting characters from [Unicode], the [XML]-style
entity notation is used, as follows:

&lrm;    U+200E     LEFT-TO-RIGHT MARK
&rlm;    U+200F     RIGHT-TO-LEFT MARK
&lre;    U+202A     LEFT-TO-RIGHT EMBEDDING
&rle;    U+202B     RIGHT-TO-LEFT EMBEDDING
&pdf;    U+202C     POP DIRECTIONAL FORMATTING
&lro;    U+202D     LEFT-TO-RIGHT OVERRIDE
&rlo;    U+202E     RIGHT-TO-LEFT OVERRIDE

Only the first two are defined in [HTML4]; the others are
replaced by the 'dir' attribute and the <bdo> element.
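
As an informal illustration (not part of the proposal), the notation
above can be expanded mechanically into the actual characters. The
following Python sketch assumes a hypothetical helper name,
expand_entities, and simply maps the seven entity-style names listed
above to their code points.

```python
import unicodedata

# Entity-style names used in this memo, mapped to the characters they denote.
BIDI_FORMATTING = {
    "lrm": "\u200E",
    "rlm": "\u200F",
    "lre": "\u202A",
    "rle": "\u202B",
    "pdf": "\u202C",
    "lro": "\u202D",
    "rlo": "\u202E",
}


def expand_entities(text):
    """Replace the &name; notation with the actual formatting characters.

    Hypothetical helper for reading the examples in this memo; it is not
    a general XML or HTML entity parser.
    """
    for name, char in BIDI_FORMATTING.items():
        text = text.replace("&" + name + ";", char)
    return text


# Print the table, taking the character names from the Unicode database.
for name, char in BIDI_FORMATTING.items():
    print("&%s;  U+%04X  %s" % (name, ord(char), unicodedata.name(char)))
```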


3. Identifier Structure

Most Internet identifiers have an inherent structure that distinguishes
structural characters (usually punctuation such as '@', '.', ':', '/',
and so on) and payload components (usually formed with plain alphabetic
or alphanumeric characters).

In order to be able to process bidirectional identifiers in the same
way as other identifiers, it is crucial that in the digital
representations, the individual structural characters and identifier
components are stored in the same sequence as for other identifiers.

The main problem to solve for the visual representation of bidirectional
identifiers is whether the general sequence of components and syntax
characters should be from left to right or from right to left, i.e.
whether the right-to-left equivalent of "ftp.example.com" should be
"MOC.ELPMAXE.PTF" or "PTF.ELPMAXE.MOC". The former one may be
seen as more natural in a purely right-to-left context. But there is also
the possibility of mixed identifiers such as "PTF.ELPMAXE.com".
These provide a very strong motivation for maintaining the same
left-to-right overall component sequence for all Internet identifiers.

The Unicode bidirectional algorithm, extremely simplified, tries to
reorder continuous sequences of right-to-left characters between
continuous sequences of left-to-right characters. A third category,
called neutrals, is processed in the same way as surrounding characters.
The main problem for identifiers is that all the structural characters
are treated as neutrals by the Unicode algorithm, which means that they
are moved around together with their context. As an example, the
logical sequence FTP.EXAMPLE.com (corresponding to the example above),
without additional care is displayed as ELPMAXE.PTF.com, which is
obviously highly confusing.
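
The effect described above can be reproduced with a toy model. The
Python sketch below is NOT the Unicode bidirectional algorithm; it
assumes the notation of Section 2 (upper-case letters are right-to-
left, lower-case letters are left-to-right), treats every other
character as a neutral, and ignores digits, explicit formatting
characters, and nesting. The function name toy_bidi_display is made
up for this illustration.

```python
def toy_bidi_display(logical, base="L"):
    """Reorder a logical string for display using a toy bidi model.

    Upper-case letters count as right-to-left ("R"), lower-case letters
    as left-to-right ("L"), everything else as neutral ("N").
    """

    def strong_class(ch):
        if ch.isalpha():
            return "R" if ch.isupper() else "L"
        return "N"

    classes = [strong_class(ch) for ch in logical]
    n = len(classes)

    # A run of neutrals takes the direction of its neighbours if they
    # agree, and the base direction otherwise.
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        before = classes[i - 1] if i > 0 else base
        after = classes[j] if j < n else base
        direction = before if before == after else base
        classes[i:j] = [direction] * (j - i)
        i = j

    # Split into same-direction runs and reverse the right-to-left ones.
    runs = []
    i = 0
    while i < n:
        j = i
        while j < n and classes[j] == classes[i]:
            j += 1
        run = logical[i:j]
        runs.append(run[::-1] if classes[i] == "R" else run)
        i = j

    # In a right-to-left base context the runs themselves are laid out
    # from right to left.
    if base == "R":
        runs.reverse()
    return "".join(runs)


print(toy_bidi_display("FTP.EXAMPLE.com"))  # ELPMAXE.PTF.com
```

The first '.' sits between two right-to-left components and is
therefore reversed together with them, which is exactly how the two
components end up swapped.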


4. Bidirectional Identifiers in Context

4.1 Independently Processed Bidirectional Identifiers

Bidirectional identifiers processed independently, i.e. stored or
transmitted for resolution, MUST be in full logical order both
for the overall structure as well as for the individual components.
They MUST conform directly to the relevant syntax rules.

4.2 Visual Rendering of Bidirectional Identifiers

Bidirectional identifiers MUST be rendered visually by laying out the
components and structural characters in sequence from left to right.
Each component MUST be rendered according to its natural direction
(i.e. left-to-right for components with left-to-right characters,
right-to-left for components with right-to-left characters).
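
A minimal sketch of this rendering rule, again using the notation of
Section 2 (upper-case stands for right-to-left letters), might look
as follows in Python. The set of structural characters and the name
render_identifier are assumptions made for the example; a real
implementation would take the structural characters from the syntax
of the identifier in question.

```python
import re

# Assumed set of structural characters for this illustration only.
STRUCTURAL = re.compile(r"([@.:/])")


def render_identifier(logical):
    """Return the visual form of a logical-order identifier.

    Components and structural characters keep their left-to-right
    sequence; each right-to-left (here: upper-case) component is
    reversed in place.
    """
    parts = STRUCTURAL.split(logical)
    return "".join(p[::-1] if p.isupper() else p for p in parts)


print(render_identifier("FTP.EXAMPLE.com"))  # PTF.ELPMAXE.com
```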

4.3 Bidirectional Identifiers in Textual Context

In textual context, i.e. assuming rendering by the Unicode bidirectional
algorithm, the backing store representation prescribed in Section 4.1
and the visual rendering prescribed in Section 4.2 have to be
combined. This is done as follows:

- Each component with right-to-left characters is preceded and
   followed by an &lrm;. This left-to-right mark provides a
   left-to-right context to intervening syntactic characters.

- If the overall context (base directionality) is right-to-left,
   the identifier is preceded by an &lre; and followed by a &pdf;.
   This makes sure that the components of the identifier are
   rendered in left-to-right order. This may also be done by
   using the equivalent features of a higher-order protocol
   (e.g. by using the dir='ltr' attribute in HTML).
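
The two rules above can be sketched in Python as follows. This is an
illustration under stated assumptions rather than a reference
implementation: the set of structural characters, the function names,
and the base_rtl parameter are invented for the example, and a
component is taken to be right-to-left if it contains any character
of Unicode bidirectional class R or AL.

```python
import re
import unicodedata

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Assumed set of structural characters for this illustration only.
STRUCTURAL = re.compile(r"([@.:/])")


def is_rtl_component(component):
    """True if the component contains a right-to-left character."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL")
               for ch in component)


def to_textual(logical, base_rtl=False):
    """Add the formatting characters described in Section 4.3."""
    parts = STRUCTURAL.split(logical)
    marked = "".join(LRM + p + LRM if is_rtl_component(p) else p
                     for p in parts)
    # In a right-to-left context, wrap the whole identifier in an
    # explicit left-to-right embedding.
    return LRE + marked + PDF if base_rtl else marked


# Two Hebrew letters followed by ".com", shown with escapes.
print(ascii(to_textual("\u05D0\u05D1.com")))
```

In HTML, the wrapping step would typically be left to the dir='ltr'
attribute instead of inserting &lre; and &pdf; directly.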

4.4 Conversions

Conversion from textual context to visual representation is done
simply by applying the Unicode bidirectional algorithm, i.e. by
passing the whole text to an appropriate rendering engine.

Conversion from processing representation to textual context is
done by adding the necessary formatting characters as described
in Section 4.3.

Conversion from textual context to processing representation is
done by removing the formatting characters at the positions
described in Section 4.3. For international domain names, this
can e.g. be integrated in [Nameprep].
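
As a rough sketch of this removal step, the Python function below
strips only what Section 4.3 adds: an enclosing &lre; ... &pdf; pair
and an &lrm; on both sides of a component. Formatting characters in
any other position are deliberately left in place so that a later
check can reject them. The function name and the set of structural
characters are assumptions made for the example.

```python
import re

LRM = "\u200E"  # LEFT-TO-RIGHT MARK
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# Assumed set of structural characters for this illustration only.
STRUCTURAL = re.compile(r"([@.:/])")


def to_processing(textual):
    """Remove the formatting characters added per Section 4.3."""
    # Drop an enclosing left-to-right embedding, if present.
    if textual.startswith(LRE) and textual.endswith(PDF):
        textual = textual[1:-1]
    cleaned = []
    for part in STRUCTURAL.split(textual):
        # Drop the pair of marks surrounding a component.
        if len(part) >= 2 and part[0] == LRM and part[-1] == LRM:
            part = part[1:-1]
        cleaned.append(part)
    return "".join(cleaned)


textual = LRE + LRM + "\u05D0\u05D1" + LRM + ".com" + PDF
print(ascii(to_processing(textual)))
```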

Conversion from visual representation to processing representation
is done by inputting the identifier, component-by-component from
left to right, using the natural reading order for each component.

From these four conversions, the remaining conversions can be
easily constructed. Any other procedure that leads to the same results
is also allowed.


5. Restrictions

The definitions and conversions in Section 4 only work under the
following restrictions.

1) A component MUST NOT use both right-to-left and left-to-right
    characters.

2) A component MUST NOT contain bidirectional formatting characters
    except for those and in those positions as defined in Section 4.3.

3) A component using right-to-left characters MUST NOT use any other
    class of characters (e.g. neutrals or numbers).

Restrictions 1) and 2) are not very severe, in that they do not overly
restrict useful identifiers. Also, trying to remove them would make it
impossible for humans to predict the logical sequence of characters
inside a single component. On the other hand, it would be very desirable
to remove or at least soften restriction 3). Otherwise, it is impossible
to combine Arabic or Hebrew letters with numbers, or to use a hyphen 
between two subcomponents of an Arabic component to avoid the cursive
connection of the two subcomponents. To a certain extent, softening this
restriction should be readily achievable by adding further formatting
characters in well-defined ways similar to the provisions in Section 4.3.
Feedback on this issue is particularly welcome.
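
To make the three restrictions concrete, the following Python sketch
checks a single component that is already in processing form (i.e.
after the marks of Section 4.3 have been removed). It is one possible
literal reading of the restrictions, not a normative test: the name
check_component is invented, character classes are taken from the
Unicode character database, and restriction 3) is interpreted as
"no bidirectional class other than R, AL, or L", so that mixing with
left-to-right letters is reported under restriction 1) only. Under
this reading, Arabic or Hebrew combining marks would also be flagged
by restriction 3).

```python
import unicodedata

# The seven bidirectional formatting characters listed in Section 2.
BIDI_FORMATTING = set("\u200E\u200F\u202A\u202B\u202C\u202D\u202E")


def check_component(component):
    """Return the numbers of the Section 5 restrictions that are violated."""
    # Classify everything except the formatting characters themselves,
    # which are handled by restriction 2).
    classes = {unicodedata.bidirectional(ch)
               for ch in component if ch not in BIDI_FORMATTING}
    has_rtl = bool(classes & {"R", "AL"})
    has_ltr = "L" in classes

    violations = []
    if has_rtl and has_ltr:
        violations.append(1)
    if any(ch in BIDI_FORMATTING for ch in component):
        violations.append(2)
    if has_rtl and classes - {"R", "AL", "L"}:
        violations.append(3)
    return violations


print(check_component("\u05D0\u05D1"))  # []
print(check_component("\u05D01"))       # [3]
```

The second call shows the practical cost of restriction 3): a Hebrew
letter followed by a digit is rejected.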


6. Security Considerations

Knowledge of deficiencies of a particular implementation of the above
specification can allow somebody to pretend to resolve a particular
identifier when indeed another identifier is being resolved.


Acknowledgements

The basic idea for the approach proposed in this memo is due to
Francois Yergeau, and goes back to around 1995. Discussions with
Stephen Atkin, Paul Hoffman, and many others provided additional
motivation and insight.


Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph
are included on all such copies and derivative works.  However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other
than English.

The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE."


Author's address

           Martin J. Duerst
           W3C/Keio University
           5322 Endo, Fujisawa
           252-8520 Japan
           duerst@w3.org
           http://www.w3.org/People/D%C3%BCrst/
           Tel/Fax: +81 466 49 1170

           Note: Please write "Duerst" with u-umlaut wherever
                 possible, e.g. as "D&#252;rst" in XML and HTML.


References

[HTML4] "HTML 4.01", World Wide Web Consortium,
   <http://www.w3.org/TR/REC-html40>.

[IDN] Internationalized Domain Name (idn) IETF Working Group. For
   further information, please see
   <http://www.ietf.org/html.charters/idn-charter.html>.

[IRI] L. Masinter, M. Duerst, "Internationalized Resource Identifiers
   (IRI)", Internet Draft, Jan. 2001,
   <http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-07.txt>,
   work in progress.

[ISO10646] ISO/IEC, Information Technology - Universal Multiple-Octet
   Coded Character Set (UCS) - Part 1: Architecture and Basic
   Multilingual Plane, Oct. 2000, with amendments.

[Nameprep] P. Hoffman, M. Blanchet, "Preparation of Internationalized
   Host Names", Internet Draft, Feb. 2001,
   <http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-03.txt>,
   work in progress.

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate
   Requirement Levels", March 1997.

[Unicode] The Unicode Consortium, "The Unicode Standard, Version 3.1",
   consisting of: "The Unicode Standard, Version 3.0", Addison-Wesley,
   Reading, MA, 2000, and "Unicode Standard Annex #27: Unicode 3.1",
   <http://www.unicode.org/unicode/reports/tr27/>, May 2001.

[UnicodeBidi] The Unicode Consortium, "The Unicode Standard, Version
   3.0", Addison-Wesley, Reading, MA, 2000, Section 3.12, pp. 55-69, also
   available at <http://www.unicode.org/unicode/uni2book/ch03.pdf>
   and "Unicode Standard Annex #9: The Bidirectional Algorithm",
   <http://www.unicode.org/unicode/reports/tr9/tr9-9.html>, March 2001.

[W3C IRI] Internationalization - URIs and other identifiers
   <http://www.w3.org/International/O-URL-and-ident.html>.

[XML] "XML 1.0", World Wide Web Consortium Recommendation,
   <http://www.w3.org/TR/REC-xml#sec-external-ent>.
Received on Monday, 16 July 2001 05:35:52 UTC