IRI Templates and Bidi Characters

To go along with Joe's URI Template work, I've been working on support
for IRI Templates.  The key differences between URI and IRI templates
are a) the characters allowed within the {...} tokens and b) the
pct-encoding rules.  Whereas a URI Template is used to produce URIs, an
IRI Template is used to produce IRIs.

As one might expect, there are a number of issues that can make working
with IRI Templates more difficult than URI Templates.  The most
difficult issue is the handling of bidi characters.  I've been working
on some rules that I'd like to get some feedback on.

First, here's my ABNF production for IRI Templates:

  ivalue              = *(iunreserved / pct-encoded)
                        ; replacement value for token

  iunreservedsansdash = (alphanum / "." / "_" / "~" / ucschar)
  iarg                = *(reserved / iunreserved / pct-encoded)
  ivarname            = iunreservedsansdash *(iunreserved)
  ivardefault         = ivalue
  ivar                = ivarname [ "=" ivardefault ]
  ivars               = ivar [*(sep ivar)]
  ivarnodefault       = ivarname
  ivarsnodefault      = ivarname [*(sep ivarname)]

  ioperator           = ( append   "|" iarg  "|" ivar  )          /
                        ( prefix   "|" iarg  "|" ivar  )          /
                        ( join     "|" iarg  "|" ivars )          /
                        ( listjoin "|" iarg  "|" ivarnodefault )  /
                        ( opt      "|" iarg  "|" ivarsnodefault ) /
                        ( neg      "|" iarg  "|" ivarsnodefault ) /
                        ( extop    "|" (iarg / range) "|"
                         (ivar /
                          ivars   /
                          ivarnodefault /
                          ivarsnodefault) )

  itoken              = "{" ( ivar / ioperator ) "}"

  itemplate           = *(reserved / ipchar / iprivate / itoken )
  itemplate-expansion = IRI / IRI-reference

Within this production, the ivar, ivalue and iarg productions can
contain bidi characters.

The rules for handling bidi chars in an IRI Template are:

1. IRI Templates MUST be stored and transmitted in logical order.
2. IRI Templates MUST be rendered using the Unicode bidi algorithm.
3. The entire IRI Template MUST be rendered as if it were in an LTR
   embedding (preceded by U+202A, and followed by U+202C). This is the
   same as for IRIs as defined by RFC 3987.  As with IRIs, there is no
   need to explicitly use this embedding if the template can be
   displayed properly without it.
4. Each pipe-delimited segment in the {...} token is treated as a
   separate component.
5. The first component (the op component) is always rendered LTR.
6. The second component (the arg component) is always rendered LTR, as
   if it were in an LTR override (preceded by U+202D, and followed
   by U+202C).  This ensures that the arg will always be rendered
   in logical order (LTR) in order to avoid any possible confusion.
7. The third component (the var component) is segmented depending on the
   number of vars and specified default values. The following
   illustrates the segmentation:

   <LRM>var</LRM>=<LRO>default</LRO>,<LRM>var</LRM>=<LRO>default</LRO>

   Note that, like the arg component, the default is always rendered
   using an LTR override.  This ensures that the default is always
   presented in logical order.
8. The IRI Template itself MUST NOT contain bidi formatting characters.
   An implementation may wish to provide a modified "for display"
   version of the IRI Template with appropriate bidi formatting
   characters inserted into appropriate locations in the template to
   ensure proper rendering, but those control characters MUST be removed
   prior to processing the template.
9. A component SHOULD NOT use both LTR and RTL characters.
10. A component using RTL characters SHOULD start and end with RTL
    characters.
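
To make rules 3 through 10 a bit more concrete, here is a rough Python
sketch of the "for display" idea from rule 8, plus a simple check for
rules 9 and 10.  The function names (strip_bidi_formatting, for_display,
component_warnings) are made up for illustration, and I am assuming that
<LRM>var</LRM> in rule 7 means a U+200E mark on each side of the
varname.  It also always adds the outer embedding, even though rule 3
says that is optional, and it is not a full template parser.

```python
import re
import unicodedata

LRM, LRE, PDF, LRO = "\u200e", "\u202a", "\u202c", "\u202d"
# The seven bidi formatting characters that RFC 3987 bans from IRIs:
# LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_FORMATTING = re.compile("[\u200e\u200f\u202a-\u202e]")
TOKEN = re.compile(r"\{([^{}]*)\}")
RTL_CLASSES = ("R", "AL")


def strip_bidi_formatting(template):
    """Rule 8: remove display-only controls before processing."""
    return BIDI_FORMATTING.sub("", template)


def _override(text):
    # Rules 6 and 7: arg and default are shown under an LTR override.
    return LRO + text + PDF if text else text


def _display_var(var):
    # Rule 7 (assumed reading): LRM on each side of the varname.
    name, eq, default = var.partition("=")
    return LRM + name + LRM + eq + _override(default)


def _display_token(match):
    content = match.group(1)
    parts = content.split("|")  # rule 4: pipe-delimited components
    if len(parts) == 3:
        op, arg, variables = parts
        shown = ",".join(_display_var(v) for v in variables.split(","))
        body = "|".join([op, _override(arg), shown])
    else:
        body = _display_var(content)  # plain {var} or {var=default}
    return "{" + body + "}"


def for_display(template):
    """Return a display-only copy with bidi controls inserted."""
    clean = strip_bidi_formatting(template)
    # Rule 3: wrap the whole template in an LTR embedding.
    return LRE + TOKEN.sub(_display_token, clean) + PDF


def component_warnings(component):
    """Rules 9 and 10: flag mixed-direction or badly bounded components."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    warnings = []
    if has_rtl and "L" in classes:
        warnings.append("mixes LTR and RTL characters")
    if has_rtl and not (classes[0] in RTL_CLASSES
                        and classes[-1] in RTL_CLASSES):
        warnings.append("does not start and end with RTL characters")
    return warnings
```

Since the controls are display-only, strip_bidi_formatting(for_display(t))
should give back t for any template t that had no controls to begin with.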

To illustrate the effect this has on the template, imagine the following
scenario.  Assume that capital letters are RTL.

I have a template whose logical ordering is:

  http://example.org?{-join|ABCD|EFGH=IJKL,MNOP=qrst}

(Yes, I know it's unlikely that the join separator will be a string of
RTL characters, but I'm doing this to illustrate a point.)

Since the |, = and , characters are directionally neutral, without any
bidi formatting, when rendered the template will end up looking
something like:

  http://example.org?{-join|PONM,LKJI=HGFE|DCBA=qrst}

Which is obviously incorrect and confusing.  It can get even uglier if
the arg and default have a mix of LTR and RTL characters. By contrast,
with the bidi rules applied, the template is rendered as:

  http://example.org?{-join|ABCD|HGFE=IJKL,PONM=qrst}

Notice that the only characters displayed RTL are the varnames.  The
arg and default components, both of which are treated as literal values
to be inserted into the IRI, are displayed in the same logical order in
which they are expected to be inserted into the IRI.

Also note that each of the components appears in the proper order in the
rendered template.  There is no confusion or ambiguity in the template.
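
For anyone who wants to play with the example, here is a toy Python
model of it.  It is not the Unicode bidi algorithm: it only encodes the
convention used in this message (capital letters are RTL, and "|", "="
and "," are neutrals that take the direction of the RTL letters on both
sides).  The names render_plain and render_with_rules are made up for
this sketch, and text outside the {...} tokens is left untouched.

```python
import re

# A run of "RTL" capitals, including neutrals sandwiched between them.
RTL_RUN = re.compile(r"[A-Z]+(?:[|=,]+[A-Z]+)*")
# An operator token: {op|arg|vars}
TOKEN = re.compile(r"\{([^{}|]*)\|([^{}|]*)\|([^{}]*)\}")


def _visual(text):
    """Show each RTL run right-to-left by reversing it."""
    return RTL_RUN.sub(lambda m: m.group(0)[::-1], text)


def render_plain(template):
    """No bidi rules: neutrals between RTL runs get absorbed."""
    return _visual(template)


def render_with_rules(template):
    """Arg and defaults stay in logical order; varnames display RTL."""
    def token(match):
        op, arg, variables = match.groups()
        shown = []
        for var in variables.split(","):
            name, eq, default = var.partition("=")
            # Only the varname is subject to normal bidi display.
            shown.append(_visual(name) + eq + default)
        return "{" + op + "|" + arg + "|" + ",".join(shown) + "}"
    return TOKEN.sub(token, template)


template = "http://example.org?{-join|ABCD|EFGH=IJKL,MNOP=qrst}"
print(render_plain(template))
# http://example.org?{-join|PONM,LKJI=HGFE|DCBA=qrst}
print(render_with_rules(template))
# http://example.org?{-join|ABCD|HGFE=IJKL,PONM=qrst}
```

In the first output the whole arg/var/default stretch collapses into a
single RTL run, which is where the confusing ordering comes from.  In
the second, each component is handled separately, so only the varnames
flip.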

Have I missed anything?

- James

Received on Sunday, 2 December 2007 05:25:55 UTC