Re: Summary of I18N discussion in HTML WG today from fantasai on 2012-11-09 (www-international@w3.org from October to December 2012)

From: fantasai <fantasai.lists@inkedblade.net>
Date: Fri, 09 Nov 2012 10:16:34 -0800
To: Richard Ishida <ishida@w3.org>
CC: Bruce Lawson <brucel@opera.com>, public-html@w3.org, www International <www-international@w3.org>
Message-ID: <509D4882.8060308@inkedblade.net>

On 11/09/2012 07:42 AM, Richard Ishida wrote:
> On 05/11/2012 14:09, Bruce Lawson wrote:
>> On Fri, 02 Nov 2012 10:45:36 -0000, Robin Berjon <robin@w3.org> wrote:
>>
>>>
>>> ### Forward-looking ruby model
>>> Fantasai exposed a set of issues with the current ruby markup that
>>> make it awkward to extend in future for features that we have good
>>> reasons to believe will become increasingly common as HTML is used for
>>> books, scientific publishing, and pretty much everything in the world
>>> in general. These involve jukugo ruby, fallback, double-sided ruby.
>>
>> is this set of issues written up anywhere?
>
>
> Bruce, see http://www.w3.org/TR/ruby-use-cases/.  Fantasai also wrote
> something in a blog post that I tried to represent in the aforementioned doc.

Here's the blog post:
   http://fantasai.inkedblade.net/weblog/2011/ruby/

A key point that's not in the blog post is that there are two fundamentally
different models for doing ruby:

   row-based model
     This is the XHTML Ruby approach, where all the base text is given,
     followed by all the annotations, row by row.

   column-based model
     This is the HTML Ruby approach, where each base is given followed
     immediately by its annotation(s), column by column.
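
To make the difference concrete, here is a rough Python sketch that
serialises the same base/annotation pairs both ways. The function names
are made up for illustration. The element names are the ones I'd expect
from XHTML Ruby's complex ruby (rbc, rb, rtc, rt) and from HTML ruby
(ruby, rt), and the pairs are the Tokyo example used in this message.

```python
# Illustrative only: row_based and column_based are hypothetical helper
# names, not part of any specification or library.
PAIRS = [("東", "とう"), ("京", "きょう")]


def row_based(pairs):
    """XHTML Ruby style: all the bases first, then all the annotations."""
    bases = "".join(f"<rb>{base}</rb>" for base, _ in pairs)
    texts = "".join(f"<rt>{text}</rt>" for _, text in pairs)
    return f"<ruby><rbc>{bases}</rbc><rtc>{texts}</rtc></ruby>"


def column_based(pairs):
    """HTML ruby style: each base is followed at once by its annotation."""
    columns = "".join(f"{base}<rt>{text}</rt>" for base, text in pairs)
    return f"<ruby>{columns}</ruby>"


print(row_based(PAIRS))
print(column_based(PAIRS))
```

Both serialisations carry the same pairing information; they differ only
in the order in which the pieces appear in the source.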

The column-based model has two flaws:

   1. It doesn't handle inlining gracefully. As an example, the word
      Tokyo is written 東京 in kanji and とうきょう in kana. The base-text
      pairs are 東-とう 京-きょう, and the ruby markup must create those
      associations accordingly. However, when rendered inline, the
      correct rendering is
        東京(とうきょう)
      with the word kept together as one unit, not
        東(とう)京(きょう)

      There are various use cases for inlining:
        * fallback, for implementations that don't support ruby.
        * compacting the layout, because ruby requires higher inter-line
          spacing. (If ruby is rare enough in the document, it's more
          efficient to present it inline, and this has been a desired
          option on phones.)
        * small fonts -- in order to fit above the base text, ruby is
          typically set at about half the size of the base text. If
          the base font size is too small, the ruby can become unreadable,
          especially for older people. Inlined annotations, on the other
          hand, are the same size as the base text.

      The author and the UA should have the choice of proper inlining
      without changing the markup. Doing that with the current markup
      requires special box-reordering support in the layout engine,
      which is doable but not trivial and certainly does not solve the
      fallback use case.
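
The inlining problem in point 1 can be sketched in a few lines of Python.
This assumes a renderer that simply emits text in source order, with the
parentheses added by hand here for illustration; the function names are
hypothetical.

```python
# Illustrative only: models what falls out when ruby is flattened inline
# in source order. Function names are invented for this sketch.
PAIRS = [("東", "とう"), ("京", "きょう")]


def inline_column_order(pairs):
    """Column-based source order: each base drags its annotation along."""
    return "".join(f"{base}({text})" for base, text in pairs)


def inline_row_order(pairs):
    """Row-based source order: the whole base, then the whole annotation."""
    bases = "".join(base for base, _ in pairs)
    texts = "".join(text for _, text in pairs)
    return f"{bases}({texts})"


print(inline_column_order(PAIRS))
print(inline_row_order(PAIRS))
```

Getting the second result out of column-ordered source means reordering
the pieces, which is the box-reordering work mentioned above.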

   2. It doesn't handle spanning gracefully, i.e. the case where there
      are multiple annotations and their boundaries don't line up.
      See http://fantasai.inkedblade.net/weblog/2011/ruby/#double for
      examples.

      Hixie recently added the ability to do two types of double-sided
      ruby to try to address this use case, but used completely different
      markup models: one case would be done with nested <ruby> tags, and
      the other with multiple adjacent <rt> elements. The problem with
      this is that
        * it forces the author to learn (and style) two very different
          markup models for things that are fundamentally the same
        * it forces the UA to implement two very different layout models
          for things that are fundamentally the same

      One complexity of ruby layout that is often overlooked is that
      adjacent ruby on a single line need to negotiate space with each
      other. In the simple case, they are black boxes of a particular
      size: if the annotation text is wider than the base text, the
      inline is treated as having the size of its annotation. But this
      is not always the desired rendering. In many cases it's desired
      for a long annotation to overhang adjacent text *if that text is
      not itself annotated* and there is therefore sufficient room for
      the overhang. So inline layout needs to negotiate space for
      annotations among ruby structures on the same line, across inline
      element boundaries, etc.

      Another, of course, is negotiating line breaks within the ruby,
      between the base text and its annotations.

      So not only does this approach require the author to learn two
      different models, it also requires the layout engine to implement
      two different models and handle their interactions.

      Personally, I don't see why we are insisting on this approach when
      there is a sensible alternative that puts all forms of ruby on the
      same track and allows for whatever extensions we might want from
      now through 2025 to be handled within the same basic architecture.
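
As a rough illustration of the space negotiation described in point 2,
here is a Python sketch comparing the black-box sizing with a version
that lets a long annotation overhang unannotated neighbours. The Run
type, the even split of the excess between both sides, and the limit of
one neighbour's width are all assumptions made for this sketch; real
layout rules are more involved.

```python
from dataclasses import dataclass


@dataclass
class Run:
    base: float        # width of the base text
    annotation: float  # width of the annotation; 0 means unannotated


def black_box_widths(runs):
    """Simple case: each run is as wide as the wider of base and annotation."""
    return [max(r.base, r.annotation) for r in runs]


def overhang_widths(runs):
    """Let an over-wide annotation hang over unannotated neighbours.

    Assumptions (not from any spec): the excess is split evenly between
    the two sides, and an unannotated neighbour can absorb overhang up
    to its own width in total.
    """
    # Room each run offers to its neighbours: annotated runs offer none.
    room = [r.base if r.annotation == 0 else 0.0 for r in runs]
    widths = []
    for i, r in enumerate(runs):
        excess = max(0.0, r.annotation - r.base)
        absorbed = 0.0
        for j in (i - 1, i + 1):
            if 0 <= j < len(runs):
                take = min(excess / 2, room[j])
                room[j] -= take
                absorbed += take
        # Whatever the neighbours cannot absorb widens the run itself.
        widths.append(r.base + excess - absorbed)
    return widths


line = [Run(4, 0), Run(2, 4), Run(4, 0)]
print(black_box_widths(line))
print(overhang_widths(line))
```

Even in this toy form the width of one run depends on its neighbours,
so the ruby boxes cannot be sized in isolation.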

Note, I'm not advocating that the current model for single-sided ruby,
which is implemented in WebKit and Trident already, should be abandoned.
It's fairly easy to incorporate that into a box model that extends it
into a row-based system. I'm saying we shouldn't shoehorn additional
requirements into that model as Hixie has done, dropping some of them
on the floor as necessary, but instead extend in the direction of a
single unified model that satisfies all the requirements.
I think this is less complex and more satisfying than the current
approach.

~fantasai
Received on Friday, 9 November 2012 18:17:04 UTC