Ruby feedback from Ian Hickson on 2008-12-30 (public-html@w3.org from December 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 30 Dec 2008 05:24:21 +0000 (UTC)
To: Brian Smith <brian@briansmith.org>, Justin James <j_james@mindspring.com>, Masataka Yakura <yakura-masataka@mitsue.co.jp>, Jens Meiert <jens@meiert.com>, ddailey <ddailey@zoominternet.net>, w@suika.fam.cx
Cc: 'HTML WG' <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0812300458160.24109@hixie.dreamhostps.com>
On Mon, 26 May 2008, Brian Smith wrote:
> Anne van Kesteren wrote:
> > http://www.whatwg.org/specs/web-apps/current-work/multipage/section-text-level.html#the-ruby
> 
> This is just an incomplete subset of the long-existing Ruby Annotation 
> recommendation. In particular, the complex ruby markup (RB, RBC, and 
> RTC) was left out in the HTML 5 version. At least the spanning mechanism 
> of complex ruby is needed for some common situations in Japanese. I 
> believe the above-and-below (left and right) ruby annotations are useful 
> for Japanese language educators as the bottom annotation can be used for 
> romanized transliteration.

The subset is basically "what IE implemented". I can't really see a good 
way to support the advanced ruby markup in a manner that is still somewhat 
compatible with legacy markup.


> Also, the example in the HTML 5 draft is bad.

I've replaced it with three better examples.


> In particular, it is misleading because it suggests that <rt> elements 
> should be interleaved within the characters of the words they are 
> annotating.

Why is this wrong?


> Let me know if you are in need of real-world Ruby examples for Japanese, 
> as my friend is a Japanese instructor and she uses Ruby annotations on a 
> daily basis in Microsoft Office. I can translate her MS Office examples 
> into proper XHTML for you.

That would be very useful, yes.


> Also note that Chinese uses Ruby markup differently than Japanese, so a 
> Chinese-language example would be a good idea as well.

I've added a couple of Chinese examples, but I don't really see how they 
are different. Are the examples wrong?


On Mon, 26 May 2008, Justin James wrote:
> 
> It would also be good for the write up to explain:
> 
> * What "ruby" *is*

I've added some extra explanatory text.


> * Why someone would use it
> * Where to get more information
>
> I spent 5 minutes looking at this wondering how this applied to the Ruby 
> programming language, and why there was not a <perl>, <c++>, or <vb.net> 
> tag set as well. It was not until I read this message from Brian that I 
> had any idea what this was talking about.

It should be clearer now, and should make searching for more information 
relatively easy.


On Tue, 27 May 2008, Masataka Yakura wrote:
> Brian Smith wrote:
> > Also, the example in the HTML 5 draft is bad. In particular, it is
> > misleading because it suggests that <rt> elements should be interleaved
> > within the characters of the words they are annotating. The proper markup is
> > either:
> > 
> > <ruby>斎<rt><rp>(</rp>さい</rt><rp>)</rp></ruby>
> > <ruby>藤<rt><rp>(</rp>とう</rt><rp>)</rp></ruby>
> > <ruby>信<rt><rp>(</rp>のぶ</rt><rp>)</rp></ruby>
> > <ruby>男<rt><rp>(</rp>お</rt><rp>)</rp></ruby>
> 
> This shouldn't be. Those letters form a name "斎藤信男" ("斎藤" is a
> family name and "信男" is a given name). Marking it up with such multiple
> <ruby>s breaks the name into meaningless letters.
> 
> It will look awful in browsers that do not support <ruby>, or when copying
> and pasting. I'm also afraid that screen readers or voice browsers cannot read
> it out properly.

Are the new examples ok?


On Wed, 28 May 2008, Masataka Yakura wrote:
> 
> If we can define an algorithm which determines the base text, it'll be 
> nice because then we can make <rb> optional, and thus being compatible 
> with the current IE implementation.

<rb> isn't in HTML5 at all at this point.


On Tue, 27 May 2008, Jens Meiert wrote:
>
> CMIIW, but at least the "rb" element [1] seems to be missing for simple 
> Ruby markup (with parentheses [2]).

It's not needed, since the base can just be derived from the contents of 
the <ruby> element. (This is compatible with what IE does -- it basically 
just ignored the <rb> element.)
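
As a rough illustration of deriving the base from the contents of the 
<ruby> element, the Python sketch below walks simple ruby markup and pairs 
each run of base text with the <rt> that follows it, treating <rb> as 
transparent and skipping <rp>. The class and function names are 
hypothetical, and it assumes flat, non-nested markup only; it is not the 
HTML5 parsing algorithm, nor a model of IE's exact behaviour.

```python
from html.parser import HTMLParser


class SimpleRubyExtractor(HTMLParser):
    """Toy extractor of (base, annotation) pairs from simple ruby markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pairs = []       # finished (base, annotation) pairs
        self._base = []       # text seen inside <ruby> but outside <rt>/<rp>
        self._text = []       # text seen inside the current <rt>
        self._in_ruby = False
        self._mode = "base"   # one of "base", "rt", "rp"

    def _flush(self):
        # Emit the pending pair if anything was collected, then reset.
        if self._base or self._text:
            self.pairs.append(("".join(self._base), "".join(self._text)))
        self._base, self._text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._in_ruby, self._mode = True, "base"
        elif not self._in_ruby:
            return
        elif tag in ("rt", "rp"):
            self._mode = tag
        elif tag == "rb":
            # <rb> adds nothing: its contents are base text either way.
            self._mode = "base"

    def handle_endtag(self, tag):
        if tag in ("rt", "rp"):
            self._mode = "base"
        elif tag == "ruby":
            # Also covers an omitted </rt> end tag.
            self._flush()
            self._in_ruby = False

    def handle_data(self, data):
        if not self._in_ruby:
            return
        if self._mode == "rt":
            self._text.append(data)
        elif self._mode == "base":
            if self._text:
                # Base text after an annotation starts a new pair.
                self._flush()
            self._base.append(data)
        # Text inside <rp> is fallback punctuation and is ignored here.


def extract_ruby_pairs(markup):
    parser = SimpleRubyExtractor()
    parser.feed(markup)
    parser.close()
    return parser.pairs
```

Under these assumptions, markup with and without <rb> yields the same 
pairs, which is the sense in which the element is not needed for simple 
ruby.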


> Also, what are the reasons for not including complex Ruby markup [3] as 
> well?

It isn't clear that that level of complexity is needed, and no mainstream 
browser supports it out of the box yet.


On Sat, 5 Jul 2008 w@suika.fam.cx wrote:
> 
> * <ruby> should close a ruby element in scope
> 
> In the current HTML5 parsing algorithm,
> 
>   <ruby id=x><ruby id=y>
> 
> ... will result in a DOM tree where #y is a child of #x.  However, for 
> the compatibility with WinIE, <ruby> should close any ruby elements in 
> scope.  The example above should create a DOM tree where #y is a sibling 
> of #x. Without this behavior, a Web page will not be rendered correctly.

I studied IE's parsing behavior here quite closely, as well as doing some 
pretty detailed studies of existing markup on the Web, to get the current 
rules. Getting actual compatibility with IE's exact behavior is impossible 
while maintaining a tree DOM. I think what we have now is the closest to 
what IE does that makes sense and doesn't preclude features like nested 
<ruby> annotations.
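
To make the two outcomes in the quoted example concrete, here is a toy 
Python sketch that builds a tree from <ruby id=...> start tags under either 
rule. It is deliberately minimal and makes several assumptions: it looks 
only at ruby start tags, ignores end tags and every other element, and the 
function names are hypothetical. It is not the HTML5 tree construction 
algorithm.

```python
import re


def build_ruby_tree(markup, close_ruby_in_scope):
    """Toy tree builder that models only <ruby id=...> start tags."""
    root = {"id": None, "children": []}
    stack = [root]  # stack of open elements; root stands in for the document
    for ruby_id in re.findall(r"<ruby id=(\w+)>", markup):
        if close_ruby_in_scope:
            # Proposed rule: a <ruby> start tag closes any open ruby element.
            # Only ruby elements are ever pushed here, so pop back to root.
            while len(stack) > 1:
                stack.pop()
        node = {"id": ruby_id, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def outline(node):
    """Return the tree as nested (id, children) tuples for easy comparison."""
    return [(child["id"], outline(child)) for child in node["children"]]


nested = outline(build_ruby_tree("<ruby id=x><ruby id=y>", False))
siblings = outline(build_ruby_tree("<ruby id=x><ruby id=y>", True))
```

Here `nested` has #y as a child of #x, and `siblings` has #x and #y side by 
side. Only the first shape leaves room for nested <ruby> annotations.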


> * ruby should not be allowed in ruby base
> 
> In the current spec, a ruby element may be inserted in the "ruby base" 
> part of another ruby element.  Such constructs are meaningless and 
> incompatible with the proposed parsing rule fix mentioned above.  The 
> content model for the ruby element should disallow ruby descendants in 
> the ruby base part.
> 
> I don't know whether ruby in rt should be disallowed as well.  ruby in 
> ruby text is possible but very, very rare.

I don't really follow why it should be disallowed. I agree that it isn't 
especially useful currently.


> * <rt> should not close span element
> 
> In the current HTML5 parsing algorithm,
> 
>   <ruby>xxx<span><rt>
> 
> ... will be parsed as a ruby element node whose children are an "xxx" text 
> node and a span element node followed by an rt element node (with a parse 
> error for the missing </span>).  However, it will break Web pages 
> generated by Microsoft Excel.  Microsoft Excel, when a user wants to 
> export his spreadsheet as an HTML document, generates an HTML fragment 
> like:
> 
>   <ruby>visible (possibly Kanji) text<span style='display: none'><rt>hidden
>   input (Kana) text</rt></span></ruby>
> 
> ... if the user has configured it not to render ruby texts.  (At least in 
> the Japanese version, Microsoft Excel saves pronunciations of the user input 
> texts in XLS and HTML files even when the user has not chosen to show 
> them as ruby texts.)  With the current HTML5 parser, the span element is 
> closed before the <rt> start tag and therefore ruby texts are not hidden 
> by the "display: none" declaration on the span element.

The problem is that if we _don't_ close the <span> then we end up making 
the rendering of <ruby> _dramatically_ more complicated.


> Examples of Excel-generated HTML documents with hidden ruby texts
> in the wild:
>   <http://sendai.cool.ne.jp/miyagiswim_jhs/H18sendai_result.htm>
>   <http://www.excel-hp.com/>
>   <http://ojt0001.fc2web.com/kimkim/excell/web1.htm>

Luckily it seems that Excel also includes:

   rt { display: none }

...in the style sheet, so this is not a critical problem.


> Another example is:
>   <http://www.yasalambellydance.com/2007/05/schedule.html>

I agree that this page would be affected.


> * note on rt rendering
> 
> Since there are a number of documents with ruby but without rp, when you 
> write the rendering section, please include advice for user agents 
> that do not support "correct" ruby rendering to render something like 
> "(" and ")" before and after ruby text (using, e.g., CSS ::before and 
> ::after) even when there are no rp elements. Otherwise, reading Web 
> pages without rp is very annoying.

Noted.
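
The request above is phrased in terms of CSS ::before and ::after. As a 
hedged sketch of the same effect, the Python below flattens simple ruby 
markup to inline text the way a user agent without ruby layout might, 
adding "(" and ")" around the ruby text only when the author supplied no 
<rp> elements. The names are hypothetical, and it assumes flat markup in 
which any <rp> precedes the <rt>.

```python
from html.parser import HTMLParser


class RubyFallback(HTMLParser):
    """Toy inline fallback for simple ruby markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._saw_rp = False  # did the current <ruby> provide its own <rp>?
        self._in_rt = False

    def _close_rt(self):
        if self._in_rt:
            self._in_rt = False
            if not self._saw_rp:
                self.out.append(")")  # stand-in for rt::after

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._saw_rp = False
        elif tag == "rp":
            self._saw_rp = True
        elif tag == "rt":
            self._close_rt()  # a new <rt> implies the end of the previous one
            self._in_rt = True
            if not self._saw_rp:
                self.out.append("(")  # stand-in for rt::before

    def handle_endtag(self, tag):
        if tag in ("rt", "ruby"):
            # </ruby> also closes an <rt> whose end tag was omitted.
            self._close_rt()

    def handle_data(self, data):
        # All text is kept, including author-supplied <rp> contents.
        self.out.append(data)


def flatten_ruby(markup):
    parser = RubyFallback()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

With this sketch, markup with and without <rp> flattens to the same 
readable text, which is the behaviour being asked for.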


> * feature request: secondary ruby text
> 
> SUMMARY of this section: Please add features to:
>   - mark up both primary and secondary ruby texts, and
>   - mark up only the secondary ruby text.
> 
> The current spec lacks a feature to associate two ruby texts with a ruby 
> base. In the W3C Ruby Recommendation, that feature is supported as part 
> of the complex ruby markup feature.  However, for most use cases of two 
> (both-side) ruby texts (listed later), the complex ruby markup is too 
> complex.
> 
> I wonder if the current HTML5 ruby syntax can be simply extended such
> that it can associate two ruby texts with a ruby base, as:
> 
>   <ruby>ruby base<rp> (</rp><rt>first ruby text</rt><rp> /
>   </rp><rt>second ruby text</rt><rp>) </rp></ruby>
> 
>   NOTE: I use this proposed syntax for the examples in this section.
> 
> This extension is compatible with the four browsers in the sense that
> with only a chunk of CSS2 styling rules (e.g. position: relative) it is
> possible to render two-ruby texts approximately.  Authors can use
> the technique until browsers implement CSS3 ruby properties with
> this markup.  See [2] for demo.
> 
>   [2] <http://suika.fam.cx/~wakaba/-temp/test/html/ruby/styling/relative/above-below-2.html>
> 
> Use cases for two (both-side) ruby texts include:
> 
>   a) Alternative pronunciation.  A Kanji word sometimes has two
>      possible pronunciations and an author might want to show them
>      above and below the base text.  Example [3]:
> 
>        <ruby>[SADA/TEI] [IE/KA]<rt>sadaie</rt><rt>teika</rt></ruby>
> 
>      where "[XXX]" is a Kanji character and ruby texts are two pronunciations
>      of the Kanji word (a human name).
> 
>      [3] <http://www.pref.wakayama.lg.jp/prefg/500200/19shakai_ken_mondai.pdf>
>          Page 3, right-hand side of line 7.
> 
>   b) Foreign language representation.  In a technical document, the author might
>      want to show both the Japanese pronunciation and the original English term
>      of a Japanese technical term.  Example [4]:
> 
>        <ruby>[SEI] [KEI] [SHIKI]<rt>seikeishiki</rt><rt lang=en>well-formed</rt></ruby>
> 
>      where "[XXX]" is a Kanji character and "seikeishiki" is the
>      pronunciation of the ruby base.  The ruby base and the primary ruby
>      text represent the Japanese translation of the term "well-formed".
> 
>      [4] <http://suika.fam.cx/gate/2005/sw/%E6%95%B4%E5%BD%A2%E5%BC%8F>
>        (Disclaimer: part of our wiki)
> 
>   c) Short annotation for e.g. years of birth and death of a person
>      described by the ruby base (and the primary ruby text).  This is a
>      common style in history books.  Example:
> 
>        <ruby>[TOKU] [GAWA] [IE] [YASU]<rt>tokugawa ieyasu</rt><rt>1543-1616</rt></ruby>
> 
>      where "tokugawa ieyasu" is the pronunciation of the ruby base (a
>      human name) and the secondary ruby represents that he was born in
>      1543 and died in 1616.
> 
>   d) The pronunciation for the second-time reading in Kambun annotation system.
>       In the "kundoku"*1 annotation system used for Kambun (classic Chinese)
>       text, when a character is read out two times ("saidoku moji" =
>       re-reading character), the pronunciation for the first time is encoded
>       as the primary (right-side in vertical text) ruby text and the
>       pronunciation for the second time is encoded as the secondary
>       (left-side) ruby text.  See [5] for examples.
> 
>       *1 With the "kundoku" annotations, classic Chinese texts can be magically
>          understood as classic Japanese texts.
>       [5] <http://www.daito.ac.jp/~oukodou/kuzukago/kundoku.html#6>
>           In this document, ruby texts are marked up by <font size></font>.
>           Where two ruby texts should be associated with a ruby base,
>           a KATAKANA MIDDLE DOT character in <font size></font> is used
>           to separate primary and secondary ruby texts.
> 
> For any of these use cases, a simple extension to the simple ruby markup
> is enough and the complex ruby markup is not necessary in most situations,
> in my humble opinion.  (I don't think the complex ruby markup (rbspan=""
> feature, in particular) is useless, but that is another story.)
> 
> In use cases b) and c), it is sometimes desired to show only the secondary
> ruby text since the pronunciation is trivial.  For example, a variation of
> case b) [6] would be:
> 
>   <ruby>[HAN] [PUKU]<rt>hanpuku</rt><rt>repetition</rt></ruby
>   ><ruby>burokku<rt>block</rt></ruby>
> 
> This example represents a WF2 term "repetition block".  In Japanese,
> "repetition" is represented by a Kanji word "[HAN] [PUKU]" associated
> with Kana reading of "hanpuku".  The word "block" is represented by a
> Kana word "burokku" with no additional pronunciation information.  The
> typical (and desired) rendering of this fragment would be:
> 
> ruby text 1:      hanpuku
> ruby base  :   [HAN] [PUKU]    burokku
> ruby text 2:    repetition      block
> 
> ... and "block" should not be rendered above the ruby base.  Though this
> can be styled by the CSS3 ruby-position property, this is not an entirely
> presentational matter --- secondary ruby text has different semantics from
> the primary ruby text and it should be identified at the markup level.
> class="" is inappropriate since it cannot convey any semantics reliably.
> 
>     [6] <http://suika.fam.cx/gate/2005/sw/%E5%8F%8D%E5%BE%A9%E3%83%96%E3%83%AD%E3%83%83%E3%82%AF>
>        (Disclaimer: part of our wiki, again)
> 
> My proposals for the secondary-only ruby text markup are:
> 
>   A. Introducing a new boolean attribute: <ruby>base<rt secondary>text</ruby>
>       (With recommended UA style sheet fragment:
>       rt[secondary] { ruby-position: after }.)
>   B. Allowing empty rt element for this purpose: <ruby>base<rt><rt>text</ruby>
> 
> # I don't know if "secondary" is an appropriate word when there is no
> # "primary" ruby text...
> 
> Although it would be possible to add type="" attribute to rt element so that
> the semantics of ruby texts can be clearly identified (e.g. type=pronunciation,
> type=translation, and so on), I think it is overengineered and a simple
> solution like A. or B. is desired.

I haven't added this feature, partly because I'm not convinced it is that 
important, but mostly because no browser supports it, and most browsers 
don't even support simple <ruby> yet. I think it is something we should 
consider once <ruby> is more widely implemented.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 30 December 2008 05:25:03 UTC