[Bug 17967] New: Parsing algorithm should not preclude Complex Ruby

https://www.w3.org/Bugs/Public/show_bug.cgi?id=17967

           Summary: Parsing algorithm should not preclude Complex Ruby
           Product: HTML WG
           Version: unspecified
          Platform: Other
               URL: http://fantasai.inkedblade.net/weblog/2011/ruby/
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P3
         Component: other Hixie drafts (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: contributor@whatwg.org
         QAContact: contributor@whatwg.org
                CC: ian@hixie.ch, bzbarsky@mit.edu,
                    rubys@intertwingly.net, fantasai.bugs@inkedblade.net,
                    mike@w3.org, annevk@annevk.nl, public-i18n-cjk@w3.org,
                    kennyluck@w3.org, kojiishi@gluesoft.co.jp,
                    eoconnor@apple.com


This was cloned from bug 13113 as part of operation convergence.
Originally filed: 2011-07-01 13:33:00 +0000
Original reporter: Henri Sivonen <hsivonen@iki.fi>

================================================================================
 #0   Henri Sivonen                                   2011-07-01 13:33:34 +0000 
--------------------------------------------------------------------------------
Continuing from bug 12935.

I have spent some more time implementing variations and experimenting
with them. I'm now ready to request specific edits.

Please make the following spec edits:

 1) Please add rb, rbc and rtc to the list of elements that get closed
by "generate implied end tags" at
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#generate-implied-end-tags

 2) Please replace the "in body" entry for 'A start tag whose tag name
is one of: "rp", "rt"' with these three entries:

A start tag whose tag name is one of: "rbc", "rtc"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags.

    Insert an HTML element for the token.

A start tag whose tag name is one of: "rb"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags, except for elements with the name "rbc".

    Insert an HTML element for the token.

A start tag whose tag name is one of: "rt", "rp"

    If the stack of open elements has a ruby element in scope, then
generate implied end tags, except for elements with the name "rtc".

    Insert an HTML element for the token.


Note that the "If the stack of open elements has a ruby element in
scope, then" parts are just copying the current spec text. I don't see
the value of that bit and would be OK with omitting the scope check.
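
For illustration only, the three proposed entries can be sketched as a small
Python model. This is not spec text or any implementation's code. It assumes
the stack of open elements is a plain list of tag names, it stands in for
"has a ruby element in scope" with a simple membership test, and the list of
elements already closed by "generate implied end tags" is given from memory.
The names ruby_start_tag and EXCEPT_FOR are made up for this sketch.

```python
# Elements popped by "generate implied end tags". The first row is the
# pre-existing list (an assumption, quoted from memory); the second row is
# the addition requested in edit 1.
IMPLIED_END_TAG_ELEMENTS = {
    "dd", "dt", "li", "option", "optgroup", "p", "rp", "rt",
    "rb", "rbc", "rtc",
}

# Per-tag exception from edit 2: rb stops at an open rbc, rt/rp stop at an
# open rtc, while rbc/rtc have no exception.
EXCEPT_FOR = {"rbc": None, "rtc": None, "rb": "rbc", "rt": "rtc", "rp": "rtc"}


def generate_implied_end_tags(stack, except_for=None):
    """Pop the current node while it is implied-closable and not excepted."""
    while (stack
           and stack[-1] in IMPLIED_END_TAG_ELEMENTS
           and stack[-1] != except_for):
        stack.pop()


def ruby_start_tag(stack, name):
    """Handle a start tag for rb, rbc, rt, rtc or rp as proposed above."""
    # Simplification: real scope checks also stop at scoping elements.
    if "ruby" in stack:
        generate_implied_end_tags(stack, EXCEPT_FOR[name])
    stack.append(name)  # stands in for "insert an HTML element"
    return stack


# Walk through <ruby><rbc><rb><rb><rtc><rt><rt><rp> and record the stack.
stack = ["html", "body", "ruby"]
snapshots = [list(ruby_start_tag(stack, tag))[2:]
             for tag in ["rbc", "rb", "rb", "rtc", "rt", "rt", "rp"]]
```

In this model a second rb closes the first rb but stays inside rbc, an rtc
closes both the open rb and rbc, and rt and rp end up as children of rtc.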

Rationale:

We shouldn't paint ourselves into a corner with the parsing algorithm such that
Complex Ruby can't be introduced in the future without causing ungraceful
behavior in browsers implementing an earlier snapshot of the parsing spec.

The changes proposed above assume a design where rp goes as a child of rtc (if
rp is used at all) in the Complex Ruby case. This allows UAs that implement
Simple Ruby to be forward-compatible by having
   rp { display: none; }
   rtc > rt { display: inline; }
   rtc > rp { display: inline; }
in the UA style sheet while UAs supporting both Simple and Complex Ruby would
have
   rp { display: none; }
in the UA style sheet without the two other rules.
================================================================================
 #1   Ian 'Hixie' Hickson                             2011-07-01 22:20:54 +0000 
--------------------------------------------------------------------------------
This only makes sense if we think complex ruby makes sense. If it does not,
then we should design the parser to be the best thing ignoring complex ruby.

I'm not at all convinced that the use cases for complex ruby are compelling.
Sure, as with anything, there are use cases that need finer-grained semantics
than HTML can provide. But we're not designing DocBook here, the rare use cases
are _by design_ not handled. We don't have a way to semantically mark up
Scandinavian arroword crosswords (or indeed even simpler "regular" crosswords),
and that's ok. We don't have a way to mark up bibliographic entries in a manner
sufficiently semantic-rich to work as well as BibTeX, and that's ok.

Note that I'm not arguing here that we shouldn't add this _yet_; that it might
make sense one day but not today. I'm arguing that it will never make sense for
HTML to support complex ruby, because the use cases of complex ruby are too
obscure to deserve being supported as first-class primitives in HTML.

Am I wrong?

If I _am_ wrong, what other features might we one day add that we should
support in the parser today? Crosswords in particular might need particularly
painful changes to the table parsing model; should we add new elements to table
parsing rules to support potential future extensions there?
================================================================================
 #2   fantasai                                        2011-07-05 22:13:53 +0000 
--------------------------------------------------------------------------------
Yes, I think you are wrong.

  * *Most* ruby in Japanese should be marked up with Level 2 markup. This
    isn't a rare use case by any means.

  * Pretty much all of the semantics of BibTeX and DocBook can be captured
    extending HTML elements with a microformat. That's not the case for the
    structures of complex ruby.

  * Crossword puzzles and sudoku are handled better by table markup than most
    other games are handled by HTML markup, and for how common they are
compared
    to other use cases for HTML, strike me as adequately supported by HTML
    already.  But if you think it's insufficient, file a separate bug.

  * You don't know how the needs of HTML will evolve over time. All you can
    anticipate is what's appropriate for it to include right now. I'm sure that
    12 years ago, many of the things included in HTML5 would be considered
scope
    creep from what was supposed to be a simple document markup language.

The top complaint the CSSWG got from the publishing industry in Japan, btw, was
that the way ruby influences the line height is wrong. I think that's a fair
indicator that correct support for ruby is important to them as they move more
of their content to HTML.
================================================================================
 #3   fantasai                                        2011-07-05 22:21:40 +0000 
--------------------------------------------------------------------------------
[Sorry for the broken wrapping. I didn't realize the text box was bigger than
the line limit.]
================================================================================
 #4   Ian 'Hixie' Hickson                             2011-07-08 23:26:15 +0000 
--------------------------------------------------------------------------------
>   * *Most* ruby in Japanese should be marked up with Level 2 markup. This
>     isn't a rare use case by any means.

What data do we have on this? I find this hard to believe. All the examples
I've seen of complex ruby have seemed rather contrived.
================================================================================
 #5   fantasai                                        2011-07-18 02:56:53 +0000 
--------------------------------------------------------------------------------
> What data do we have on this?

The fact that most Japanese words are compound words with that structure? Level
1 markup only applies to
  a) single-kanji words--which are reasonably common but not overwhelmingly so
  b) multi-kanji words whose pronunciation cannot be broken down (which are
     very noticeably a small minority of such words)

> All the examples I've seen of complex ruby have seemed rather contrived.

The term "complex ruby" covers a lot of things and is imho a misleading
distinction. Let's use instead the levels I outlined in my writeup. To which
level(s) do the contrived examples you have seen belong and why do they seem
contrived?
================================================================================
 #6   Sam Ruby                                        2011-08-02 22:13:07 +0000 
--------------------------------------------------------------------------------
This bug was marked as P1 over 30 days ago, and still hasn't been RESOLVED.

Editor (and editor assistants): please RESOLVE it ASAP.  NEEDSINFO and WONTFIX
are valid resolutions for this part of the process.  We simply want to get this
bug to a state where we are prepared to accept change proposals should anybody
be inclined to produce such.
================================================================================
 #7   Anne                                            2011-08-03 05:41:51 +0000 
--------------------------------------------------------------------------------
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
satisfied with this response, please change the state of this bug to CLOSED. If
you have additional information and would like the editor to reconsider, please
reopen this bug. If you would like to escalate the issue to the full HTML
Working Group, please add the TrackerRequest keyword to this bug, and suggest
title and text for the tracker issue; or you may create a tracker issue
yourself, if you are able to do so. For more details, see this document:
<http://dev.w3.org/html5/decision-policy/decision-policy.html>.

Status: Rejected
Change Description: no spec change
Rationale: Resolving as WONTFIX to address comment 6 and because comment 5 does
not really have any convincing data.
================================================================================
 #8   Boris Zbarsky                                   2011-08-03 14:57:10 +0000 
--------------------------------------------------------------------------------
Reopening.  What data did you expect exactly?  A list of Japanese words or
phrases that can't be usefully marked up without <rb>?
================================================================================
 #9   Anne                                            2011-08-03 15:00:43 +0000 
--------------------------------------------------------------------------------
<rb> is implicit in the current model and has nothing to do with complex ruby.
================================================================================
 #10  Boris Zbarsky                                   2011-08-03 15:50:41 +0000 
--------------------------------------------------------------------------------
Making it implicit makes fallback and inlining not work.  Did you even read the
document linked to in the url field?  Did you read the second part of comment
5?
================================================================================
 #11  Anne                                            2011-08-03 16:07:46 +0000 
--------------------------------------------------------------------------------
My bad, last time I spoke to Japanese developers what IE had was sufficient and
I assumed nothing much had changed. Having said that, I'm not sure allowing UAs
to not support ruby markup makes sense and sort of wonder how often one would
use ruby markup to then have it inlined.
================================================================================
 #13  Ian 'Hixie' Hickson                             2011-08-04 06:57:21 +0000 
--------------------------------------------------------------------------------
Status: Did Not Understand Request
Change Description: no spec change
Rationale: I still haven't seen data on this. Make a random selection of books,
magazines, web pages, or whatever, and tabulate how many of each kind of ruby
these texts have, ideally with examples of each for my own education. It's
possible that what's in the spec is insufficient, but I am highly skeptical
that the level of complexity being proposed is necessary to solve real-world
use cases.
================================================================================
 #14  Ian 'Hixie' Hickson                             2011-08-06 03:33:49 +0000 
--------------------------------------------------------------------------------
*** Bug 10830 has been marked as a duplicate of this bug. ***
================================================================================
 #15  fantasai                                        2011-10-07 23:53:22 +0000 
--------------------------------------------------------------------------------
Since I neither have access to a Japanese library, nor the time and patience
necessary to tabulate the kind of data set you're requesting, you're getting
the next best thing: scans from a magazine lent me by someone I randomly met on
the BART. The magazine is Mangajin issue 53, published March 1995, and the
tagline is "Japanese Pop Culture & Language Learning". Here are two
representative pages and diagrammed extracts from them.

Several articles use furigana over the kanji. Example:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-54
They are formatted using jukugo ruby. (Jukugo ruby formats like a word-to-word
association, but line-breaks differently: the associated kana must be kept with
their kanji base.) This colorized extract shows the association of kana to
kanji:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-jukugo-ruby
The ratio of compound words to simple words is 2:1. The rest of the page holds
close to this ratio.

Other parts of the magazine use double-annotated ruby. Example:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-35
Notice the line-breaking behavior and the word associations.
Here is a diagrammed excerpt. The ruby base is in red. The first annotation
(romaji) is in blue. The second annotation (English transliteration) is green:
http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-double-annotation

Here is real-world use of complex ruby. You can of course continue to argue
that the use case is unimportant, but it exists.
================================================================================
 #16  Ian 'Hixie' Hickson                             2011-10-10 19:44:22 +0000 
--------------------------------------------------------------------------------
The first one does not seem to require anything the spec doesn't already
provide.

The second is consistent with what I wrote in comment 1. I make no argument
that there are no use cases. My argument is that the use cases are obscure. The
second example here is not common text, it's a very specialised case where the
language itself is being taught.

There are lots of examples of how we don't currently support that kind of
thing. For example, we don't have markup for grammar annotation (no <verb>,
<subject>, <adverbial-clause> elements) which would be very useful for people
teaching French or English. We don't have anything for marking up family trees
or molecular structures, even though that means HTML is deficient for
supporting those use cases (I get at least one person who asks me whether we
can add markup for genealogy every few months, because right now they're stuck
with using bitmaps or abusing tables to convey their data, and that sucks). I
gave other examples in comment 1.
================================================================================
 #17  Ian 'Hixie' Hickson                             2011-11-02 19:30:21 +0000 
--------------------------------------------------------------------------------
I spoke with the i18n group about this yesterday, and it seems that we don't
really need to add any elements to handle the important use cases here.

Multiple annotations can be handled pretty easily if we just define that nested
ruby is semantically equivalent to two annotations; picking which side the
annotations appear on is a stylistic issue for CSS. Monoruby and group ruby are
both handled already; the only difference is how much is put in the ruby
base before the annotation. Jukugo is a stylistic variant of group ruby, again
to be handled in CSS.

Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem
to be a big deal. It's time for implementations to just implement ruby. AT
fallback is not impossible in any of these cases and is unaffected by how we
mark it up.

The last remaining case is what to do with multiple annotation if there is
word-pairing for each component. Not supporting this doesn't seem like a big
deal, but if we do want to support it, we could do it with multiple <rt>s for
each ruby base.

In conclusion, the spec should be changed to limit ruby nesting to two levels,
defining the outer level as a phrase-level annotation; and we should consider
supporting multiple <rt>s per base, if there is reason to believe that multiple
monoruby annotations at the ends of lines are common.
================================================================================
 #18  Michael[tm] Smith                               2011-11-20 17:26:17 +0000 
--------------------------------------------------------------------------------
Henri, any response to comment #17 from Hixie?
================================================================================
 #19  Henri Sivonen                                   2011-11-21 07:36:25 +0000 
--------------------------------------------------------------------------------
I'm not competent to disagree with the i18n group on this topic. fantasai, bz?
================================================================================
 #20  fantasai                                        2011-11-29 00:49:46 +0000 
--------------------------------------------------------------------------------
> Jukugo is a stylistic variant of group ruby,

This is not true.

> again to be handled in CSS.

CSS can handle jukugo vs. mono rendering at the stylistic level *iff* both the
pairing and the word-boundary information is recorded in the HTML. Group ruby
doesn't record any sub-word pairing information, because there isn't any, so
you'll have to explain better what you mean by this sentence.

> defining the outer level as a phrase-level annotation

This doesn't make sense for, e.g. double annotating kanji with both kana and
romaji.

>  we should consider supporting multiple <rt>s per base, if there is reason to
> believe that multiple monoruby annotations at the ends of lines are common.

I have no idea what this is referring to.
================================================================================
 #21  Koji Ishii                                      2011-12-04 05:57:25 +0000 
--------------------------------------------------------------------------------
> Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem
> to be a big deal. It's time for implementations to just implement ruby. AT
> fallback is not impossible in any of these cases and is unaffected by how we
> mark it up.

Fallback isn't only for browsers without ruby support. One UA vendor I know
considered using fallback when ruby is too small to read, but gave up due to
the text quality issue fantasai pointed out.
================================================================================
 #22  fantasai                                        2012-02-20 17:18:42 +0000 
--------------------------------------------------------------------------------
http://lists.w3.org/Archives/Public/public-i18n-cjk/2012JanMar/0063.html
================================================================================


Received on Wednesday, 18 July 2012 07:25:13 UTC