- From: <bugzilla@jessica.w3.org>
- Date: Wed, 18 Jul 2012 07:25:06 +0000
- To: public-i18n-cjk@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=17967

Summary: Parsing algorithm should not preclude Complex Ruby
Product: HTML WG
Version: unspecified
Platform: Other
URL: http://fantasai.inkedblade.net/weblog/2011/ruby/
OS/Version: other
Status: NEW
Severity: normal
Priority: P3
Component: other Hixie drafts (editor: Ian Hickson)
AssignedTo: ian@hixie.ch
ReportedBy: contributor@whatwg.org
QAContact: contributor@whatwg.org
CC: ian@hixie.ch, bzbarsky@mit.edu, rubys@intertwingly.net, fantasai.bugs@inkedblade.net, mike@w3.org, annevk@annevk.nl, public-i18n-cjk@w3.org, kennyluck@w3.org, kojiishi@gluesoft.co.jp, eoconnor@apple.com

This was cloned from bug 13113 as part of operation convergence.
Originally filed: 2011-07-01 13:33:00 +0000
Original reporter: Henri Sivonen <hsivonen@iki.fi>

================================================================================
#0 Henri Sivonen 2011-07-01 13:33:34 +0000
--------------------------------------------------------------------------------
Continuing from bug 12935.

I have spent some more time implementing variations and experimenting with them. I'm now ready to request specific edits.

Please make the following spec edits:

1) Please add rb, rbc and rtc to the list of elements that get closed by "generate implied end tags" at http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#generate-implied-end-tags

2) Please replace the "in body" entry for 'A start tag whose tag name is one of: "rp", "rt"' with these three entries:

A start tag whose tag name is one of: "rbc", "rtc"
  If the stack of open elements has a ruby element in scope, then generate implied end tags.
  Insert an HTML element for the token.

A start tag whose tag name is one of: "rb"
  If the stack of open elements has a ruby element in scope, then generate implied end tags, except for elements with the name "rbc".
  Insert an HTML element for the token.

A start tag whose tag name is one of: "rt", "rp"
  If the stack of open elements has a ruby element in scope, then generate implied end tags, except for elements with the name "rtc".
  Insert an HTML element for the token.

Note that the "If the stack of open elements has a ruby element in scope, then" parts are just copying the current spec text. I don't see the value of that bit and would be OK with omitting the scope check.

Rationale: We shouldn't paint ourselves into a corner with the parsing algorithm such that Complex Ruby can't be introduced in the future without causing ungraceful behavior in browsers implementing an earlier snapshot of the parsing spec.

The changes proposed above assume a design where rp goes as a child of rtc (if rp is used at all) in the Complex Ruby case. This allows UAs that implement Simple Ruby to be forward-compatible by having

  rp { display: none; }
  rtc > rt { display: inline; }
  rtc > rp { display: inline; }

in the UA style sheet while UAs supporting both Simple and Complex Ruby would have

  rp { display: none; }

in the UA style sheet without the two other rules.

================================================================================
#1 Ian 'Hixie' Hickson 2011-07-01 22:20:54 +0000
--------------------------------------------------------------------------------
This only makes sense if we think complex ruby makes sense. If it does not, then we should design the parser to be the best thing ignoring complex ruby.

I'm not at all convinced that the use cases for complex ruby are compelling. Sure, as with anything, there are use cases that need finer-grained semantics than HTML can provide. But we're not designing DocBook here, the rare use cases are _by design_ not handled. We don't have a way to semantically mark up Scandinavian arroword crosswords (or indeed even simpler "regular" crosswords), and that's ok. We don't have a way to mark up bibliographic entries in a manner sufficiently semantic-rich to work as well as BibTeX, and that's ok.

Note that I'm not arguing here that we shouldn't add this _yet_; that it might make sense one day but not today. I'm arguing that it will never make sense for HTML to support complex ruby, because the use cases of complex ruby are too obscure to deserve being supported as first-class primitives in HTML.

Am I wrong? If I _am_ wrong, what other features might we one day add that we should support in the parser today? Crosswords in particular might need particularly painful changes to the table parsing model; should we add new elements to table parsing rules to support potential future extensions there?

================================================================================
#2 fantasai 2011-07-05 22:13:53 +0000
--------------------------------------------------------------------------------
Yes, I think you are wrong.

* *Most* ruby in Japanese should be marked up with Level 2 markup. This isn't a rare use case by any means.

* Pretty much all of the semantics of BibTeX and DocBook can be captured extending HTML elements with a microformat. That's not the case for the structures of complex ruby.

* Crossword puzzles and sudoku are handled better by table markup than most other games are handled by HTML markup, and for how common they are compared to other use cases for HTML, strike me as adequately supported by HTML already. But if you think it's insufficient, file a separate bug.

* You don't know how the needs of HTML will evolve over time. All you can anticipate is what's appropriate for it to include right now. I'm sure that 12 years ago, many of the things included in HTML5 would be considered scope creep from what was supposed to be a simple document markup language.

The top complaint the CSSWG got from the publishing industry in Japan, btw, was that the way ruby influences the line height is wrong. I think that's a fair indicator that correct support for ruby is important to them as they move more of their content to HTML.

================================================================================
#3 fantasai 2011-07-05 22:21:40 +0000
--------------------------------------------------------------------------------
[Sorry for the broken wrapping. I didn't realize the text box was bigger than the line limit.]

================================================================================
#4 Ian 'Hixie' Hickson 2011-07-08 23:26:15 +0000
--------------------------------------------------------------------------------
> * *Most* ruby in Japanese should be marked up with Level 2 markup. This
> isn't a rare use case by any means.

What data do we have on this? I find this hard to believe. All the examples I've seen of complex ruby have seemed rather contrived.

================================================================================
#5 fantasai 2011-07-18 02:56:53 +0000
--------------------------------------------------------------------------------
> What data do we have on this?

The fact that most Japanese words are compound words with that structure? Level 1 markup only applies to
  a) single-kanji words--which are reasonably common but not overwhelmingly so
  b) multi-kanji words whose pronunciation cannot be broken down (which are very noticeably a small minority of such words)

> All the examples I've seen of complex ruby have seemed rather contrived.

The term "complex ruby" covers a lot of things and is imho a misleading distinction. Let's use instead the levels I outlined in my writeup. To which level(s) do the contrived examples you have seen belong and why do they seem contrived?

================================================================================
#6 Sam Ruby 2011-08-02 22:13:07 +0000
--------------------------------------------------------------------------------
This bug was marked as P1 over 30 days ago, and still hasn't been RESOLVED. Editor (and editor assistants): please RESOLVE it ASAP.
NEEDSINFO and WONTFIX are valid resolutions for this part of the process. We simply want to get this bug to a state where we are prepared to accept change proposals should anybody be inclined to produce such.

================================================================================
#7 Anne 2011-08-03 05:41:51 +0000
--------------------------------------------------------------------------------
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: <http://dev.w3.org/html5/decision-policy/decision-policy.html>.

Status: Rejected
Change Description: no spec change
Rationale: Resolving as WONTFIX to address comment 6 and because comment 5 does not really have any convincing data.

================================================================================
#8 Boris Zbarsky 2011-08-03 14:57:10 +0000
--------------------------------------------------------------------------------
Reopening. What data did you expect exactly? A list of Japanese words or phrases that can't be usefully marked up without <rb>?

================================================================================
#9 Anne 2011-08-03 15:00:43 +0000
--------------------------------------------------------------------------------
<rb> is implicit in the current model and has nothing to do with complex ruby.

================================================================================
#10 Boris Zbarsky 2011-08-03 15:50:41 +0000
--------------------------------------------------------------------------------
Making it implicit makes fallback and inlining not work. Did you even read the document linked to in the url field? Did you read the second part of comment 5?

================================================================================
#11 Anne 2011-08-03 16:07:46 +0000
--------------------------------------------------------------------------------
My bad, last time I spoke to Japanese developers what IE had was sufficient and I assumed nothing much had changed. Having said that, I'm not sure allowing UAs to not support ruby markup makes sense and sort of wonder how often one would use ruby markup to then have it inlined.

================================================================================
#13 Ian 'Hixie' Hickson 2011-08-04 06:57:21 +0000
--------------------------------------------------------------------------------
Status: Did Not Understand Request
Change Description: no spec change
Rationale: I still haven't seen data on this. Make a random selection of books, magazines, web pages, or whatever, and tabulate how many of each kind of ruby these texts have, ideally with examples of each for my own education. It's possible that what's in the spec is insufficient, but I am highly skeptical that the level of complexity being proposed is necessary to solve real-world use cases.

================================================================================
#14 Ian 'Hixie' Hickson 2011-08-06 03:33:49 +0000
--------------------------------------------------------------------------------
*** Bug 10830 has been marked as a duplicate of this bug.
***

================================================================================
#15 fantasai 2011-10-07 23:53:22 +0000
--------------------------------------------------------------------------------
Since I neither have access to a Japanese library, nor the time and patience necessary to tabulate the kind of data set you're requesting, you're getting the next best thing: scans from a magazine lent me by someone I randomly met on the BART.

The magazine is Mangajin issue 53, published March 1995, and the tagline is "Japanese Pop Culture & Language Learning". Here are two representative pages and diagrammed extracts from them.

Several articles use furigana over the kanji. Example:
  http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-54
They are formatted using jukugo ruby. (Jukugo ruby formats like a word-to-word association, but line-breaks differently: the associated kana must be kept with their kanji base.) This colorized extract shows the association of kana to kanji:
  http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-jukugo-ruby
The ratio of compound words to simple words is 2:1. The rest of the page holds close to this ratio.

Other parts of the magazine use double-annotated ruby. Example:
  http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-35
Notice the line-breaking behavior and the word associations. Here is a diagrammed excerpt. The ruby base is in red. The first annotation (romaji) is in blue. The second annotation (English transliteration) is green:
  http://fantasai.inkedblade.net/weblog/2011/ruby/mangajin-double-annotation

Here is real-world use of complex ruby. You can of course continue to argue that the use case is unimportant, but it exists.

================================================================================
#16 Ian 'Hixie' Hickson 2011-10-10 19:44:22 +0000
--------------------------------------------------------------------------------
The first one does not seem to require anything the spec doesn't already provide.

The second is consistent with what I wrote in comment 1. I make no argument that there are no use cases. My argument is that the use cases are obscure. The second example here is not common text, it's a very specialised case where the language itself is being taught. There are lots of examples of how we don't currently support that kind of thing. For example, we don't have markup for grammar annotation (no <verb>, <subject>, <adverbial-clause> elements) which would be very useful for people teaching French or English. We don't have anything for marking up family trees or molecular structures, even though that means HTML is deficient for supporting those use cases (I get at least one person who asks me whether we can add markup for genealogy every few months, because right now they're stuck with using bitmaps or abusing tables to convey their data, and that sucks). I gave other examples in comment 1.

================================================================================
#17 Ian 'Hixie' Hickson 2011-11-02 19:30:21 +0000
--------------------------------------------------------------------------------
I spoke with the i18n group about this yesterday, and it seems that we don't really need to add any elements to handle the important use cases here.

Multiple annotations can be handled pretty easily if we just define that nested ruby is semantically equivalent to two annotations; picking which side the annotations appear on is a stylistic issue for CSS.

Monoruby and group ruby are both handled already; the only difference is how much is put in the ruby base before the annotation.

Jukugo is a stylistic variant of group ruby, again to be handled in CSS.

Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem to be a big deal. It's time for implementations to just implement ruby. AT fallback is not impossible in any of these cases and is unaffected by how we mark it up.

The last remaining case is what to do with multiple annotation if there is word-pairing for each component. Not supporting this doesn't seem like a big deal, but if we do want to support it, we could do it with multiple <rt>s for each ruby base.

In conclusion, the spec should be changed to limit ruby nesting to two levels, defining the outer level as a phrase-level annotation; and we should consider supporting multiple <rt>s per base, if there is reason to believe that multiple monoruby annotations at the ends of lines are common.

================================================================================
#18 Michael[tm] Smith 2011-11-20 17:26:17 +0000
--------------------------------------------------------------------------------
Henri, any response to comment #17 from Hixie?

================================================================================
#19 Henri Sivonen 2011-11-21 07:36:25 +0000
--------------------------------------------------------------------------------
I'm not competent to disagree with the i18n group on this topic. fantasai, bz?

================================================================================
#20 fantasai 2011-11-29 00:49:46 +0000
--------------------------------------------------------------------------------
> Jukugo is a stylistic variant of group ruby,

This is not true.

> again to be handled in CSS.

CSS can handle jukugo vs. mono rendering at the stylistic level *iff* both the pairing and the word-boundary information is recorded in the HTML. Group ruby doesn't record any sub-word pairing information, because there isn't any, so you'll have to explain better what you mean by this sentence.

> defining the outer level as a phrase-level annotation

This doesn't make sense for, e.g. double annotating kanji with both kana and romaji.

> we should consider supporting multiple <rt>s per base, if there is reason to
> believe that multiple monoruby annotations at the ends of lines are common.

I have no idea what this is referring to.

================================================================================
#21 Koji Ishii 2011-12-04 05:57:25 +0000
--------------------------------------------------------------------------------
> Fallback if we rely on this simple pattern is suboptimal, but that doesn't seem
> to be a big deal. It's time for implementations to just implement ruby. AT
> fallback is not impossible in any of these cases and is unaffected by how we
> mark it up.

Fallback isn't only for browsers without ruby support. One UA vendor I know considered using fallback when ruby is too small to read, but gave up due to the text quality issue fantasai pointed out.

================================================================================
#22 fantasai 2012-02-20 17:18:42 +0000
--------------------------------------------------------------------------------
http://lists.w3.org/Archives/Public/public-i18n-cjk/2012JanMar/0063.html

================================================================================

--
Configure bugmail: https://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
Received on Wednesday, 18 July 2012 07:25:13 UTC
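
Illustrative sketch (not part of the original message): the parser edits requested in comment #0 can be modelled roughly as below. This is a toy model under stated assumptions, not the spec algorithm or any browser's implementation. The names IMPLIED_END_TAGS, generate_implied_end_tags, has_ruby_in_scope, handle_start_tag and open_elements_after are hypothetical; the base list of implied-end-tag elements (dd, dt, li, option, optgroup, p, rp, rt) is assumed from the spec of that era; and the "in scope" check is simplified to plain membership in the stack.

```python
# Toy model of the "in body" start-tag handling proposed in comment #0.
# Assumption: the base list (dd, dt, li, option, optgroup, p, rp, rt) reflects
# the spec of that era; rb, rbc and rtc are the additions requested in edit 1.
IMPLIED_END_TAGS = {
    "dd", "dt", "li", "option", "optgroup", "p", "rp", "rt",
    "rb", "rbc", "rtc",
}


def generate_implied_end_tags(stack, except_for=None):
    """Pop the current node while it is in the list and is not the exception."""
    while stack and stack[-1] in IMPLIED_END_TAGS and stack[-1] != except_for:
        stack.pop()


def has_ruby_in_scope(stack):
    """Simplified scope check: the real algorithm stops at scoping elements."""
    return "ruby" in stack


def handle_start_tag(stack, name):
    """Apply the three proposed entries from edit 2, then insert the element."""
    exceptions = {"rbc": None, "rtc": None, "rb": "rbc", "rt": "rtc", "rp": "rtc"}
    if name in exceptions and has_ruby_in_scope(stack):
        generate_implied_end_tags(stack, except_for=exceptions[name])
    stack.append(name)  # "Insert an HTML element for the token."


def open_elements_after(tags):
    """Feed a sequence of start tags and return the resulting open-element stack."""
    stack = ["html", "body"]
    for tag in tags:
        handle_start_tag(stack, tag)
    return stack


if __name__ == "__main__":
    # Simple Ruby: <ruby><rb><rt> leaves rt as a child of ruby.
    print(open_elements_after(["ruby", "rb", "rt"]))
    # Complex Ruby: <ruby><rbc><rb><rb><rtc><rt><rp> leaves rp as a child of rtc.
    print(open_elements_after(["ruby", "rbc", "rb", "rb", "rtc", "rt", "rp"]))
```

In this model an rb start tag closes an open rb or rt but stops at rbc, and an rt or rp start tag stops at rtc, which is what lets rp end up as a child of rtc in the Complex Ruby case described in the rationale.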