Re: HTML5 and ruby

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 18 Jan 2012 22:27:42 +0100
To: Eric Muller <emuller@adobe.com>
Cc: public-i18n-cjk@w3.org
Message-ID: <20120118222742964393.ae1b06b3@xn--mlform-iua.no>
Eric Muller, Wed, 18 Jan 2012 10:54:00 -0800:

> On 1/12/2012 5:03 PM, Leif Halvard Silli wrote:
>>> Both
>>> <ruby><rb>東</rb><rt>とう</rt><rb>京</rb><rt>きょう</rt></ruby> (may
>>> be with a different interleaving of rbs and rts)
>>> and
>>> <ruby>東<rt>とう</rt>京<rt>きょう</rt></ruby>
>>> capture the list of pairs {東, とう}, {京, きょう} equally well.
>> Why is *any* of the two examples above any better than this:
>> <ruby><rb>東</rb><rt>とう</rt></ruby><ruby><rb>京</rb><rt>きょう
>> </rt></ruby>
  ... snip ...
> Given the same characters, and in fact the same pairs, the decision 
> to treat those pairs as jukugo or not is based on the semantics of 
> the text. It seemed obvious to me that using a single <ruby> vs 
> multiple <ruby> was the only way to go, but you are right that I did 
> not made that clear.

So, you say that a compound jukugo needs to be kept together as a 
compound. And that that compound wrapper is <ruby>. My claim, however, 
is that that is not enough: The <rt> between each <rb> splits the 
compound up, preventing e.g. spellcheckers from recognizing the 
compound as a compound.

> Note that this is not entirely different from the underline problem 
> which was discussed on www-style not too long ago: <u>A</u><u>BC</u> 
> is considered distinct from <u>ABC</u> (and a fortiori from three 
> successive <u>), especially in CJK world. The underline is used on 
> names, and reflecting the parts of the name, as in  (A)(BC) is deemed 
> important.

Just as <u> is not intended for alphabetic usage only, <ruby> is not 
intended for Japanese only. W.r.t. <u>, then: for e.g. Latin text, 
what's the difference between <u>f</u><u>oo</u> and <u>foo</u>? Would 
the spell checker not recognize both of them as one and the same word? 

OTOH, if one did this: <u lang=en >f</u><u lang=ru >oo</u>, then a 
sensitive spellchecker would not see it as the word 'foo' but as two 
words: The English word 'f' and the Russian word 'oo'... And, this 
allows me to jump back to the initial question: Provided I understood 
you correctly, then - as already stated above - I am not convinced by 
your argument. You see, the underlying claim of my question is that it 
doesn't matter whether you write 

1. <ruby><rb>東</rb><rt>とう</rt><rb>京</rb><rt>きょう</rt></ruby>,
2. <ruby>東<rt>とう</rt>京<rt>きょう</rt></ruby> or
3. <ruby><rb>東</rb><rt>とう</rt></ruby><ruby><rb>京</rb><rt>きょう
   </rt></ruby>.

I believe that for the spellchecker to perceive *words*, including 
compound 'jukugos', the word must be kept EITHER inside a single 
<rb> OR inside two or more adjacent <rb>s. If there needs to be a 
wrapper, then it could be <ruby> or <rbc>.

The problem, however, is that HTML5 allows us to do:


But this permission does not seem meaningful: a spell checker would 
not be able to detect the word 'foo' in the above example.

The meaningful answer would be, first and foremost, to *forbid* the 
above construct, meaning that <ruby> would permit only a single 
<rb><rt> pair: <ruby><rb><rt></ruby>. Two or more pairs [such as 
<ruby><rb><rt><rb><rt></ruby>] would be forbidden. Secondly, one would 
need EITHER to bring back <rbc> OR to permit two or more adjacent <rb> 
as direct children of <ruby>. Thus, either this:


And/Or this:


It also seems to me that <rt> by definition should be seen as a word 
separator: Each <rt> includes one or more words. Whereas <rb> is not 
seen as a word separator. 
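
As a minimal sketch of that rule (not the behaviour of any real 
spellchecker), assume a hypothetical helper, base_words, built on 
Python's standard html.parser module. Base text is accumulated across 
<rb> boundaries, every <rt> start tag ends the current word, and the 
content of <rt> is skipped. Explicit end tags are assumed, since this 
parser does not infer omitted ones, and whitespace handling is left out.

```python
from html.parser import HTMLParser


class BaseTextWords(HTMLParser):
    """Collect ruby base text, treating each <rt> as a word separator."""

    def __init__(self):
        super().__init__()
        self.words = []       # finished words, in document order
        self._current = ""    # base text seen since the last separator
        self._in_rt = False   # True while inside <rt>...</rt>

    def flush(self):
        # Close the word collected so far, if there is one.
        if self._current:
            self.words.append(self._current)
            self._current = ""

    def handle_starttag(self, tag, attrs):
        if tag == "rt":
            self.flush()      # <rt> separates words
            self._in_rt = True
        # <rb>, <rbc> and <ruby> are deliberately NOT separators here.

    def handle_endtag(self, tag):
        if tag == "rt":
            self._in_rt = False

    def handle_data(self, data):
        if not self._in_rt:   # annotation text is not base text
            self._current += data


def base_words(markup):
    """Return the base-text words a checker would see under this rule."""
    parser = BaseTextWords()
    parser.feed(markup)
    parser.close()
    parser.flush()            # emit any trailing base text
    return parser.words


interleaved = "<ruby><rb>東</rb><rt>とう</rt><rb>京</rb><rt>きょう</rt></ruby>"
adjacent = "<ruby><rb>東</rb><rb>京</rb><rt>とう</rt><rt>きょう</rt></ruby>"
print(base_words(interleaved))  # ['東', '京']
print(base_words(adjacent))     # ['東京']
```

Under this assumed rule, the interleaved forms split the compound into 
東 and 京, whereas two adjacent <rb>s followed by their <rt>s keep 東京 
together as one word.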

Leif Halvard Silli
Received on Wednesday, 18 January 2012 21:28:32 UTC
