Re: [jlreq] Proposal from Eric Muller: re: expanding JLReq character class to Unicode (#242) from Eric Muller via GitHub on 2020-10-30 (public-i18n-archive@w3.org from October to December 2020)

From: Eric Muller via GitHub <sysbot+gh@w3.org>
Date: Fri, 30 Oct 2020 23:08:01 +0000
To: public-i18n-archive@w3.org
Message-ID: <issue_comment.created-719838122-1604099280-sysbot+gh@w3.org>
> >    grapheme cluster vs grapheme cluster occurrence
> 
> What is the difference between A and an occurrence of A in this context? (Maybe this is simply a novice English question.)

In the string "moto", we have the graphemes "m", "o" and "t". The grapheme "o" occurs twice. In this particular example, there is no reason to treat the two occurrences differently, but it may be useful in other cases. For example, a U+2019 could occur (along with a U+2018) to quote some Japanese text, in which case that occurrence could be cl-02 closing bracket, and there could be another occurrence, in the same line, where it is apostrophe in an English word (e.g. in "don’t"); that other occurrence could be a cl-27 western character.
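As a rough illustration of per-occurrence classification (a sketch only, not part of the proposal): the heuristic below, the function name and the constants are all assumptions made for this example. It treats a U+2019 between two ASCII letters as an apostrophe (cl-27) and a U+2019 that closes a pending U+2018 as a closing bracket (cl-02).

```python
OPEN_QUOTE = "\u2018"   # LEFT SINGLE QUOTATION MARK
CLOSE_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def _is_ascii_letter(ch):
    return ch.isascii() and ch.isalpha()


def classify_right_single_quotes(text):
    """Return (index, class) for every U+2019 occurrence in text.

    Hypothetical heuristic, for illustration only.
    """
    results = []
    open_pending = False
    for i, ch in enumerate(text):
        if ch == OPEN_QUOTE:
            open_pending = True
        elif ch == CLOSE_QUOTE:
            prev_letter = i > 0 and _is_ascii_letter(text[i - 1])
            next_letter = i + 1 < len(text) and _is_ascii_letter(text[i + 1])
            if prev_letter and next_letter:
                # Apostrophe inside a Western word, e.g. "don’t".
                results.append((i, "cl-27"))
            elif open_pending:
                # Closes an earlier U+2018: behaves as a closing bracket.
                results.append((i, "cl-02"))
                open_pending = False
            else:
                results.append((i, "cl-27"))
    return results


sample = "\u2018日本語\u2019 don\u2019t"
print(classify_right_single_quotes(sample))
```

The same code point ends up in two different classes within one line, which is the distinction between a grapheme and its occurrences.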

>What is the benefit of having different classes between horizontal and vertical? I would appreciate it if you could elaborate here a bit.

Take the case of U+00C6 Æ. It is my understanding that when in horizontal text (or in sideways in vertical text), the most likely desired behavior is cl-27 western char, whereas when it is upright in vertical text, the most likely desired behavior is cl-19 ideographic character.
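A minimal sketch of what an orientation-dependent mapping could look like; the table layout, the `Orientation` enum and the function name are assumptions for illustration, and only the U+00C6 entry comes from the paragraph above.

```python
from enum import Enum


class Orientation(Enum):
    HORIZONTAL = "horizontal"
    VERTICAL_SIDEWAYS = "vertical-sideways"
    VERTICAL_UPRIGHT = "vertical-upright"


# Hypothetical table: character -> class per orientation.
CLASS_BY_ORIENTATION = {
    "\u00C6": {
        Orientation.HORIZONTAL: "cl-27",         # western character
        Orientation.VERTICAL_SIDEWAYS: "cl-27",  # western character
        Orientation.VERTICAL_UPRIGHT: "cl-19",   # ideographic character
    },
}


def jlreq_class(ch, orientation):
    """Return the class for ch in the given orientation, or None if unlisted."""
    per_orientation = CLASS_BY_ORIENTATION.get(ch)
    if per_orientation is None:
        return None
    return per_orientation[orientation]


print(jlreq_class("\u00C6", Orientation.VERTICAL_UPRIGHT))
```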

> I could not figure out what “not carefully on punctuation” means…

Consider some text:  [japanese]“[english]”[japanese]. If we use the locale to determine the class of the occurrences of “ and ”, then we are making a distinction between:

<span xml:lang='ja'>[japanese]<span xml:lang='en'>“[english]”</span>[japanese]</span>

and 

<span xml:lang='ja'>[japanese]“<span xml:lang='en'>[english]</span>”[japanese]</span>

i.e. on whether the quotes are inside or outside the English span. I don't think we can rely on authors to master the difference between the two, especially if they use a WYSIWYG editor, where the difference is difficult to "see".
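To make the difference visible, here is a small sketch that reports the language in effect at the opening quote for each of the two markups. It uses Python's standard `html.parser`; the `LangTracker` class, the helper name and the placeholder strings `JA` / `EN` are assumptions for this example.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Record each text run together with the xml:lang in effect."""

    def __init__(self):
        super().__init__()
        self.lang_stack = []
        self.runs = []  # list of (text, lang)

    def handle_starttag(self, tag, attrs):
        inherited = self.lang_stack[-1] if self.lang_stack else None
        lang = dict(attrs).get("xml:lang", inherited)
        self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        if self.lang_stack:
            self.lang_stack.pop()

    def handle_data(self, data):
        lang = self.lang_stack[-1] if self.lang_stack else None
        self.runs.append((data, lang))


def quote_langs(markup, quote="\u201c"):
    """Return the language of every text run containing the given quote."""
    parser = LangTracker()
    parser.feed(markup)
    parser.close()
    return [lang for text, lang in parser.runs if quote in text]


INSIDE = "<span xml:lang='ja'>JA<span xml:lang='en'>\u201cEN\u201d</span>JA</span>"
OUTSIDE = "<span xml:lang='ja'>JA\u201c<span xml:lang='en'>EN</span>\u201dJA</span>"

print(quote_langs(INSIDE), quote_langs(OUTSIDE))
```

The rendered text is identical in both cases, yet a locale-driven classifier would see the quote as English in the first and Japanese in the second.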

> I can see why separating each inseparable makes sense, because that way you can construct a state machine.

Yes. In JLREQ, right now, this is handled by a footnote. But every footnote means an "exception" in the code. 

> Regarding your second reasoning, i.e. ability to define different glues, do you have concrete examples of why such ability is a good thing?

More flexibility for the book designer can only be a good thing. As a concrete example, consider the pair <ideograph> <emdash>. The glue between those two can grow for justification purposes. It seems reasonable to have the glue between <vertical kana repeat mark lower half> and <emdash> grow in the same way; but because they are currently both inseparable, you can't express that (or you have to use a footnote...).
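A sketch of the point, under assumptions: the class names, the two tables and the boolean "can stretch" values below are invented for illustration and are not taken from JLREQ. With a single merged class, both pairs hit the same table entry; with split classes, each pair can get its own.

```python
EM_DASH = "\u2014"
KANA_REPEAT_LOWER = "\u3035"  # VERTICAL KANA REPEAT MARK LOWER HALF
IDEOGRAPH = "\u6f22"

# Hypothetical assignment with one shared "inseparable" class.
MERGED_CLASS = {EM_DASH: "inseparable", KANA_REPEAT_LOWER: "inseparable",
                IDEOGRAPH: "ideograph"}
MERGED_GLUE = {("ideograph", "inseparable"): True,
               ("inseparable", "inseparable"): False}

# Hypothetical assignment where each inseparable gets its own class.
SPLIT_CLASS = {EM_DASH: "emdash", KANA_REPEAT_LOWER: "kana_repeat_lower",
               IDEOGRAPH: "ideograph"}
SPLIT_GLUE = {("ideograph", "emdash"): True,
              ("kana_repeat_lower", "emdash"): True,
              ("emdash", "emdash"): False}


def can_stretch(left, right, classes, glue):
    """True if the glue between left and right may grow for justification."""
    return glue.get((classes[left], classes[right]), False)


# Merged: the kana repeat mark + em dash pair is forced to behave like
# em dash + em dash, because both pairs map to the same table key.
print(can_stretch(KANA_REPEAT_LOWER, EM_DASH, MERGED_CLASS, MERGED_GLUE))
# Split: the two pairs can be given different behavior.
print(can_stretch(KANA_REPEAT_LOWER, EM_DASH, SPLIT_CLASS, SPLIT_GLUE))
```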

 >> As noted earlier, markup should always be available to influence the determination of the glue. Thus there is no need for such a Unicode property to be perfect; it does however need to be easily accessible and fairly stable.

> I am not sure if I understood the discussion here. There will be systems / applications where such markup is not possible (due to limitations of the underlying engine, or limitations of the UI), and if so, the existence of markup could not be the reason why the property does not need to be perfect. I am not necessarily saying it needs to be perfect, as it can’t be.

I definitely have a bias for environments with markup (whether edited directly - e.g. HTML in a text editor - or indirectly - e.g. InDesign, with UI to effectively edit the underlying markup). But I also think it is the dominant use case (even here in GitHub I have some markup). And I am not sure one can go very far without markup, especially in Japanese typography.

I do however think that both plain text and markup need something accessible and stable more than something perfect.

> Do you have examples of the glue design giving influence on the design of the character classes?

Keeping in mind that I excluded line breaking, the classes serve *only* to determine what glue to use. So if you completely ignore the glue, there is no need to have classes. A bit more concrete: if Hiragana and Katakana always have the same glue behavior, then there is no need to have two different classes. When defining JLREQ, the classes are a consequence of the glues (of course, when implementing JLREQ, the glues are a consequence of the classes).
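One way to picture "classes are a consequence of the glues" is the sketch below: given a glue table over candidate classes, it groups together the candidates whose glue behavior is identical on both sides. The function name, the three candidate classes and the placeholder glue labels `g0` / `g1` are assumptions for illustration, not JLREQ values.

```python
from collections import defaultdict


def merge_equivalent_classes(classes, glue):
    """Group classes whose glue is identical as left and as right neighbor."""
    groups = defaultdict(list)
    for c in classes:
        row = tuple(glue[(c, other)] for other in classes)  # c on the left
        col = tuple(glue[(other, c)] for other in classes)  # c on the right
        groups[(row, col)].append(c)
    return list(groups.values())


CLASSES = ["hiragana", "katakana", "opening_bracket"]

# Hypothetical glue table: "g1" before an opening bracket, "g0" elsewhere.
GLUE = {}
for left in CLASSES:
    for right in CLASSES:
        before_bracket = right == "opening_bracket" and left != "opening_bracket"
        GLUE[(left, right)] = "g1" if before_bracket else "g0"

print(merge_equivalent_classes(CLASSES, GLUE))
```

In this toy table hiragana and katakana never differ, so they collapse into one group, while the opening bracket stays on its own.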

> Do you suggest that there will be a part of text where this model is applied and another part of text where it is not applied, instead of defining everything in one spacing model? If you have multiple models, especially in one document, it would become harder to for example set an uniform style, e.g. indentation.

The use case I have in mind is mixed-language documents, e.g. Japanese books with extensive (paragraph-level) quotes in English, or dual-language books (e.g. left page in Japanese, right page in English).

Nobody is forced to use different mojikumi settings (or even Japanese vs. non Japanese spacing) in a document, but some people need to.

> I have not yet carefully looked at the character class assignments you proposed but how did you handle JLReq classes that are dependent on the layout, such as ruby base (cl-20, 21, 22, 23, 24, 25, 28, 29, 30)?

In general, the context will first determine the class of an occurrence. When the context does not provide the answer, then we go to the mapping character → class. So when we have "...X<ruby>...</ruby>...", at the position between X and <ruby>, the class on the left comes from the class of X, and the class on the right comes from <ruby>, i.e. this is a cl-27 / cl-22. In this particular case, the markup had another purpose, but one could have markup where the only effect would be to set the classes.
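The "context first, then character mapping" order can be sketched as follows. The function names and the tiny `CHAR_CLASS` table are assumptions for this example; the cl-27 / cl-22 result matches the ruby case described above, with X taken to be a western character.

```python
# Hypothetical character -> class mapping, used only as a fallback.
CHAR_CLASS = {"X": "cl-27"}


def occurrence_class(ch, context_class=None):
    """Class of one occurrence: the context wins, then the character mapping."""
    if context_class is not None:
        return context_class
    return CHAR_CLASS.get(ch, "unknown")


def boundary_classes(left, right):
    """Classes on both sides of a boundary.

    Each side is a (character, context_class) pair; context_class is None
    when the markup says nothing about that occurrence.
    """
    return occurrence_class(*left), occurrence_class(*right)


# "...X<ruby>...</ruby>...": X has no context, the ruby markup supplies cl-22.
print(boundary_classes(("X", None), ("\u6f22", "cl-22")))
```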

To be more complete, because we are talking presentation, CSS rather than HTML would be the proper place, and the default HTML stylesheet would have something like "ruby { jlreq-class: cl22 }". (well; it's a bit more complicated, but you get the idea).

>   0x000021 | R | H | westernChar
>    0x000080 | R | H | V | unknown
>
> Is it correct to interpret that code points between 0x000021 and 0x000080 follow properties specified by 0x000021?

Yes. 
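Read that way, the table is a list of range starts, and a lookup finds the last start at or below the code point. A minimal sketch using Python's `bisect`, assuming only the two quoted rows and keeping just the class column (the other columns are left out here):

```python
import bisect

# Range starts from the two quoted rows, in ascending order.
RANGE_STARTS = [0x21, 0x80]
RANGE_CLASSES = ["westernChar", "unknown"]


def class_for_code_point(cp):
    """Return the class of the last row whose start is <= cp, or None."""
    i = bisect.bisect_right(RANGE_STARTS, cp) - 1
    if i < 0:
        return None  # below the first listed start
    return RANGE_CLASSES[i]


print(class_for_code_point(0x41))  # "A" falls in the range starting at 0x21
print(class_for_code_point(0x80))  # a new range starts here
```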

Eric.

PS: thanks for helping with the translation


-- 
GitHub Notification of comment by emuller-amazon
Please view or discuss this issue at https://github.com/w3c/jlreq/issues/242#issuecomment-719838122 using your GitHub account


Received on Friday, 30 October 2020 23:08:03 UTC