Re: [csswg-drafts] [css-text-3] Hyphenation usages in CJK from r12a via GitHub on 2016-12-02 (public-css-archive@w3.org from December 2016)

From: r12a via GitHub <sysbot+gh@w3.org>
Date: Fri, 02 Dec 2016 15:05:29 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-264474512-1480691127-sysbot+gh@w3.org>

> The native scripts don't use hyphenations

There's still a problem where, say, Latin-script words are regularly 
embedded in a script that _does_ hyphenate, such as Arabic.  You need 
to apply the right hyphenation rules according to the language of the 
second script.

Here's my thought process:

For any hyphenation to take place, the browser needs a hyphenation 
dictionary. Hyphenation dictionaries and rules are language specific, 
so you need to know the language of the text you're going to 
hyphenate.

Much of the romaji text in Japanese content will be things such as 
acronyms, which won't hyphenate anyway, but not all of it.  Other text 
may be transliterations of Japanese.

It's unlikely that romaji text in Japanese will be marked up for a 
given language, but it will instead fall under the `ja` used for the 
document or passage as a whole.  This is because the romaji text is 
often not really considered to be in a different language, just in a 
different script. Even where the words are clearly, say, English, 
Japanese people don't see them as a separate language in the same way 
as German embedded in English.

Without markup, the `hyphens` property can't be used, because there's 
no way to tell the language of the non-CJK text.

On the other hand, assuming that it's English may work most of the 
time.  There may however also be Japan-specific terms that are not in 
the standard English dictionary(?).  So a hyphenation algorithm that 
switches dictionaries as the script changes, and perhaps includes some 
local Latin words, might work for Japanese.

Embedding German words/phrases into Japanese content is likely to be 
much the same as embedding it in English – you'd expect to have to 
indicate that this uses German hyphenation rules rather than English 
by marking things up.  Likewise, if your content contains text in a 
range of languages, it's best to mark it up. But, as mentioned 
earlier, typically foreign language text is not the same as romaji 
text in Japanese.

So we're talking about using a secondary hyphenation language where 
the script changes.  This need not be only for Japanese; it is likely, 
for example, to be needed for Arabic too.  There needs to be a way of 
knowing which language is typically embedded in a script - it may not 
always default to English(?).  (For example, for ar-MA it may be 
French(?).) It may not even be in the Latin script(?). Should the 
browser store the information about what language to assume, or 
should the content author be allowed to specify it?  The latter could 
also be useful for unusual passages where, say, all the Latin script 
text is in German, to save time in marking it up.
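
To make the two options concrete, here is a minimal Python sketch 
that combines them: an author-supplied value wins, otherwise a 
built-in table is consulted, falling back from a longer language tag 
to a shorter one. The table contents simply echo the guesses above 
(including the uncertain ones) and are not real data; 
`DEFAULT_ALT_LANG` and `resolve_alt_lang` are hypothetical names. The 
lookup is also case-sensitive for simplicity, which real BCP 47 
matching is not.

```python
from typing import Optional

# Hypothetical browser defaults: document language -> assumed language
# of embedded other-script text. Values are illustrative guesses only.
DEFAULT_ALT_LANG = {"ja": "en", "ar": "en", "ar-MA": "fr"}


def resolve_alt_lang(doc_lang: str, author_override: Optional[str] = None) -> Optional[str]:
    """Pick the secondary hyphenation language for a document language."""
    if author_override:
        # The content author specified it, e.g. all Latin text is German.
        return author_override
    tag = doc_lang
    while tag:
        if tag in DEFAULT_ALT_LANG:
            return DEFAULT_ALT_LANG[tag]
        # Drop the last subtag: "ar-MA" -> "ar" -> "".
        tag = tag.rpartition("-")[0]
    return None  # no assumption available, so no hyphenation


print(resolve_alt_lang("ar-MA"))
print(resolve_alt_lang("ja", author_override="de"))
```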

However, if you want hyphenation to occur, not only do you need to 
guess or indicate the language of the alternate script, but you also 
need to disable `word-break:break-all` for non-CJK runs of text.

So maybe you need:
(a) a new value, `word-break:break-all-hyphenate`, to make non-CJK 
text hyphenate
(b) a new property, `alt-script-hyphen-lang: <BCP 47 tag>`


-- 
GitHub Notification of comment by r12a
Please view or discuss this issue at 
https://github.com/w3c/csswg-drafts/issues/785#issuecomment-264474512 
using your GitHub account

Received on Friday, 2 December 2016 15:05:35 UTC