- From: r12a via GitHub <sysbot+gh@w3.org>
- Date: Fri, 02 Dec 2016 15:05:29 +0000
- To: public-css-archive@w3.org
> The native scripts don't use hyphenations

There's still a problem where, say, Latin-script words are regularly embedded in a script that _does_ hyphenate, such as Arabic. You need to apply the right hyphenation rules according to the language of the second script.

Here's my thought process:

For any hyphenation to take place, the browser needs a hyphenation dictionary. Hyphenation dictionaries and rules are language specific, so you need to know the language of the text you're going to hyphenate.

Much romaji text in Japanese content will be things such as acronyms, which won't hyphenate anyway, but not all. Other text may be transliterations of Japanese.

It's unlikely that romaji text in Japanese will be marked up for a given language; it will instead fall under the `ja` used for the document or passage as a whole. This is because the romaji text is often not really considered to be in a different language, just in a different script. Even where the words are clearly, say, English, Japanese people don't see them as a separate language in the same way as German embedded in English.

Because there's unlikely to be markup, the hyphens property can't be used, because there's no way to tell the language of the non-CJK text.

On the other hand, assuming that it's English may work most of the time. There may however also be Japan-specific terms that are not in the standard English dictionary(?). So a hyphenation algorithm that switches dictionaries as the script changes, and includes perhaps some local Latin words, might work for Japanese.

Embedding German words/phrases into Japanese content is likely to be much the same as embedding them in English; you'd expect to have to indicate that this uses German hyphenation rules rather than English by marking things up. Likewise, if your content contains text in a range of languages, it's best to mark it up. But, as mentioned earlier, typical foreign-language text is not the same as romaji text in Japanese.
So we're talking about using a secondary hyphenation language where the script changes. This need not be only for Japanese; it is likely, for example, to be needed for Arabic too.

There needs to be a way of knowing which language is typically being embedded in a script; it may not always default to English(?). (For example, for ar-MA it may be French(?).) It may not even be in the Latin script(?).

Should one store the information about what language to assume in the browser, or allow the content author to specify it? The latter could also be useful for unusual passages where, say, all the Latin-script text is in German, to save time in marking it up.

However, if you want hyphenation to occur, not only do you need to guess or indicate the language of the alternate script, but you also need to disable `word-break:break-all` for non-CJK runs of text.

So maybe you need

(a) a new value, `word-break:break-all-hyphenate`, to make non-CJK text hyphenate, and
(b) a new property, `alt-script-hyphen-lang: <bcp 47 tag>`

--
GitHub Notification of comment by r12a

Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/785#issuecomment-264474512 using your GitHub account
Received on Friday, 2 December 2016 15:05:35 UTC