Re: [csswg-drafts] [css-text] Questionable Thai words from James Clark via GitHub on 2018-03-22 (public-css-archive@w3.org from March 2018)

From: James Clark via GitHub <sysbot+gh@w3.org>
Date: Thu, 22 Mar 2018 02:52:02 +0000
To: public-css-archive@w3.org
Message-ID: <issue_comment.created-375162188-1521687121-sysbot+gh@w3.org>

I think it would be very appropriate for the choice of word breaks to be influenced by line-break: loose/normal/strict.
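To make that concrete, here is a minimal sketch of how a segmenter's output could be filtered by the line-break value. It assumes the segmenter annotates each candidate boundary with a granularity level (1 = clear word boundary, 3 = fine-grained boundary inside a compound). The level scheme, the `ALLOWED_LEVEL` table, and the `break_opportunities` function are all hypothetical names for illustration, not part of any spec or implementation.

```python
# Hypothetical mapping: the stricter the line-break value, the fewer
# (only the most clear-cut) boundaries are offered as break opportunities.
ALLOWED_LEVEL = {"strict": 1, "normal": 2, "loose": 3}


def break_opportunities(boundaries, line_break="normal"):
    """Return the offsets usable as line-break opportunities.

    `boundaries` is a list of (offset, level) pairs produced by some
    segmenter; a lower level means a more clear-cut word boundary.
    """
    max_level = ALLOWED_LEVEL[line_break]
    return [offset for offset, level in boundaries if level <= max_level]


# Made-up boundary annotations for a 12-character run of text.
boundaries = [(3, 1), (5, 3), (8, 2), (12, 1)]

strict_breaks = break_opportunities(boundaries, "strict")
normal_breaks = break_opportunities(boundaries, "normal")
loose_breaks = break_opportunities(boundaries, "loose")
```

With this made-up input, strict keeps only the two level-1 boundaries, normal adds the level-2 boundary, and loose allows all four.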

A fundamental difficulty is that matching against a dictionary is far from an adequate approach to Thai word-breaking. The state of the art today uses machine learning trained on a corpus. However, I don't know of any corpus that marks up fine-grained distinctions between kinds of word boundary. Maybe it would be possible to figure that out automatically, but that would be a research problem.
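As an illustration of why dictionary matching falls short, here is a minimal sketch of greedy longest-match segmentation. The four-word toy dictionary and the function name `longest_match_segment` are assumptions made for this example. The input ตากลม is a commonly cited ambiguity: it can be read as ตา กลม ("round eyes") or ตาก ลม ("exposed to wind"), and the dictionary alone cannot tell which is meant.

```python
def longest_match_segment(text, dictionary):
    """Greedy left-to-right longest matching against a word list."""
    max_len = max(len(word) for word in dictionary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; if nothing matches, fall back
        # to emitting a single character so the loop always advances.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary or size == 1:
                words.append(candidate)
                i += size
                break
    return words


# Toy dictionary (an assumption for illustration, not a real word list).
TOY_DICTIONARY = {"ตา", "กลม", "ตาก", "ลม"}

segments = longest_match_segment("ตากลม", TOY_DICTIONARY)
# Greedy matching commits to "ตาก" + "ลม" and never considers "ตา" + "กลม".
```

Both segmentations consist entirely of dictionary words, so choosing between them needs context, which is what corpus-trained models try to capture.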

Another big problem area is proper names. These are quite challenging because they are composed of multiple words but shouldn't be broken, and there are no capital letters to distinguish them (instead there are words, such as the equivalents of Mr/Mrs/Miss, that are typically followed by a proper name).

The goals of word segmentation for line-breaking and for editing are a bit different. With line-breaking, you are trying to maximize the number of line-break opportunities without impairing readability. With editing, predictability is important, and you also want units that correspond as often as possible to what a user wants to edit. But really you would need to do user testing to see what people find convenient. My guess is that it would be convenient to have editing-words be longer than line-breaking-words.

--
GitHub Notification of comment by jclark
Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/2455#issuecomment-375162188 using your GitHub account

Received on Thursday, 22 March 2018 02:52:04 UTC