Re: word separators (was: Ignoring empty paragraphs)

On Fri, 14 Apr 2000, Daniel Glazman wrote:

> Just FYI and information of the other readers : one of my best friends is
> a latinist. He had to put on an intranet some weeks ago the exact copy of
> a Roman wall inscription where words are separated by a colon.
> Not by whitespaces.

This raises an interesting question about division of text into lines.
The current HTML specifications are very vague about it. (Even HTML 2.0
was more explicit, I'd say.) I think the general - and correct - idea is
that the problem is inherently dependent on the natural language used
in the document, to be solved among other "i18n" issues.

But that's not all. Lots of problems are language-independent, at least
mostly. It would be somewhat artificial to approach the particular
problem mentioned above by adding a specifier to the language code
(say lang="la-inscriptions"). (By the way, I'd say that most inscriptions
don't use a colon but a character that can be identified with the
middle dot character, ·, which has rather mixed usage, see
http://www.hut.fi/u/jkorpela/latin1/3.html#B7 )

It would be desirable to impose _some_ requirements or at least
recommendations about division of text into lines by browsers. They
could try to cover some small but important subset of the line breaking
default rules in Unicode, see
http://www.unicode.org/unicode/reports/tr14/

But even if both such rules and language-sensitive word division
methods are eventually applied, and _especially_ since it will take a
long time before they are widely usable, some methods for preventing
line breaks _and_ for explicitly allowing line breaks are needed.
(The objection that they should be handled in style sheets would be
theoretical at present, and not very good theory in my opinion; such
issues are difficult to handle in CSS even in principle, since the natural
way to handle them is _interspersed_ markup; and it's questionable whether
the inherent indivisibility or divisibility of a string is a purely
presentational issue.)

Well, such methods actually exist and are widely supported, though not
defined in HTML specifications: WBR and NOBR. They could be defined simply
as phrase level markup and as applicable to textual data only. It's
obviously desirable to be able to specify allowable break points within
long "words". The need for NOBR is not that obvious, but see some notes
at http://www.hut.fi/u/jkorpela/html/nobr.html or regard it as just a
logical counterpart to WBR. :-)

In a case like the one discussed, WBR would not make (or does not make;
it can actually be used at present, it's just not standardized) things
that simple, but at least one could programmatically insert <wbr>
(or <wbr />) after each colon or middle dot.
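
As a rough illustration of that programmatic insertion, here is a
minimal Python sketch. The function name add_wbr and the sample text are
made up for the example, and it assumes the input is plain character
data with no markup in it yet (a colon inside an attribute value such as
an href would otherwise get a <wbr> too).

```python
import re

# Characters after which a line break should be allowed:
# the colon and the middle dot (U+00B7).
SEPARATORS = re.compile(r"([:\u00B7])")


def add_wbr(text: str) -> str:
    """Insert <wbr> after each colon or middle dot in plain text.

    Hypothetical helper for illustration; the input is assumed to
    contain no HTML markup of its own.
    """
    return SEPARATORS.sub(r"\1<wbr>", text)


if __name__ == "__main__":
    # Prints SENATVS·<wbr>POPVLVSQVE·<wbr>ROMANVS
    print(add_wbr("SENATVS\u00b7POPVLVSQVE\u00b7ROMANVS"))
```

Text without either separator passes through unchanged, so the function
could be applied to every text run of the transcription before it is
placed into the page.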

-- 
Yucca, http://www.hut.fi/u/jkorpela/ or http://yucca.hut.fi/yucca.html

Received on Friday, 14 April 2000 08:44:43 UTC