RE: <NOBR> - Returning to the question ( 2 ) from Jukka K. Korpela on 2004-02-29 (www-html@w3.org from February 2004)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Sun, 29 Feb 2004 20:47:45 +0200 (EET)
To: www-html@w3.org
Message-ID: <Pine.GSO.4.58.0402292012460.3247@korppi.cs.tut.fi>
On Sat, 28 Feb 2004, Ernest Cline wrote:

> The part that I was referring to is the fourth paragraph of Section 3
> of UAX#14: [1]
>
> "When expanding or compressing inter-word space, only the space
> marked by U+0020  SPACE and U+3000  IDEOGRAPHIC SPACE are
> normally subject to compression, and only spaces marked by U+0020
> SPACE, and occasionally spaces marked by U+202F  THIN SPACE
> are subject to expansion. All other space characters have fixed width."
>
> I will agree that it should be better marked out, as this is a ridiculous
> place for putting this requirement,

It's confusing indeed. But it's not a requirement. It is not presented in
a normative language - rather, as a description. Note the words "normally"
and "occasionally", which are clearly descriptive, and the lack of words
like "shall" or even "should".

But it is useful to notice that no-break spaces are more or less _meant_
to have fixed width. This, in turn, is an argument in favor of <nobr>,
which need not imply any such semantics. My point is that no-break space
combines two logically distinct properties into a single character,
inseparably.

> and that this is in a non-normative part of the annex.

So it's surely not a _requirement_, is it?

> However, this also accords rather well with the
> CSS 2.1 definition of whitespace (given in section 4, Syntax, but also
> referred to by the 'white-space' property) which states: [2]

Well, it says that no-break spaces are not "whitespace", which is not
under debate. What I would like to debate is the choice of the word
"whitespace", when used as a term that designates a specific set of
characters, excluding characters that most certainly fall under an
intuitive "whitespace" concept.
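
To make the mismatch concrete, here is a small Python sketch. The set
CSS_WHITESPACE is my own transcription of the five characters that
CSS 2.1 section 4 lists as white space (an assumption worth checking
against the spec), and str.isspace() merely stands in for the intuitive
concept; describe() is a hypothetical helper name.

```python
import unicodedata

# Assumption: the five characters CSS 2.1 (section 4, Syntax) lists as
# "white space": space, tab, line feed, carriage return, form feed.
CSS_WHITESPACE = {"\u0020", "\u0009", "\u000A", "\u000D", "\u000C"}

def describe(ch):
    """Return (name, is CSS whitespace, is whitespace per str.isspace)."""
    return (unicodedata.name(ch, "<unnamed>"), ch in CSS_WHITESPACE, ch.isspace())

# SPACE, NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE
for ch in ("\u0020", "\u00A0", "\u2003", "\u3000"):
    print("U+%04X" % ord(ch), describe(ch))
```

Only the first of the four is in the CSS set, although all four are
spaces by any intuitive reading.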

> None of the major browsers does a good job with line breaking, at present

And this means that for several years from now on, if not longer,
<nobr> and <wbr> are the only practical way to avoid some really
nasty phenomena.

> As for the example, I will note that if instead of U+002D HYPHEN-MINUS
> one uses the unambiguous U+2212 MINUS SIGN, then applying the rules
> of UAX#14 would prohibit that break.

Yes, if those rules were correctly applied. The problem is that the minus
sign has far more limited support than hyphen-minus (or the common
surrogate, en dash, which Unicode defines as allowing a line break after
it).

> Yes, ideally, IE should perform
> a contextual analysis to determine whether the hyphen-minus is acting
> more like a hyphen (class BA) or a minus (class PR)

I hope IE won't try anything like that in the next few decades. Let's hope
it will first realize that a two-character string like "-a" shall not be
broken no matter what. Analyzing what "-a" means is far beyond the scope
of even the most advanced technologies at present.

> If "-a"
> occurred at the start of an element or following a space, I would expect it
> to be treated as a minus, which would handle the common case.

And you would guess wrong probably more often than not. Consider the
simple statement
"Many Latin neuter words have <i>-a</i> as their plural suffix."
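
As a rough sketch of the proposed heuristic (my reading of it, not
anything a browser is known to implement), with the markup stripped so
that only the "following a space" case is exercised. The function name
guess_is_minus is hypothetical.

```python
def guess_is_minus(text, i):
    """Hypothetical heuristic: treat the hyphen-minus at index i as a
    minus sign if it starts the text or follows a space."""
    if text[i] != "-":
        raise ValueError("no hyphen-minus at index %d" % i)
    return i == 0 or text[i - 1] == " "

sentence = "Many Latin neuter words have -a as their plural suffix."
i = sentence.index("-")
print(guess_is_minus(sentence, i))  # True, although "-a" is a suffix here
print(guess_is_minus("x-1", 1))     # False, although this one is a minus
```

Position alone cannot tell a suffix from a negative quantity, which is
why I'd rather see "-a" simply kept together.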

> I agree that given current implementations, it is not a good current
> solution, but the use of the class GL characters does have a
> normative effect for line breaking,

I'm not sure which of the following arguments is most important, but
when combined, I think they are rather convincing:
1. Unicode line breaking rules are a mess, and an attempt to solve
   problems at a wrong protocol level, and they mostly just
   _create_ problems - especially if taken blindly and not just
   as a suggested "neutral base" to be modified by application needs.
2. Those rules have not actually been implemented in browsers, except
   for a random subset and with lots of bugs. (This naturally relates
   to their being a mess.)
3. They are extremely hard to understand and awkward to apply in
   authoring.

> Well, if there ever is an IE 7, I would expect it to fully support the
> normative portions of UAX#14,

I wouldn't.

BTW, is there some specific requirement in some HTML or XML specification
that says that user agents shall apply Unicode standard semantics?
I thought the normative references pointed to ISO 10646 and did not
include specific requirements concerning processing of characters
in general. If such a requirement is made, I wonder how it should really
be understood. For example, do pre-2002 specifications really mandate that
WJ be treated as defined by the Unicode standard in 2002?

> Opera supports this portion of the standard, but, it does
> suffer from the same bug concerning the glyphs for CGJ and WJ
> that IE does.

I seriously doubt whether Opera's support is even close to the standard,
but it suffices to state that if a browser does not adequately process the
Unicode control characters affecting line breaking, it fails to conform
in a most essential way. This should not depend on font issues.

> Given that these two characters were only added in
> Unicode 3.2, I would expect that programs that rely entirely upon
> the OS for glyph information will have to wait until new OS versions
> that are aware of these characters are released to work correctly.

Pardon? How would glyphs affect an issue that revolves around the
principle of treating certain characters as invisible control codes?
Surely the most trivial part of support to them would be to refrain from
any attempts to display them using any font.
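
A minimal sketch of that trivial part, assuming the renderer can consult
the Unicode general category of each character (Python's unicodedata
module does this here). The function name is hypothetical, and dropping
every Cf character is a simplification: a real renderer would act on
characters such as the soft hyphen rather than discard them.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, general category Mn

def strip_invisible_controls(text):
    """Drop format characters (general category Cf, e.g. U+2060 WORD
    JOINER and U+FEFF) and CGJ before any font lookup takes place."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cf" and ch != CGJ
    )

print(strip_invisible_controls("a\u2060b\u034Fc"))  # abc
```

No glyph information is consulted anywhere, so nothing here needs to
wait for a new OS version or a new font.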

> Actually, given a bare-bones dumb implementation of UAX#14,
> which is not difficult to implement:
> [?&#xfeff;%&#xfeff;x-1&#xfeff;+2]
> would suffice.

Perhaps. My point was that browsers implement line breaking rules so
poorly that if you're about to throw those control characters into your
markup, making it virtually write-only, you might just as well put them
everywhere to guard against browser bugs.
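
For what it's worth, "everywhere" is easy enough to generate by program,
even if nobody would want to read the result. A sketch, with
hypothetical helper names, that joins every adjacent pair of characters
and prints the result using hexadecimal character references (it does
not escape "&" or "<", so it is not a general HTML serializer):

```python
WJ = "\u2060"  # WORD JOINER; "\uFEFF" can be passed for older software

def glue_everywhere(text, joiner=WJ):
    """Put a joiner between every pair of adjacent characters."""
    return joiner.join(text)

def show_refs(text):
    """Show non-ASCII characters as hexadecimal character references."""
    return "".join(ch if ord(ch) < 128 else "&#x%x;" % ord(ch) for ch in text)

print(show_refs(glue_everywhere("[?%x-1+2]", "\uFEFF")))
```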

> However, for such a case as you gave, something like:
> <code>[?%x-1+2]</code>
> with the exact desired presentation being supplied by CSS is likely
> to be preferable.  Anything that needs that much glue, probably needs
> it for a reason attributable to the desired presentation of a semantic
> element such as <code>, <a>, or even <span class="someclass">.

None of those markup elements implies non-breaking behavior in any way.
You are now suggesting that the non-breakability is a purely
presentational issue. Would this extend, for example, to the statement
"the URL-encoded form of a space is %20"? That is, would it be quite OK,
except for esthetics, to insert a line break after the "%" character?
Using <code>%20</code> (in addition to being perhaps debatable
semantically) would not say that %20 must not be broken.

> The case that the original querent inquired about was for a justifying
> non-breaking space.  For this, the simple case of WJ SP WJ is easy
> to code.

To some people, maybe. But in the vast majority of browsing environments,
it produces just a mess.
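
The coding itself is indeed trivial, as this sketch shows (the function
name is hypothetical); the trouble lies in what browsers do with the
output, which no amount of code on the authoring side can fix.

```python
NBSP = "\u00A0"  # NO-BREAK SPACE
WJ = "\u2060"    # WORD JOINER

def justifying_nbsp(text):
    """Replace each NO-BREAK SPACE by the sequence WJ SP WJ."""
    return text.replace(NBSP, WJ + " " + WJ)

print(repr(justifying_nbsp("10\u00A0km")))
```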

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Sunday, 29 February 2004 13:47:48 UTC