RE: <NOBR> - Returning to the question ( 2 ) from Ernest Cline on 2004-03-01 (www-html@w3.org from March 2004)

From: Ernest Cline <ernestcline@mindspring.com>
Date: Mon, 1 Mar 2004 03:00:33 -0500
To: "Jukka K. Korpela" <jkorpela@cs.tut.fi>, www-html@w3.org
Message-ID: <410-22004311803346@mindspring.com>
> [Original Message]
> From: Jukka K. Korpela <jkorpela@cs.tut.fi>

> > As for the example, I will note that if instead of U+002D HYPHEN-MINUS
> > one uses the unambiguous U+2212 MINUS SIGN, then applying the rules
> > of UAX#14 would prohibit that break.
>
> Yes, if those rules were correctly applied. The problem is that the minus
> sign has far more limited support than hyphen-minus (or the common
> surrogate, en dash, which Unicode defines as allowing a line break after
> it).

Limited support?
Perhaps in non-Unicode-based implementations, but I find it difficult
to believe that there is a Unicode-based HTML/XML implementation
that does not have U+2212 defined in at least one font available to it.

> > Yes, ideally, IE should perform
> > a contextual analysis to determine whether the hyphen-minus is acting
> > more like a hyphen (class BA) or a minus (class PR)
>
> I hope IE won't try anything like that in the next few decades. Let's hope
> it will first realize that a two-character string like "-a" shall not be
> broken no matter what. Analyzing what "-a" means is far beyond the scope
> of even the most advanced technologies at present.
>
> > If "-a"
> > occurred at the start of an element or following a space, I would
> > expect it to be treated as a minus, which would handle the common case.
>
> And you would guess wrong probably more often than not. Consider the
> simple statement
> "Many Latin neuter words have <i>-a</i> as their plural suffix."

When "-" is treated as a minus, "-a" shouldn't break, since LB 18
specifies PR x AL. Granted, technically it's an introductory hyphen in
the example you gave, which should also be treated as class PR.  Still,
until at least several years after this becomes normative behavior
(if it ever does), it would not be wise to depend upon it.
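
A minimal sketch of the heuristic under discussion, for illustration only.
It assumes a hyphen-minus at the start of the text or after a space acts
like class PR, uses str.isalpha() as a rough stand-in for class AL, and
treats every other hyphen-minus as an ordinary hyphen. The function names
are hypothetical and this is not a UAX #14 implementation.

```python
def hyphen_minus_class(text: str, i: int) -> str:
    """Guess the role of the hyphen-minus at index i.

    Returns 'PR' (minus-like) at the start of the text or after a space,
    otherwise 'BA' (hyphen-like). Purely an illustrative heuristic.
    """
    if text[i] != "-":
        raise ValueError("index does not point at U+002D HYPHEN-MINUS")
    if i == 0 or text[i - 1] == " ":
        return "PR"
    return "BA"


def break_allowed_after_hyphen(text: str, i: int) -> bool:
    """Return False for the 'PR x AL' case, True otherwise (simplified)."""
    following = text[i + 1] if i + 1 < len(text) else ""
    # isalpha() is only an approximation of line breaking class AL.
    if hyphen_minus_class(text, i) == "PR" and following.isalpha():
        return False
    return True
```

With these assumptions, "-a" and "x -a" are kept together, while the
hyphen in "well-known" still offers a break opportunity.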

We aren't disagreeing too much on what should be done with what
is available in today's UA's; where we disagree is on what goal
should be set for the future.

> > I agree that given current implementations, it is not a good current
> > solution, but the use of the class GL characters does have a
> > normative effect for line breaking,
>
> I'm not sure which of the following arguments is more important, but
> when combined, I think they are rather convincing:
> 1. Unicode line breaking rules are a mess, and an attempt to solve
>    problems at a wrong protocol level, and they mostly just
>    _create_ problems - especially if taken blindly and not just
>    as a suggested "neutral base" to be modified by application needs.
> 2. Those rules have not actually been implemented in browsers, except
>    for a random subset and with lots of bugs. (This naturally relates
>    to their being a mess.)
> 3. They are extremely hard to understand and awkward to apply in
>    authoring.
>
> > Well, if there ever is an IE 7, I would expect it to fully support the
> > normative portions of UAX#14,
>
> I wouldn't.

Why? The normative portion isn't all that onerous, involving only
eighteen specific characters, the surrogates, the controls, and
the combining marks. None of the normative portion requires
any sort of contextual analysis to make the proper determination.
It's the non-normative guidance which often requires contextual
analysis to get things right that is the problem.  The entirety of the
rest of the characters have no normative breaking assigned to them.


> BTW, is there some specific requirement in some HTML or XML specification
> that says that user agents shall apply Unicode standard semantics?
> I thought the normative references pointed to ISO 10646 and did not
> include specific requirements concerning processing of characters
> in general. If such a requirement is made, I wonder how it should really
> be understood. For example, do pre-2002 specifications really mandate that
> WJ be treated as defined by the Unicode standard in 2002?

The XML specifications basically consider ISO 10646 and Unicode to be
much the same thing and include both as normative references.

XML 1.0 tied itself to a specific Unicode/ISO 10646 version, but with
each edition updated to the version current at that time. (1st edition
to Unicode 2.0, 2nd ed. to 3.0 and 3rd ed. to 3.1)  On the other hand,
XML 1.1 takes a different philosophy, and calls for following the
most up to date version of Unicode. Ever since expanding from
ISO-8859-1 to Unicode in HTML 4, HTML has referred to both
ISO 10646 and Unicode, and specifically refers to the bidirectional
text algorithm as coming from Unicode and not from ISO 10646.

(The character set was specified as coming from ISO 10646,
and IIRC at the time it was thought that there still might be the
possibility that Unicode would only be a portion of the
ISO 10646 character set.)

> > Opera supports this portion of the standard, but, it does
> > suffer from the same bug concerning the glyphs for CGJ and WJ
> > that IE does.
>
> I seriously doubt whether Opera's support is even close to the standard,
> but it suffices to state that if a browser does not adequately process the
> Unicode  control characters affecting line breaking, it fails to conform
> in a most essential way. This should not depend on font issues.
>
> >  Given that these two characters were only added in
> > Unicode 3.2, I would expect that programs that rely entirely upon
> > the OS for glyph information will have to wait until new OS versions
> > that are aware of these characters are released to work correctly.
>
> Pardon? How would glyphs affect an issue that revolves around the
> principle of treating certain characters as invisible control codes?
> Surely the most trivial part of support to them would be to refrain from
> any attempts to display them using any font.

The line breaking characteristics of characters are independent of
whether they have a visible representation.  Not all class GL
characters are invisible.  Special casing the invisible class GL
characters would provide a more robust implementation, but
it would also slow it down needlessly if the OS and available
fonts handled the display correctly.  I can understand Opera
deciding that this should be a system and not an application issue,
although, given the reality of what is as opposed to what should be,
it should special case it until the OSes that it is intended to run on
correctly handle it.
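
As a rough sketch of what such special casing might look like: the
invisible glue characters are removed from the text handed to the font
system, while the positions where they stood are remembered as
"no break here". The character set below (WJ, ZWNBSP, CGJ) is an
assumption drawn from this thread, not an exhaustive list, and the
function name is hypothetical.

```python
# Assumed set of invisible class GL characters (illustrative only):
# U+2060 WORD JOINER, U+FEFF ZERO WIDTH NO-BREAK SPACE,
# U+034F COMBINING GRAPHEME JOINER.
INVISIBLE_GLUE = frozenset("\u2060\ufeff\u034f")


def strip_invisible_glue(text):
    """Split text into what is displayed and where breaks are forbidden.

    Returns (visible, no_break_at), where no_break_at holds indices p
    such that no line break may occur between visible[p - 1] and
    visible[p].
    """
    visible = []
    no_break_at = set()
    for ch in text:
        if ch in INVISIBLE_GLUE:
            # Never pass the character to a font; only record its effect.
            no_break_at.add(len(visible))
        else:
            visible.append(ch)
    return "".join(visible), no_break_at
```

The cost is one extra pass over the text, which is the slowdown
mentioned above; the benefit is that no glyph lookup is ever attempted
for these characters.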

> > Actually, given a bare-bones dumb implementation of UAX#14,
> > which is not difficult to implement:
> > [?&#xfeff;%&#xfeff;x-1&#xfeff;+2]
> > would suffice.
>
> Perhaps. My point was that browsers implement line breaking rules so
> poorly that if you're about to throw those control characters into your
> markup, making it virtually write-only, you might just as well put them
> everywhere to guard against browser bugs.
>
> > However, for such a case as you gave, something like:
> > <code>[?%x-1+2]</code>
> > with the exact desired presentation being supplied by CSS is likely
> > to be preferable.  Anything that needs that much glue, probably needs
> > it for a reason attributable to the desired presentation of a semantic
> > element such as <code>, <a>, or even <span class="someclass">.
>
> None of those markup elements implies non-breaking behavior in any way.
> You are now suggesting that the non-breakability is a purely
> presentational issue. Would this extend, for example, to the statement
> "the URL-encoded form of a space is %20"? That is, would it be quite OK,
> except for esthetics, to insert a line break after the "%" character?
> Using <code>%20</code> (in addition to being perhaps debatable
> semantically) would not say that %20 must not be broken.

I would say that <code> is certainly appropriate for such an example.
As for indicating that you want your codes to not use line breaks,
I'd say that is a job for styling.  If <nobr> belongs in, then so do
other presentational elements such as <b> and <i>.

> > The case that the original querent inquired about was for a justifying
> > non-breaking space.  For this, the simple case of WJ SP WJ is easy
> > to code.
>
> To some people, maybe. But in the vast majority of browsing environments,
> it produces just a mess.

At present, with current user agents, it does. In a few years I think that
it will be safe to say that for the majority of users it will be a usable
solution,
and in about five years only people using antiquated software will
need to worry about it, except that such software won't be able to handle
XHTML 2 in the first place. I do not consider it unreasonable to assume
that the normative portions of UAX #14 will also be implemented by those
user agents that can implement XHTML2.  The real fly in the ointment will
be those user agents that rely upon the OS for handling the rendering
of U+2060 as those agents will suffer from the problems we have both
described so verbosely unless the OS is updated to handle it correctly.
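
For reference, a minimal sketch of generating the WJ SP WJ sequence
mentioned above. The helper names are hypothetical, and whether the
result renders and breaks as intended depends entirely on the user
agent and OS handling U+2060 correctly.

```python
WJ = "\u2060"  # U+2060 WORD JOINER
SP = "\u0020"  # U+0020 SPACE, which remains stretchable for justification


def justifying_no_break_space():
    """Return the three-character sequence WJ SP WJ."""
    return WJ + SP + WJ


def join_without_breaks(words):
    """Join words with WJ SP WJ so no break is permitted between them."""
    return justifying_no_break_space().join(words)
```

For example, join_without_breaks(["10", "km"]) yields "10", WJ, SP, WJ,
"km" as a single string.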
Received on Monday, 1 March 2004 03:00:36 UTC